google / xarray-beam

Distributed Xarray with Apache Beam
https://xarray-beam.readthedocs.io
Apache License 2.0

Indicate variables in xarray-beam keys #9

Closed shoyer closed 2 years ago

shoyer commented 3 years ago

Currently, we identify chunks only by overall offsets along each dimension. This works OK, but hits scalability limits for some pipelines, such as the ERA5 rechunking example in https://github.com/google/xarray-beam/pull/8.

It would be nice to have a SplitVariables() transform that applies a pipeline in parallel to each data variable in a Dataset.
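As a rough illustration of the idea (not the xarray-beam implementation), splitting one chunk per data variable could look like the sketch below. It uses plain dicts in place of xarray Datasets, and `split_variables` and the tuple-based chunk identifier are hypothetical names chosen for this example only.

```python
from typing import Any, Dict, FrozenSet, Iterator, Mapping, Tuple

# Hypothetical chunk identifier: sorted (dimension, offset) pairs plus variable names.
ChunkId = Tuple[Tuple[Tuple[str, int], ...], FrozenSet[str]]


def split_variables(
    offsets: Mapping[str, int],
    chunk: Mapping[str, Any],
) -> Iterator[Tuple[ChunkId, Dict[str, Any]]]:
    """Yield one (identifier, single-variable chunk) pair per variable."""
    base = tuple(sorted(offsets.items()))
    for name, values in chunk.items():
        # Each output carries the same offsets but names exactly one variable,
        # so downstream steps can process the variables independently.
        yield (base, frozenset({name})), {name: values}


pieces = list(split_variables({'time': 0}, {'u': [1, 2], 'v': [3, 4]}))
```

In a real pipeline the same logic would presumably sit inside a Beam transform, so that each single-variable chunk becomes its own element.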

To do so, we need some consistent way to identify a limited set of variables, not just chunk offsets. I propose to do so using a new Key class modeled on the existing ChunkKey:
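A minimal sketch of what such a class might look like, assuming it stores per-dimension offsets plus an optional set of variable names. The attribute names `offsets` and `vars`, and the convention that `None` means "all variables", are assumptions made for illustration rather than the definition from this proposal.

```python
from typing import Iterable, Mapping, Optional


class Key:
    """Hypothetical sketch: chunk offsets plus an optional set of variables."""

    def __init__(
        self,
        offsets: Optional[Mapping[str, int]] = None,
        vars: Optional[Iterable[str]] = None,
    ):
        self.offsets = dict(offsets or {})
        # Assumption: None stands for "all variables in the Dataset".
        self.vars = None if vars is None else frozenset(vars)

    def __or__(self, other: Mapping[str, int]) -> "Key":
        # Merge in new offsets and keep the variable set unchanged.
        return Key({**self.offsets, **other}, self.vars)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Key):
            return NotImplemented
        return self.offsets == other.offsets and self.vars == other.vars

    def __hash__(self) -> int:
        # Hash on immutable views so keys can be used for grouping.
        return hash((frozenset(self.offsets.items()), self.vars))

    def __repr__(self) -> str:
        return f"Key(offsets={self.offsets!r}, vars={self.vars!r})"


key = Key({'x': 10}, vars={'temperature'})
shifted = key | {'time': 0}  # adds a 'time' offset, keeps the variables
```

Here `|` returns a new key rather than mutating the original, which keeps keys safe to reuse across pipeline stages.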

Key should support most of the user-facing API of ChunkKey, e.g., key | {'time': 0} should still work. However: