Currently, we identify chunks only by overall offsets along each dimension. This works OK, but hits scalability limits for some pipelines, such as the ERA5 rechunking example in https://github.com/google/xarray-beam/pull/8.
It would be nice to have a SplitVariables() transform that applies a pipeline in parallel to each data variable in a Dataset.
To do so, we need a consistent way to identify a limited set of variables, not just chunk offsets. I propose a new Key class modeled on the existing ChunkKey:
Key(offsets={'x': 0, 'y': 1}, vars={'foo'}) indicates a chunk of a dataset at positional offset x=0, y=1 and with only the variable foo.
Key(offsets={'x': 0, 'y': 1}, vars=None) indicates that variables are not split.
Key(offsets=None, vars={'foo'}) or Key(offsets={}, vars={'foo'}) indicates that dimensions are not split.
Key should support most of the user-facing API of ChunkKey, e.g., key | {'time': 0} should still work. However:
Key is now a frozen dataclass consisting of a frozen dict and a frozen set (rather than a mapping itself), so key[dim] will have to become key.offsets[dim].
Key.to_slices doesn't really make sense (it could apply only to some variables).
To support modification without mutation, we'll add a new replace() method, e.g., key.replace(vars=None).
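A minimal sketch of what such a Key could look like, using only the standard library. This is an illustration of the proposal, not the actual implementation: the field name offsets follows key.offsets[dim] above, the use of types.MappingProxyType as the "frozen dict", the _UNSET sentinel, and the _astuple helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import AbstractSet, Mapping, Optional

# Hypothetical sentinel so that replace(vars=None) can mean "reset vars to None".
_UNSET = object()


@dataclass(frozen=True, eq=False)
class Key:
    """Sketch of a chunk key: dimension offsets plus an optional set of variables."""

    offsets: Optional[Mapping[str, int]] = None
    vars: Optional[AbstractSet[str]] = None

    def __post_init__(self):
        # Treat offsets=None like offsets={} and store read-only copies.
        frozen_offsets = MappingProxyType(dict(self.offsets or {}))
        frozen_vars = None if self.vars is None else frozenset(self.vars)
        # A frozen dataclass blocks normal assignment, so bypass it here.
        object.__setattr__(self, "offsets", frozen_offsets)
        object.__setattr__(self, "vars", frozen_vars)

    def _astuple(self):
        # Hashable, order-independent view used for equality and hashing.
        return (tuple(sorted(self.offsets.items())), self.vars)

    def __eq__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        return self._astuple() == other._astuple()

    def __hash__(self):
        return hash(self._astuple())

    def replace(self, offsets=_UNSET, vars=_UNSET):
        """Return a new Key with the given fields replaced."""
        new_offsets = self.offsets if offsets is _UNSET else offsets
        new_vars = self.vars if vars is _UNSET else vars
        return Key(new_offsets, new_vars)

    def __or__(self, other):
        # key | {'time': 0} adds or overrides offsets and keeps vars unchanged.
        return self.replace(offsets={**self.offsets, **other})


key = Key(offsets={"x": 0, "y": 1}, vars={"foo"})
with_time = key | {"time": 0}        # new Key; the original is untouched
all_vars = key.replace(vars=None)    # same offsets, variables not split
x_offset = key.offsets["x"]          # replaces the old key["x"]
```

Because both fields are immutable and the key is hashable, instances can be used as dictionary keys or compared for equality, which is what a keyed parallel pipeline would typically need.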