google / xarray-beam

Distributed Xarray with Apache Beam
https://xarray-beam.readthedocs.io
Apache License 2.0

Consider adding ZarrToChunks() and/or an open_zarr() helper function #26

Closed (shoyer closed this issue 2 years ago)

shoyer commented 3 years ago

These could facilitate directly opening data from Zarr using idiomatic patterns in Xarray-Beam (e.g., using Xarray's lazy indexing machinery instead of dask).

I'm imagining open_zarr() returning a tuple of values (transform, template, chunks), providing exactly the information needed to use the dataset in a Zarr-to-Zarr pipeline.

Usage examples:

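# Option 1: a ZarrToChunks() transform that reads a Zarr store into chunks
with beam.Pipeline() as p:
  p | xbeam.ZarrToChunks(..., desired_chunks) | ...

# Option 2: an open_zarr() helper returning (transform, template, chunks)
with beam.Pipeline() as p:
  load_data, template, original_chunks = xbeam.open_zarr(...)
  p | load_data | beam.MapTuple(...) | xbeam.ChunksToZarr(..., template, original_chunks)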
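
As a rough illustration of what a ZarrToChunks()-style transform would have to enumerate, the sketch below lists the starting offset of every chunk given a set of dimension sizes and desired chunk sizes. It uses only the Python standard library and does not touch Zarr or Beam. The name iter_chunk_offsets and its arguments are hypothetical, chosen for illustration, and are not part of Xarray-Beam's API.

```python
import itertools
from typing import Dict, Iterator, Mapping


def iter_chunk_offsets(
    sizes: Mapping[str, int], chunks: Mapping[str, int]
) -> Iterator[Dict[str, int]]:
    """Yield the starting offset of each chunk along every dimension.

    Hypothetical helper: dimensions missing from `chunks` are treated as a
    single chunk spanning the whole dimension.
    """
    dims = list(sizes)
    # One range of start offsets per dimension; max(1, ...) avoids a zero step
    # for empty dimensions.
    starts = [
        range(0, sizes[dim], max(1, chunks.get(dim, sizes[dim])))
        for dim in dims
    ]
    # The Cartesian product of per-dimension starts gives one entry per chunk.
    for combo in itertools.product(*starts):
        yield dict(zip(dims, combo))


sizes = {"time": 10, "lat": 4}
offsets = list(iter_chunk_offsets(sizes, {"time": 4}))
print(offsets)
```

Each yielded dictionary identifies one chunk by where it starts in the full dataset, which is the kind of information a pipeline stage needs to load or write that chunk independently of the others.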