pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Initialise zarr metadata without computing dask graph #6084

Open dougiesquire opened 2 years ago

dougiesquire commented 2 years ago

Is your feature request related to a problem? Please describe. When writing large zarr stores, the xarray docs recommend first creating an initial Zarr store without writing all of its array data. The recommended approach is to first create a dummy dask-backed Dataset, and then call to_zarr with compute=False to write only metadata to Zarr. This works great.

It seems that in one common use case for this approach (including the example in the above docs), the entire dataset to be written to zarr is already represented in a Dataset (let's call this ds). Thus, rather than creating a dummy Dataset with exactly the same metadata as ds, it is more convenient to initialise the Zarr store with ds.to_zarr(..., compute=False). See for example:

- https://discourse.pangeo.io/t/many-netcdf-to-single-zarr-store-using-concurrent-futures/2029
- https://discourse.pangeo.io/t/map-blocks-and-to-zarr-region/2019
- https://discourse.pangeo.io/t/netcdf-to-zarr-best-practices/1119/12
- https://discourse.pangeo.io/t/best-practice-for-memory-management-to-iteratively-write-a-large-dataset-with-xarray/1989

However, calling to_zarr with compute=False still computes the dask graph for writing the Zarr store. The graph is never used in this use-case, but computing the graph can take a really long time for large graphs.

Describe the solution you'd like Is there scope to add an option to to_zarr to initialise the store without computing the dask graph? Or perhaps an initialise_zarr method would be cleaner?

shoyer commented 2 years ago

The challenge is that Xarray needs some way to represent the "schema" for the desired entire dataset. I'm very open to alternatives, but so far, the most convenient way to do this has been to load Dask arrays into an xarray.Dataset.

It's worth noting that any dask arrays with the desired chunking scheme will do -- you don't need to use the same dask arrays that you want to compute. When I do this sort of thing, I will often use xarray.zeros_like() to create low overhead versions of dask arrays, e.g., in this example from Xarray-Beam: https://github.com/google/xarray-beam/blob/0.2.0/examples/era5_climatology.py#L61-L68

dcherian commented 2 years ago

What metadata is being determined by computing the whole array?

dougiesquire commented 2 years ago

Thanks @shoyer. I understand the need for the schema, but is there a need to actually generate the dask graph when all the user wants to do is initialise an empty zarr store? E.g., I think skipping this line would save some of the users in my original post a lot of time.

Regardless, your suggestion to just create a low-overhead version of the array being initialised is probably better/cleaner than adding a specific option or method. Would it be worth adding the xarray.zeros_like(ds) recommendation to the docs?

shoyer commented 2 years ago

> E.g., I think skipping this line would save some of the users in my original post a lot of time.

I don't think that line adds any measurable overhead. It's just telling dask to delay computation of a single function.

For sure this would be worth elaborating on in the Xarray docs! I wrote a little bit about this in the docs for Xarray-Beam: see "One recommended pattern" in https://xarray-beam.readthedocs.io/en/latest/read-write.html#writing-data-to-zarr

sehoffmann commented 5 months ago

As a user, I find the topic very unclear and would hope there were a clear and concise way to do this in the future. In essence, I should only be expected to pass in my coords, dims, shapes and chunks and let xarray handle the rest, similar to how the xr.Dataset or xr.DataArray constructor already needs this meta information from me. This could also be an extra function such as xr.create_zarr(shapes, dims, coords, chunks) for instance.

In general, incrementally writing a zarr array with xarray seems very convoluted in my opinion, especially compared with the plain Python zarr API.

@dougiesquire

> Regardless, your suggestion to just create a low-overhead version of the array being initialised is probably better/cleaner than adding a specific option or method. Would it be worth adding the xarray.zeros_like(ds) recommendation to the docs?

zeros_like() allocates memory, doesn't it? Same with empty_like. If you are in the business of incrementally writing data to disk, it's usually because memory is not big enough anymore. So this is unfortunately not an option for me.

max-sixty commented 5 months ago

FYI I think https://github.com/pydata/xarray/pull/8460 should solve most of this. Or would anything remain?

> zeros_like() allocates memory, doesn't it? Same with empty_like. If you are in the business of incrementally writing data to disk, it's usually because memory is not big enough anymore. So this is unfortunately not an option for me.

If we create the array with chunks, then it doesn't allocate memory! There's more context in the linked PR / some links from there...
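To illustrate the point (the array size here is made up): calling `zeros_like` on a dask-backed array returns another lazy dask array with the same chunks, so no full-size buffer is materialised.

```python
import dask.array as da
import xarray as xr

# ~800 MB of float64 if materialised, but backed by lazy dask chunks.
big = xr.DataArray(
    da.zeros((10_000, 10_000), chunks=(1_000, 1_000)), dims=("y", "x")
)

template = xr.zeros_like(big)

# Still a lazy dask array with the same chunking: nothing allocated yet.
assert isinstance(template.data, da.Array)
assert template.chunks == big.chunks
```

Only a NumPy-backed input would cause `zeros_like` to allocate the full array up front.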