pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Add consolidated metadata to rechunked zarr #149

Open Metamess opened 6 months ago

Metamess commented 6 months ago

What is the issue? When opening a zarr with xarray, it really helps to have what is called "consolidated metadata". This is a single file (.zmetadata) at the root of the zarr, which combines all the information of all the various .zarray and .zattrs files inside the zarr. While this file is not required to exist in a valid zarr (nor in fact for the zarr to be opened by xarray), it does greatly speed up the process, especially when the data is being read from a remote location.

Sadly, Rechunker currently does not create such a consolidated metadata file for the resulting rechunked zarr. There is also (as far as I have been able to find, at least) no way to enable such behavior via an option parameter.

What would solve the issue? One of the following feature requests would be able to resolve this issue:

  1. Add an optional boolean parameter to the rechunk() function that lets the user specify that a consolidated metadata file should be created. For backwards compatibility, if that is desired, this parameter would default to False. Potential parameter names include consolidated (mirroring the parameter name in xarray), write_consolidated (to be more explicit that it only affects writing), or consolidate_metadata (mirroring the function in zarr).
  2. Automatically detect the existence of a consolidated metadata file (.zmetadata) in the source zarr, and create one (or not) in the result zarr accordingly.
  3. The combination of options 1 and 2. The parameter could default to a str value of "auto", resulting in the behavior as described in (2), or be given a boolean value by the user to override this behavior.
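
To make option 3 concrete, here is a minimal sketch of how the "auto" setting could be resolved. It is not existing rechunker code: the function name should_consolidate and the parameter name consolidated are hypothetical, and it assumes the source store behaves like a mapping of keys to bytes (as zarr v2 stores do), so a plain dict can stand in for it.

```python
from collections.abc import Mapping
from typing import Union


def should_consolidate(
    source_store: Mapping,
    consolidated: Union[bool, str] = "auto",
) -> bool:
    """Decide whether the rechunked target should get a .zmetadata file.

    Hypothetical helper, not part of rechunker. `source_store` is assumed
    to be mapping-like, with ".zmetadata" as a top-level key when the
    source zarr has consolidated metadata.
    """
    # An explicit True/False from the user always wins (option 1).
    if isinstance(consolidated, bool):
        return consolidated
    # "auto" mirrors whatever the source zarr does (option 2).
    if consolidated == "auto":
        return ".zmetadata" in source_store
    raise ValueError(
        f"consolidated must be True, False or 'auto', got {consolidated!r}"
    )


# A plain dict stands in for a source store in this sketch.
source_with_metadata = {".zgroup": b"{}", ".zmetadata": b"{}"}
source_without_metadata = {".zgroup": b"{}"}

print(should_consolidate(source_with_metadata))
print(should_consolidate(source_without_metadata))
print(should_consolidate(source_with_metadata, consolidated=False))
```

When the result is True, the consolidation step itself would run on the target store once rechunking has finished.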

I look forward to hearing what people think about this feature request, and to know if others would also like to see this feature added!

rabernat commented 6 months ago

Thanks for this suggestion @Metamess!

Creating consolidated metadata after rechunking is done is a one-line operation, e.g.

zarr.consolidate_metadata(target_store)

(https://zarr.readthedocs.io/en/stable/tutorial.html#consolidating-metadata)

This can be run on the target store after the rechunking is complete. Would that meet your needs?

Metamess commented 6 months ago

Hey @rabernat , thanks for the reply!

I did already know that you can manually create the consolidated metadata like this afterwards, and it is in fact what I am currently using to work around this issue! But it is a step that I would expect to be possible as part of the rechunk operation. In advocating for this feature, I consider the following:

What do you think? Is it worthwhile as a feature in rechunker?