pangeo-forge / cmip6-pipeline

Pipeline for cloud-based CMIP6 data ingestion
Apache License 2.0

Consolidating metadata in Zarr holdings #13

Open · agstephens opened this issue 3 years ago

agstephens commented 3 years ago

Hi @naomi-henderson,

In our CEDA pipeline for CMIP6 cloud data, we do the following:

  1. Read NetCDF files from local disk
  2. Define xarray/dask/zarr chunks
  3. Write directly to object store

We are using the consolidated=True option when we write the data, which is supposed to gather all of the small metadata files into a single consolidated metadata file (.zmetadata).
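For reference, here is a minimal sketch of the write step as we currently run it (the bucket path, chunk sizes and credentials handling are placeholders, not our production code):

    import xarray as xr
    import fsspec

    # 1. read the NetCDF files from local disk
    ds = xr.open_mfdataset("/local/path/to/files/*.nc", combine="by_coords")

    # 2. define the xarray/dask chunking we want in the zarr store
    ds = ds.chunk({"time": 240})

    # 3. write directly to object store, consolidating metadata at the end
    store = fsspec.get_mapper("s3://example-bucket/cmip6/example.zarr")
    ds.to_zarr(store, mode="w", consolidated=True)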

The code does consolidate the metadata, but it also writes all of the separate small metadata files to the object store before consolidating them. We think this is slowing down our writes.

In your workflow, I think you write the data to POSIX disk, then copy the files into object store. When you do this, do you delete all the extra small metadata files and only preserve the consolidated file?

Also, do you have an idea of the write speeds that you get to object store?

Thanks

naomi-henderson commented 3 years ago

Hi @agstephens, I use the consolidated=True option just to add an additional metadata file, .zmetadata, so that anyone who passes the optional consolidated=True when reading the zarr store does not need to access all of the little files. I do not delete the other .z* files, so that the store can still be read with consolidated=False.
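To illustrate (the bucket path here is just a placeholder), the same store can then be opened either way:

    import xarray as xr
    import fsspec

    store = fsspec.get_mapper("gs://example-bucket/CMIP6/example.zarr")

    # fast path: one read of .zmetadata instead of many small .zattrs/.zarray reads
    ds = xr.open_zarr(store, consolidated=True)

    # still works, because the little metadata files are left in place
    ds = xr.open_zarr(store, consolidated=False)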

The write speeds I get from my posix machine at Columbia to Google Cloud vary quite a bit, but I usually get about 25-35 MiB/s. I am using the -m option of the Google command-line tool gsutil, so the speed depends on the number of threads and parallel processes set in my .boto file. I have cut the defaults (the lines commented out with # below) down:

    #parallel_process_count = 32
    #parallel_thread_count = 5
    parallel_process_count = 8
    parallel_thread_count = 1

because I am running multiple notebooks simultaneously and don't want trouble. So I guess it could be much faster ...

naomi-henderson commented 3 years ago

I never really thought about it much, but leaving, for example, the time/.zattrs files in place would allow a more efficient check on the calendar attributes of the zarr stores. So it is likely that someone is already assuming the presence of all of these little files, and removing them would cause trouble.
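For example, something along these lines (the bucket path is made up, and it assumes the calendar was written into the time variable's attributes):

    import json
    import fsspec

    # read just one small JSON file instead of opening the whole store
    with fsspec.open("gs://example-bucket/CMIP6/example.zarr/time/.zattrs") as f:
        attrs = json.load(f)

    print(attrs.get("calendar"))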

agstephens commented 3 years ago

Thanks @naomi-henderson, that's useful to know. I wondered whether our CEDA pipeline was being significantly slowed down by writing all the small files. We will do more testing :-)

rabernat commented 3 years ago

I would be very surprised if the cost of writing these tiny metadata files could compare in any way to the cost of writing the chunks themselves.

To speed up metadata access, zarr allows you to store chunks and metadata in separate storage. So for example you could put all the metadata into a database like mongodb and keep the chunks in object storage.

agstephens commented 3 years ago

Hi @rabernat, do you know of any examples of splitting Zarr content between different physical storage systems?

rabernat commented 3 years ago

No, but it would not be hard to cook up. You just initialize a store with separate store and chunk_store arguments: https://zarr.readthedocs.io/en/stable/api/hierarchy.html#module-zarr.hierarchy
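Roughly, with the zarr v2 API (the paths here are invented, and any MutableMapping-style store would work for either argument):

    import zarr
    import fsspec

    # metadata (.zgroup/.zarray/.zattrs) lives in one store ...
    meta_store = zarr.DirectoryStore("/local/metadata/example.zarr")

    # ... while the chunk data lives in object storage
    data_store = fsspec.get_mapper("gs://example-bucket/CMIP6/example.zarr")

    group = zarr.open_group(store=meta_store, chunk_store=data_store, mode="a")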

This could actually be a great way to do search. Put all the metadata for your library into a database and the actual data into object storage.

agstephens commented 3 years ago

@rabernat thanks for your thoughts on this. As you say, there might be ways to use this for optimising performance. Tagging: @philipkershaw @rsmith013