agstephens opened this issue 3 years ago
Hi @agstephens,
I use the `consolidated=True` option just to add one additional metadata file, `.zmetadata`, so that anyone who uses the optional `consolidated=True` when reading the zarr store will not need to access all of the little files. I do not delete the other `.z*` files, so the store can still be read with `consolidated=False`.
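A minimal sketch of that layout, assuming the zarr v2 and xarray APIs (the bucket path here is a placeholder):

```python
import fsspec
import xarray as xr
import zarr

# Placeholder store location -- substitute a real bucket/path.
store = fsspec.get_mapper("gs://my-bucket/CMIP6/example.zarr")

# Adds the single .zmetadata key alongside the existing .zgroup/.zarray/.zattrs keys.
zarr.consolidate_metadata(store)

# Both read modes still work, because the per-array metadata files are left in place:
ds_fast = xr.open_zarr(store, consolidated=True)    # one read of .zmetadata
ds_slow = xr.open_zarr(store, consolidated=False)   # walks the individual .z* keys
```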
The write speeds I get from my POSIX machine at Columbia to Google Cloud vary quite a bit, but I usually get about 25-35 MiB/s. I am using the `-m` option to the Google command line tool `gsutil`, so the speed depends on the number of threads and parallel processes set in my `.boto` file. I have cut the defaults (the commented-out `#` lines) down:
#parallel_process_count = 32
#parallel_thread_count = 5
parallel_process_count = 8
parallel_thread_count = 1
because I am running multiple notebooks simultaneously and don't want trouble. So I guess it could be much faster ...
I never really thought about it much, but leaving, for example, the `time/.zattrs` files in place would allow a more efficient check on the `calendar` attributes of the zarr stores. So it is likely that someone is already assuming the presence of all of these little files, and removing them would cause trouble.
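For example, a quick check like the following (hypothetical store path) only has to fetch that one small file:

```python
import json
import fsspec

# Hypothetical store path; read just the time coordinate's attribute file.
store = fsspec.get_mapper("gs://my-bucket/CMIP6/example.zarr")
time_attrs = json.loads(store["time/.zattrs"])
print(time_attrs.get("calendar"))   # e.g. "noleap" or "360_day"
```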
Thanks @naomi-henderson, that's useful to know. I wondered if our CEDA pipeline was significantly slowed down by writing all the small files. We will do more testing :-)
I would be very surprised if the cost of writing these tiny metadata files could compare in any way to the cost of writing the chunks themselves.
To speed up metadata access, zarr allows you to store chunks and metadata in separate storage. So for example you could put all the metadata into a database like mongodb and keep the chunks in object storage.
Hi @rabernat, do you know of any examples of splitting Zarr content between different physical storage systems?
No, but it would not be hard to cook up. You just initialize a store with separate `store` and `chunk_store` arguments: https://zarr.readthedocs.io/en/stable/api/hierarchy.html#module-zarr.hierarchy
This could actually be a great way to do search. Put all the metadata for your library into a database and the actual data into object storage.
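A rough sketch of that split, using the zarr v2 API and placeholder store locations (a real setup might use something like `zarr.storage.MongoDBStore` on the metadata side):

```python
import fsspec
import zarr

# Small metadata keys (.zgroup/.zarray/.zattrs) go to one store,
# the large chunk keys go to another (here: object storage via fsspec).
meta_store = zarr.DirectoryStore("metadata-only.zarr")
chunk_store = fsspec.get_mapper("gs://my-bucket/chunks.zarr")   # placeholder

root = zarr.open_group(store=meta_store, chunk_store=chunk_store, mode="a")
tas = root.create_dataset("tas", shape=(100, 100), chunks=(50, 50), dtype="f4")
tas[:] = 1.0   # chunk writes land in chunk_store, metadata writes in meta_store

# Reading back uses the same pair of stores.
root2 = zarr.open_group(store=meta_store, chunk_store=chunk_store, mode="r")
print(root2["tas"][:5, :5])
```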
@rabernat thanks for your thoughts on this. As you say, there might be ways to use this for optimising performance. Tagging: @philipkershaw @rsmith013
Hi @naomi-henderson,
In our CEDA pipeline for CMIP6 cloud data, we do the following:
We are using the `consolidated=True` option when we write the data, which is supposed to group all the (small) metadata files into one large consolidated metadata file. The code does consolidate the metadata, but it also writes all the separate small metadata files to object store before the consolidation. We think this is slowing down our write processes.
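For reference, the write step in question is roughly the following (the input file and target bucket are placeholders):

```python
import fsspec
import xarray as xr

ds = xr.open_dataset("some_cmip6_file.nc")                    # placeholder input
target = fsspec.get_mapper("s3://our-bucket/example.zarr")    # placeholder object-store target

# Even with consolidated=True, the individual .zgroup/.zarray/.zattrs keys are
# written to the target first; the combined .zmetadata key is added at the end.
ds.to_zarr(target, mode="w", consolidated=True)
```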
In your workflow, I think you write the data to POSIX disk, then copy the files into object store. When you do this, do you delete all the extra small metadata files and only preserve the consolidated file?
Also, do you have an idea of the write speeds that you get to object store?
Thanks