slevang opened this issue 2 months ago
`Dataset.to_zarr()` is also very slow for Datasets with many variables.
We really need support for concurrency in writing Zarr metadata, which I believe may be coming in Zarr Python 3 (cc @jhamman)
Yes, exactly right. We have built Zarr Python 3's interface around asyncio, which will allow us to initialize hierarchies with much more concurrency than we do now. We haven't built the API for this into Zarr yet, though, but I think @d-v-b has experimented with this already.
What is your issue?
Repost of https://github.com/xarray-contrib/datatree/issues/277, with some updates.
Test case
Write a tree containing 13 nodes and negligible data to S3/GCS with fsspec:
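The original snippet isn't reproduced here, but a minimal sketch of this kind of test case (bucket path and group names are placeholders, assuming the current xarray API where `DataTree` is exposed at the top level) would look something like:

```python
import time

import numpy as np
import xarray as xr

# Build a tree of 13 nodes (root + 12 children), each holding a tiny variable.
ds = xr.Dataset({"x": ("t", np.arange(10))})
dt = xr.DataTree.from_dict({f"/group_{i}": ds.copy() for i in range(12)})

# Write to object storage through fsspec; the bucket path is a placeholder.
start = time.perf_counter()
dt.to_zarr("gs://my-bucket/datatree-test.zarr", mode="w")
print(f"write took {time.perf_counter() - start:.1f}s")
```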
Gives:
This is a bit better than in the original issue due to improvements elsewhere in the stack, but still really slow for heavily nested but otherwise small datasets.
Potential Improvements
#9014 did make some decent improvements to read speed. When reading the dataset written above I get:
We'll need similar optimizations on the write side. The fundamental issue is that `DataTree.to_zarr` relies on serial `Dataset.to_zarr` calls for each node: https://github.com/pydata/xarray/blob/12c690f4bd72141798d7c3991a95abf88b5d76d3/xarray/core/datatree_io.py#L153-L171
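Paraphrasing the linked code (a rough sketch of its structure, not the exact source), every node pays for its own blocking `Dataset.to_zarr` round trip, and consolidation happens once at the end:

```python
import zarr

# Rough paraphrase of the serial write loop in xarray/core/datatree_io.py
# (see the permalink above); argument handling is simplified.
def _datatree_to_zarr_sketch(dt, store, mode="w-", consolidated=True, **kwargs):
    for node in dt.subtree:
        ds = node.to_dataset()
        # Each node triggers its own blocking round of fsspec calls
        # (directory listings, existence checks, small metadata PUTs).
        ds.to_zarr(store, group=node.path, mode=mode, consolidated=False, **kwargs)
        mode = "a"  # subsequent nodes append to the store created by the first write
    if consolidated:
        # One more full pass over the store's metadata at the end.
        zarr.consolidate_metadata(store)
```

Because nothing here runs concurrently, total write time grows roughly linearly with the number of nodes, regardless of how little data each node holds.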
This results in many `fsspec` calls to list dirs, check file existence, and put small metadata and attribute files in the bucket. Here's `snakeviz` on the example:

(The 8s block on the right is metadata consolidation)
Workaround
If your data is small enough to dump locally, this works great:
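A minimal sketch of that workaround (local path, bucket, and filesystem protocol are placeholders) is to write the tree to a local store and then push the whole directory in a single recursive fsspec transfer:

```python
import fsspec

# Write the whole tree to a local Zarr store first: no per-node network round-trips.
dt.to_zarr("datatree-test.zarr", mode="w")

# Then upload the directory in one recursive transfer (bucket path is a placeholder).
fs = fsspec.filesystem("gs")
fs.put("datatree-test.zarr", "my-bucket/datatree-test.zarr", recursive=True)
```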
Takes about 1s.