Closed: jbusecke closed this issue 4 years ago.
Ok, I think I found a solution. It was not immediately clear how to format the chunks argument for a group, so I wrote a little wrapper that temporarily opens the source store, reads the dimension sizes, and builds the per-variable chunks accordingly:
import xarray as xr
import zarr
from rechunker import rechunk

def rechunker_wrapper(source_store, target_store, temp_store,
                      chunks={'x': 180, 'y': 90, 'time': 60}):
    # open the source group, and also load the store as an xr.Dataset
    # to get the dimension names and sizes
    g = zarr.group(source_store)
    ds_chunk = xr.open_zarr(source_store)

    # pick the chunk sizes given above, defaulting to full-length chunks
    # for dimensions that are not listed in `chunks`
    group_chunks = {}
    for var in ds_chunk.variables:
        group_chunks[var] = tuple(
            chunks[di] if di in chunks else len(ds_chunk[di])
            for di in ds_chunk[var].dims
        )

    rechunked = rechunk(g, group_chunks, 500e6, target_store, temp_store=temp_store)
    rechunked.compute()
    zarr.convenience.consolidate_metadata(target_store)
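For reference, I call it roughly like this (re-using the store names from my example below; the chunk sizes are just what I happened to need):

rechunker_wrapper(
    'first_write.zarr',             # existing source store
    'first_write_rechunked.zarr',   # target store to be created
    'temp_store.zarr',              # intermediate scratch store
    chunks={'x': 180, 'y': 90, 'time': 60},
)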
This works pretty well for me now. Thanks again for this cool software!
EDIT: Holy cow, this just rechunked a 300GB control run in 1 minute!
@jbusecke - groups are supported. I just need to finish the documentation to explain the syntax. I'm literally working on it now.
Hi Julius--I just got some better docs up: https://rechunker.readthedocs.io/en/latest/tutorial.html
Let me know if you are able to figure out the group syntax.
Already did (good tests FTW!), but I'll take a look anyways!
Looks great, and I can eliminate some of the logic in my example above by passing a nested dict!
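If I read the tutorial correctly, something like this should work (a sketch using the po4 variable and the dimension names from my wrapper above, not the exact call):

# nested dict: one {dimension name: chunk size} mapping per variable in the group
target_chunks = {'po4': {'x': 180, 'y': 90, 'time': 60}}

rechunked = rechunk(g, target_chunks, 500e6,
                    'first_write_rechunked.zarr', temp_store='temp_store.zarr')
rechunked.compute()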
I'll close this for now.
I just tried rechunker for the first time. Thanks a lot for putting this together; I am very hopeful this will end my long nights trying to diagnose blown-up dask workers.

My use case
I want to use this package to preprocess a large number of local netCDF files into a unified chunking scheme. Previously I have encountered many different problems (like other users) with memory overflow and with quirks of xarray.to_zarr() chunking. Circumventing all that and preprocessing on a lower level seems like a great solution to these problems. I pulled the latest master and started some preliminary tests. My overall workflow looks something like this:
Identify netCDF files that belong together, load them with xr.open_mfdataset, and rechunk (usually into single time slices); a rough sketch of these steps follows the to_zarr call below. My original dataset used to create a test store looks like this:

Then save to a first zarr store (this usually works):
ds.to_zarr('first_write.zarr', mode='w')
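Spelled out, these first steps are roughly the following (the file pattern is a placeholder for my actual netCDF files):

import xarray as xr

# combine the netCDF files that belong together
ds = xr.open_mfdataset('run_*.nc', combine='by_coords')

# rechunk into single time slices
ds = ds.chunk({'time': 1})

# first write to zarr (this usually works)
ds.to_zarr('first_write.zarr', mode='w')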
I reload the zarr store:
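(Given the g.po4 reference further down, the reload was presumably something along these lines:)

g = zarr.open_group('first_write.zarr')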
Then I would like to use rechunk to rewrite this store with a predefined chunk structure. This, however, fails with the following error:
The rechunking works when I apply it to a single array like this:
rechunk(g.po4, chunks, 500e6, 'first_write_rechunked.zarr', temp_store='temp_store.zarr')
Am I misunderstanding how to specify the chunks?