rafa-guedes opened 5 years ago
This seems to be an ongoing problem (see "Unexpected behaviour when chunking with multiple netcdf files in xarray/dask" and "Performance of chunking in xarray / dask when opening and re-chunking a dataset") that has not been resolved, and no feedback has been provided.
I've been running into this problem trying to handle netCDF files that are larger than my RAM. From my testing, chunks must be passed to `open_mfdataset` to be of any use. Calling the `.chunk()` method on the dataset after opening seems to do nothing in this use case.
What happens is that dask first constructs chunks of the size specified in `open_mfdataset`, and then breaks those up into the new chunk sizes specified in the `.chunk()` call. Similar behaviour occurs with repeated chunk calls (`.chunk().chunk()`); these do not get optimized into a single chunk call yet. So yes, you should pass appropriate chunk sizes to `open_mfdataset`.
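The behaviour described above can be sketched at the dask level (a minimal illustration, using `da.ones` as a stand-in for the lazy file reads; the array sizes are arbitrary, not from the original report):

```python
import dask.array as da

# Stand-in for a lazy netCDF read: create the array directly with
# the final chunk size...
direct = da.ones((100,), chunks=10)

# ...versus creating it with coarse chunks (what open_mfdataset
# would set up) and then rechunking (what .chunk() does).
coarse = da.ones((100,), chunks=50)
split = coarse.rechunk(10)

# The final chunk structure is identical,
assert direct.chunks == split.chunks

# but the rechunked graph still contains the coarse creation tasks
# plus the splitting tasks layered on top, so it is strictly larger.
assert len(dict(split.__dask_graph__())) > len(dict(direct.__dask_graph__()))
```

The extra graph layer is why the rechunked version cannot match the performance of choosing the right chunks up front: every access still goes through the original coarse tasks.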
I was wondering if the chunking behaviour would be expected to be equivalent under two different use cases: (1) when opening a dataset using the `chunks` option; (2) when re-chunking an existing dataset using the `Dataset.chunk` method. I'm interested in performance when slicing across different dimensions. In my case the performance is quite different; please see the example below:
- Open the dataset with one single chunk along the `station` dimension (fast for slicing one time)
- Open the dataset with many size-1 chunks along the `station` dimension (fast for slicing one station, slow for slicing one time)
- Try re-chunking `station` into one single chunk (still slow for slicing one time)
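The three cases above can be sketched end-to-end (a minimal reproduction, assuming `xarray`, `dask`, and a netCDF backend are installed; the variable name `hs`, file layout, and array sizes are illustrative, not from the original report):

```python
import os
import tempfile

import numpy as np
import xarray as xr

with tempfile.TemporaryDirectory() as tmpdir:
    # Two small files split along the time dimension.
    paths = []
    for i in range(2):
        ds = xr.Dataset(
            {"hs": (("time", "station"), np.random.rand(4, 3))},
            coords={"time": np.arange(i * 4, (i + 1) * 4),
                    "station": np.arange(3)},
        )
        path = os.path.join(tmpdir, f"part{i}.nc")
        ds.to_netcdf(path)
        paths.append(path)

    # (1) One single chunk along station, set at open time.
    single = xr.open_mfdataset(paths, chunks={"station": 3},
                               combine="by_coords")

    # (2) Many size-1 chunks along station, set at open time.
    many = xr.open_mfdataset(paths, chunks={"station": 1},
                             combine="by_coords")

    # (3) Re-chunk the size-1 version back into one station chunk.
    # The reads are still per-station tasks; .chunk() only layers a
    # merge on top, so slicing one time still touches every tiny read.
    rechunked = many.chunk({"station": 3})

    # The final chunk shapes match, but the graphs do not:
    assert single.hs.data.chunks == rechunked.hs.data.chunks
    assert (len(dict(rechunked.hs.data.__dask_graph__()))
            > len(dict(single.hs.data.__dask_graph__())))
```

Cases (1) and (3) report identical chunk shapes, which is why the slow slicing in case (3) is surprising: the difference is hidden in the task graph, not in `.chunks`.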