Maybe use xarray rechunker: https://rechunker.readthedocs.io/en/latest/
See memory usage example: https://github.com/cehbrecht/jupyterlab-notebooks/blob/master/xarray-demo/memory-usage.ipynb
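A minimal rechunker sketch (the array shape, chunk sizes and zarr paths below are made up for illustration, not taken from the notebook):
import dask.array as da
from rechunker import rechunk

# Hypothetical source array: hourly data for one year on a 1-degree grid.
source = da.random.random((8760, 180, 360), chunks=(100, 180, 360))

# Plan a rechunk to time-contiguous chunks, keeping peak memory below max_mem.
plan = rechunk(
    source,
    target_chunks=(8760, 10, 10),
    max_mem='256MiB',
    target_store='rechunked.zarr',
    temp_store='rechunk-tmp.zarr',
)
plan.execute()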
The chunks are managed by dask. One can use the 'auto' chunk size option for one dimension, which will use the configured chunk size:
dask.config.get('array.chunk-size')   # returns '128MiB' by default
chunked_ds = ds.chunk({'time': 'auto'})   # let dask pick the 'time' chunk length from array.chunk-size
chunked_ds.ta.unify_chunks()              # unify chunk sizes across the 'ta' variable and its coordinates
See: https://docs.dask.org/en/latest/array-chunks.html
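The same 'auto' option can be applied at open time; a hedged sketch (the file name and the 'ta' variable are assumptions):
import xarray as xr

# Chunk lazily while opening, instead of calling .chunk() afterwards.
ds = xr.open_dataset('tas_day_example.nc', chunks={'time': 'auto'})

# Inspect the chunking dask chose under the configured array.chunk-size.
print(ds.ta.chunks)
print(ds.ta.data.chunksize)   # shape of a single chunk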
The default chunk size can be changed with a config file or environment variable:
dask.config.set({'array.chunk-size': '256MiB'})
See: https://docs.dask.org/en/latest/configuration.html
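For completeness, a sketch of the equivalent ways to change it (the file path is dask's default user config location):
import dask

# 1) Programmatically, for the whole session or as a temporary override:
dask.config.set({'array.chunk-size': '256MiB'})
with dask.config.set({'array.chunk-size': '64MiB'}):
    pass  # operations here see the smaller chunk-size budget

# 2) Via a YAML file in ~/.config/dask/ (e.g. ~/.config/dask/dask.yaml):
#    array:
#      chunk-size: 256MiB
#
# 3) Via an environment variable, which dask maps onto the array.chunk-size key:
#    export DASK_ARRAY__CHUNK_SIZE=256MiB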
The available memory for a subset (etc.) operation could be configured using Slurm:
salloc --mem=1024   # request 1024 MB for the allocation
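A purely hypothetical sketch of tying the two together inside the job (the 1/8 factor is an arbitrary safety margin, not a recommendation):
import os
import dask

# Slurm exports SLURM_MEM_PER_NODE (in MB) when --mem is given to salloc/sbatch.
mem_mb = int(os.environ.get('SLURM_MEM_PER_NODE', 1024))

# Use a fraction of the allocation as the dask chunk-size budget.
dask.config.set({'array.chunk-size': f'{max(mem_mb // 8, 32)}MiB'})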
@cehbrecht if I understand your findings correctly, the simplest solution is:
dask.config.set({'array.chunk-size': .....})
Sounds great.
But we do also need rules on:
Closing - now being implemented in other issues.
Sooner or later a user makes a "larger than memory" request.
We need to implement an appropriate level of Dask chunking in open dataset operations so that we avoid memory errors.
This needs some thought, but this example may be of use: