Closed norlandrhagen closed 2 months ago
Hi there 👋
I'm trying out xarray-beam to do some Zarr rechunking. I'm wondering if anyone has any tips to control `max_mem` used in the rechunk operation.
I first tried rechunking a 128GB CMIP6 dataset and ran into an OOM error at around 120GB of RAM.
Failing 128GB Zarr store:
![image](https://github.com/google/xarray-beam/assets/22455466/e03c02b3-ffc9-44e5-a242-a269f6b660b1)
Next, I tried rechunking a much smaller dataset (~10GB) and the memory consumption climbed to about 34GB before finishing.
Completing 10GB Zarr store (with high memory usage):
![image](https://github.com/google/xarray-beam/assets/22455466/03116d62-4a12-4a8e-8d17-6a9c883003ca)
I'm basing my rechunking pipeline on the ERA5_rechunking.py example in the docs, using the local runner and the `max_mem` default of 1 GB.
Small dataset example:
```python
import apache_beam as beam
import xarray_beam as xbeam

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/pangeo-forge/EOBS-feedstock/eobs-wind-speed.zarr'
# store = 'https://cpdataeuwest.blob.core.windows.net/cp-cmip/version1/data/DeepSD/ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.DeepSD.pr.zarr'
output_store = 'eobs_test.zarr'

source_dataset, source_chunks = xbeam.open_zarr(store)
template = xbeam.make_template(source_dataset)
target_chunks = {'time': 80, 'latitude': 350, 'longitude': 511}
itemsize = max(variable.dtype.itemsize for variable in template.values())

with beam.Pipeline(runner="DirectRunner", options=beam.pipeline.PipelineOptions(["--num_workers", '2'])) as root:
    (
        root
        | xbeam.DatasetToChunks(source_dataset, source_chunks, split_vars=True)
        | xbeam.Rechunk(
            source_dataset.sizes,
            source_chunks,
            target_chunks,
            itemsize=itemsize,
        )
        | xbeam.ChunksToZarr(output_store, template, zarr_chunks=target_chunks)
    )
```
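For what it's worth, one sanity check I can do is estimate how big a single target chunk is in memory, to compare against `max_mem` — a minimal back-of-the-envelope sketch, assuming float32 data (itemsize 4) and the target chunks above:

```python
import math

# Target chunk shape from the pipeline above.
target_chunks = {'time': 80, 'latitude': 350, 'longitude': 511}
itemsize = 4  # bytes per element, assuming float32 variables

# Bytes needed to hold one target chunk of a single variable in memory.
chunk_bytes = itemsize * math.prod(target_chunks.values())
print(f"{chunk_bytes / 1e6:.1f} MB per chunk")  # ~57.2 MB
```

So each individual chunk is well under the 1 GB `max_mem` default, which makes the multi-GB memory climb surprising.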
Thanks!
I think this is probably an issue with the DirectRunner, which stores all data in memory rather than writing to disk.
Thanks @shoyer! I tried with the PySpark runner and the OOM issues disappeared.
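For anyone hitting the same thing: switching runners is just a matter of passing different pipeline options. A minimal sketch of the flags I mean — `--runner=SparkRunner` and `--spark_master_url` are standard Beam Python options, but the `local[4]` master URL here is just a placeholder assumption:

```python
# Hypothetical pipeline options to run on Spark instead of the in-memory
# DirectRunner. These strings use Beam's standard Spark runner flags.
spark_args = [
    "--runner=SparkRunner",
    "--spark_master_url=local[4]",  # or a real cluster master URL
]

# Passed when constructing the pipeline, e.g.:
# beam.Pipeline(options=beam.pipeline.PipelineOptions(spark_args))
```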