google / xarray-beam

Distributed Xarray with Apache Beam
https://xarray-beam.readthedocs.io
Apache License 2.0

Rechunk memory usage seems over `max_mem` limit #96

Closed: norlandrhagen closed this issue 2 months ago

norlandrhagen commented 3 months ago

Hi there 👋

I'm trying out xarray-beam for some Zarr rechunking, and I'm wondering if anyone has tips for keeping memory usage near the `max_mem` limit during the rechunk operation.

I first tried rechunking a 128 GB CMIP6 dataset and hit an OOM error at around 120 GB of RAM.

[Screenshot: memory profile of the failing run on the 128 GB Zarr store]

Next, I tried rechunking a much smaller dataset (~10 GB); it finished, but memory consumption climbed to about 34 GB along the way.

[Screenshot: memory profile of the completed run on the 10 GB Zarr store, showing high memory usage]

I'm basing my rechunking pipeline on the ERA5_rechunking.py example in the docs, using the local runner and the default `max_mem` of 1 GB.

Small dataset example:

```python
import apache_beam as beam
import xarray_beam as xbeam

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/pangeo-forge/EOBS-feedstock/eobs-wind-speed.zarr'
# store = 'https://cpdataeuwest.blob.core.windows.net/cp-cmip/version1/data/DeepSD/ScenarioMIP.CCCma.CanESM5.ssp245.r1i1p1f1.day.DeepSD.pr.zarr'
output_store = 'eobs_test.zarr'

# Lazily open the source store and build a lazy template for the output's metadata.
source_dataset, source_chunks = xbeam.open_zarr(store)
template = xbeam.make_template(source_dataset)

target_chunks = {'time': 80, 'latitude': 350, 'longitude': 511}
# Largest dtype size across variables; Rechunk uses it to estimate chunk memory.
itemsize = max(variable.dtype.itemsize for variable in template.values())

with beam.Pipeline(
    runner='DirectRunner',
    options=beam.pipeline.PipelineOptions(['--num_workers', '2']),
) as root:
    (
        root
        | xbeam.DatasetToChunks(source_dataset, source_chunks, split_vars=True)
        | xbeam.Rechunk(
            source_dataset.sizes,
            source_chunks,
            target_chunks,
            itemsize=itemsize,
        )
        | xbeam.ChunksToZarr(output_store, template, zarr_chunks=target_chunks)
    )
```
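As a back-of-the-envelope check, a single target chunk here is far smaller than the 1 GB `max_mem` default, so the chunk data itself shouldn't be what blows the budget. A minimal sketch of that arithmetic, assuming float32 data (the dtype isn't stated above):

```python
import numpy as np

# One target chunk: time=80 x latitude=350 x longitude=511 elements.
# Assuming float32 (4 bytes per element); double this for float64.
chunk_bytes = 80 * 350 * 511 * np.dtype("float32").itemsize
print(f"{chunk_bytes / 1e6:.1f} MB per target chunk")  # ~57.2 MB, well under 1 GB
```

For what it's worth, `xbeam.Rechunk` also accepts `max_mem` explicitly (it defaults to `2**30` bytes); note that it bounds the size of intermediate chunks, not the total memory the runner holds.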

Thanks!

shoyer commented 3 months ago

I think this is probably an issue with the DirectRunner, which stores all intermediate data in memory rather than writing it to disk.

norlandrhagen commented 2 months ago

Thanks @shoyer! I tried a PySpark runner and the OOM issues disappeared.
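
For anyone landing here with the same symptom, a minimal sketch of swapping out the DirectRunner, assuming the portable Spark runner described in the Beam docs (it needs a Java runtime, and exact flags vary by Beam version):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# "--runner=SparkRunner" runs the pipeline against an auto-started local
# Spark job server instead of holding all intermediate data in memory
# the way the DirectRunner does.
options = PipelineOptions(["--runner=SparkRunner"])
with beam.Pipeline(options=options) as root:
    ...  # same DatasetToChunks | Rechunk | ChunksToZarr chain as above
```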