pangeo-data / xESMF

Universal Regridder for Geospatial Data
http://xesmf.readthedocs.io/
MIT License

memory issues regridding dataset with many variables #245

Closed ckoven closed 1 year ago

ckoven commented 1 year ago

Hi All, We are trying to use xESMF to do a conservative regrid of a dataset from 1/4 degree to 4 degree global. The file we are trying to regrid (which is from here, specifically this file, 16 GB) has ~100 variables, and I am finding that if I try to regrid the full file as an xarray Dataset, the script crashes after memory usage exceeds ~300 GB. It is possible to make it work by looping over each of the variables as separate DataArrays (a sketch of that workaround follows the script below), but I was curious whether this is a known issue, or whether there is some other way of regridding an entire xarray Dataset that isn't as memory intensive.

Thanks!

Script below to replicate the issue. I am using xESMF 0.7.0 on macOS arm64, installed via conda.

import xesmf as xe
import xarray as xr
import numpy as np

print(xe.__version__)

## load LUH2 transition matrix file
fin = xr.open_dataset('transitions.nc', decode_times=False)

# make some changes to metadata for xESMF conservative regrid
finb = fin.drop(labels=['lat_bounds','lon_bounds'])
finb["lat_b"] = np.insert(fin.lat_bounds[:,1].data,0,fin.lat_bounds[0,0].data)
finb["lon_b"] = np.insert(fin.lon_bounds[:,1].data,0,fin.lon_bounds[0,0].data)
finb["time"] = np.arange(len(fin["time"]), dtype=np.int16) + 850

# make mask of LUH2 data
finb["mask"] = xr.where(~np.isnan(fin["primf_to_range"].isel(time=0)), 1, 0)

# load CLM surface data file
fin2 = xr.open_dataset('surfdata_4x5_hist_16pfts_Irrig_CMIP6_simyr2000_c190214.nc')

# make some changes to CLM surface file metadata
fin2b = fin2.rename_dims(dims_dict={'lsmlat':'latitude','lsmlon':'longitude'})
fin2b['longitude'] = fin2b.LONGXY.isel(latitude=0)
fin2b['latitude'] = fin2b.LATIXY.isel(longitude=0)

# make mask of CLM surface data file
fin2b["mask"] = fin2b["PCT_NATVEG"]> 0.

# define the regridder transformation
regridder = xe.Regridder(finb, fin2b, "conservative")

# regrid the data
fin_transitions_regrid = regridder(finb)
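
For reference, a minimal sketch of the per-variable workaround mentioned above (untested; it simply regrids each spatial variable as its own DataArray and merges the results):

# loop over the variables as separate DataArrays instead of regridding the whole Dataset at once
out_vars = {}
for name, var in finb.data_vars.items():
    if {'lat', 'lon'}.issubset(var.dims):  # skip helper variables without spatial dims
        out_vars[name] = regridder(var)
fin_transitions_regrid = xr.Dataset(out_vars)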

cc @glemieux

huard commented 1 year ago

Hi @ckoven

Thanks for the report, this is useful info and not something I've heard about before. We're hoping to have an intern working on dask with xesmf this summer, I'll flag this as an issue to look into. If you have ideas on how to approach this (beyond sequential looping), let us know.

ckoven commented 1 year ago

Thanks @huard -- that's great. I don't have any ideas for how to approach this though, sorry.

aulemahal commented 1 year ago

Hi @ckoven, I don't see any mention of dask in your example script; maybe that could help?

Regridding is essentially a dot product of the 2D data with a 4D weight matrix (input dims x output dims). The weights are stored as a sparse matrix, but there is a step where some minimal expansion must happen so that the dot product can be computed. Also, regridding a Dataset regrids each variable individually, so this expansion is done once per variable.
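
For intuition, here is a rough, self-contained sketch of that weight application; the grid sizes and sparsity are made up for illustration and are not taken from the files above:

import numpy as np
import scipy.sparse as sp

# illustrative sizes only: a 1/4-degree source grid and a coarse destination grid
n_in = 1440 * 720
n_out = 72 * 46

# conservative weights form a sparse (n_out, n_in) matrix: each output cell
# only depends on the handful of input cells that overlap it
weights = sp.random(n_out, n_in, density=1e-5, format="csr")

field_in = np.random.rand(n_in)   # one 2D field, flattened to 1D
field_out = weights @ field_in    # regridded field, shape (n_out,)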

With dask, Python will "know" the series of steps beforehand and be able to schedule them, i.e. run a certain number of steps in parallel to go faster, but also to limit RAM usage.
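
As a toy illustration of that laziness (this is not the regridding pipeline itself, just how dask defers and schedules work):

import dask.array as da

x = da.random.random((10000, 1000), chunks=(1000, 1000))  # no data materialized yet
y = (x * 2).sum()                                          # only builds the task graph
print(y.compute())                                         # the scheduler now runs the chunks, a few at a time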

I suggest adding the following. In the imports:

from dask.distributed import Client
Client(n_workers=1, threads_per_worker=4, memory_limit='30 GB')

Use numbers that suit your machine, but beware that increasing the number of threads too much can also increase RAM usage, depending on the chunk size chosen below.

When opening the files:

fin = xr.open_dataset('transitions.nc', decode_times=False, chunks={'time': 1})

This will create a single chunk for each variable and time slice. A larger number might be more performant; the goal would be to have ~30 MB per chunk (totally arbitrary and based on my intuition).
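
One way to sanity-check the resulting chunk size (the variable name comes from the script above; this check is just a suggestion, not part of xESMF):

import numpy as np

var = fin["primf_to_range"]   # any chunked variable
chunk_bytes = np.prod([c[0] for c in var.chunks]) * var.dtype.itemsize
print(f"~{chunk_bytes / 1e6:.0f} MB per chunk")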

And to actually execute the regridding:

fin_transitions_regrid = regridder(finb).load()
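
If even the regridded result is too large to hold in memory comfortably, a possible variation (output filename hypothetical) is to skip .load() and write straight to disk, so that dask streams the chunks while computing:

fin_transitions_regrid = regridder(finb)
fin_transitions_regrid.to_netcdf('transitions_4x5deg.nc')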

ckoven commented 1 year ago

Thanks @aulemahal, that worked!

aulemahal commented 1 year ago

Nice to hear! I'll close the issue then; feel free to reopen if you run into further problems related to this memory issue!