pangeo-data / xESMF

Universal Regridder for Geospatial Data
http://xesmf.readthedocs.io/
MIT License

v0.8.5 memory issues #371

Open malmans2 opened 5 days ago

malmans2 commented 5 days ago

Hi there,

We started experiencing memory issues in various workflows after upgrading from 0.8.4 to 0.8.5. I guess we have been relying on the internal chunking applied between versions 0.8.0 and 0.8.4 (sorry, I don't have time to dig deeper right now).

Is it possible to somehow activate the former chunking behavior? If not, would it be worth making it an opt-in feature?

Thanks!

aulemahal commented 5 days ago

Hi! Really sorry for this change. In 0.8.5 we decided to go back to the chunking behaviour of 0.7 because it seemed to be the one that worked best for the average use of xESMF.

In the regridder call there is an output_chunks argument you can use to prescribe the resulting chunking. The behaviour from 0.8.0 to 0.8.4 was to have this the same as the input chunks. The behaviour in 0.7 and 0.8.5 is the same EXCEPT if the input has a single chunk in the spatial dimension, in which case the output will also have a single chunk.

I suggest you explicitly pass an output_chunks with a value that makes sense for your workflow. In your case, I'm guessing a good starting value would be the shape of the input data.
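
A minimal sketch of that suggestion, under a few assumptions: the helper name starting_output_chunks is hypothetical (it is not part of xESMF), the grid shapes are made up, and output_chunks is passed as a tuple with one entry per spatial axis of the output. It simply takes the input spatial shape as the starting chunk size and caps it at the output grid size.

```python
def starting_output_chunks(in_spatial_shape, out_spatial_shape):
    """Hypothetical helper (not an xESMF function).

    Use the input spatial shape as the output chunk size,
    capped so that a chunk is never larger than the output grid.
    """
    return tuple(
        min(n_in, n_out)
        for n_in, n_out in zip(in_spatial_shape, out_spatial_shape)
    )


# Illustrative shapes only: a 180x360 input grid upsampled to 1800x3600.
chunks = starting_output_chunks((180, 360), (1800, 3600))
print(chunks)  # (180, 360)

# With an existing regridder and dask-backed input (not run here):
# ds_out = regridder(ds_in, output_chunks=chunks)
```

This is only a starting point; larger or smaller chunks may suit a given workflow better depending on available memory.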

FYI: From the prior feedback and my own experiments, it seems that in many cases of upsampling (a larger destination grid), chunking the output with the same chunksize as the input was increasing the number of dask tasks too fast, and this had a worse effect on performance than allowing large chunks (i.e. single chunks) in the output. Of course, this is as long as your memory can take it. I'm guessing you are increasing the grid size by a much larger factor here.
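
To make the trade-off concrete, here is a back-of-the-envelope sketch with made-up grid and chunk sizes. It only counts output chunks as a rough proxy for the number of dask tasks (the real task graph depends on xESMF and dask internals) and estimates the size of one single-chunk output field, assuming float64 data.

```python
import math


def n_chunks(shape, chunks):
    """Number of chunks needed to tile `shape` with chunks of size `chunks`."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


# Illustrative values only, not taken from the issue.
in_shape, in_chunks = (180, 360), (90, 90)
out_shape = (1800, 3600)  # 10x upsampling along each spatial axis

n_in = n_chunks(in_shape, in_chunks)               # 2 * 4 = 8
n_out_same_size = n_chunks(out_shape, in_chunks)   # 20 * 40 = 800
n_out_single = n_chunks(out_shape, out_shape)      # 1

# Memory held by one 2D output field in a single chunk (float64 assumed).
single_chunk_mib = math.prod(out_shape) * 8 / 2**20

print(n_in, n_out_same_size, n_out_single)  # 8 800 1
print(round(single_chunk_mib, 1))           # 49.4
```

In this example, keeping the input chunk size multiplies the number of spatial chunks by 100, while a single output chunk costs about 49 MiB per 2D field, multiplied by however many non-spatial elements sit in the same chunk.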

malmans2 commented 5 days ago

Hi Pascal,

Thanks for the details! OK, I'll try to use output_chunks and will let you know how it goes.