pangeo-data / xESMF

Universal Regridder for Geospatial Data
http://xesmf.readthedocs.io/
MIT License

Do not chunk when the input isn't chunked #348

Closed · aulemahal closed this 2 months ago

aulemahal commented 3 months ago

#280 introduced support for chunking along the spatial dimensions. It also introduced a new behaviour: the regridded output has chunks of the same size as the input, unless otherwise specified. For a regridding that increases the resolution of a dataset with no chunks along the spatial dims, this behaviour differs from the previous code and might be unexpected.

For example, going from a 64x64 grid with a single 64x64 chunk to a 128x128 grid increases the number of chunks to 4. The regridding always adds 3 blockwise tasks to the dask graph, so the graph can grow faster than expected.
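As a rough sketch of that scenario (toy global grids built with `xesmf.util.grid_global`; the variable name and the tuple form passed to `output_chunks` are assumptions for illustration only):

```python
import dask.array as da
import xesmf as xe

# Coarse 64x64 global grid with one dask variable stored as a single 64x64 chunk.
ds_in = xe.util.grid_global(360 / 64, 180 / 64)
ds_in["data"] = (("y", "x"), da.random.random((64, 64), chunks=(64, 64)))

# Finer 128x128 target grid.
ds_out = xe.util.grid_global(360 / 128, 180 / 128)
regridder = xe.Regridder(ds_in, ds_out, "bilinear")

# 0.8 default: the output keeps the input's 64x64 chunk size, so the
# 128x128 result is split into 4 chunks even though the input had only 1.
print(regridder(ds_in["data"]).chunks)

# Explicitly requesting a single output chunk restores the pre-0.8 layout.
print(regridder(ds_in["data"], output_chunks=(-1, -1)).chunks)
```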

I think our default makes sense, but adding a warning might help users understand what's happening.

I am waiting on a colleague to see if explicitly setting output_chunks is enough to recover the expected performance in some edge cases, or if the 0.8 update has more unexpected performance regressions.

juliettelavoie commented 3 months ago

I tried different things and it never works as well as version 0.7.

  1. With the input having spatial chunks of -1 and output_chunks of -1 for all steps (technically the same chunks as 0.7), it takes much, much longer to perform the same task with 0.8.4. The workers need too much memory and waste time writing to disk.

  2. I tried initial chunks of -1 with output_chunks of 20. In that case, I killed the process after 1 h of the scheduler running with nothing happening on the workers.

  3. I tried: initial chunks: 30, grid1: 60, grid2: 120, final loc: 600. Similar behavior to case 1 (takes too long).

juliettelavoie commented 3 months ago

EDIT: I was not actually using -1 at every step when testing case 1. Putting -1 chunks everywhere does reproduce the previous behavior of regridding in a reasonable amount of time.

aulemahal commented 3 months ago

Maybe it was erroneous to think that defaulting to preserving the chunk size was the best choice? We added an argument in a somewhat unexpected place (the call instead of the init), and the default drastically changed the behaviour for the common case of upscaling a large dataset. One might say this is a regression.

Suggestion: output_chunks defaults to (-1, -1) if there are no chunks along the spatial dimensions (chunksize == shape). Otherwise, preserve the chunk size (output_chunks == input_chunks).
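As a standalone sketch of that rule (an illustrative helper, not the actual xESMF code):

```python
def default_output_chunks(input_chunks, input_shape):
    """For each spatial dim: -1 if the input has a single chunk spanning the
    whole dim (chunksize == shape), else the input's chunk size."""
    out = []
    for chunks, size in zip(input_chunks, input_shape):
        if len(chunks) == 1 and chunks[0] == size:  # not chunked along this dim
            out.append(-1)
        else:
            out.append(chunks[0])  # preserve the input chunk size (assumes uniform chunks)
    return tuple(out)

print(default_output_chunks(((64,), (64,)), (64, 64)))        # (-1, -1): pre-0.8 behaviour
print(default_output_chunks(((32, 32), (32, 32)), (64, 64)))  # (32, 32): chunk size preserved
```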

What do you think @juliettelavoie , @raphaeldussin, @huard ?

juliettelavoie commented 3 months ago

I like that. I think it's good to preserve the previous behavior when only -1 chunks were accepted. It might also be a bit easier to debug: if -1 chunks are too big and overwhelm the dask workers, it is pretty typical to use smaller chunks to fix the issue.

huard commented 3 months ago

I would support the principle of least surprise and pick defaults that reproduce past behavior.


aulemahal commented 3 months ago

Hi all,

The last commits add a special pathway to reproduce the pre-0.8 behaviour. For each dimension, the output chunk size defaults to -1 if the input is not chunked[^1] along the corresponding[^2] dimension. If the input is chunked, then the same chunk size is used.
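To illustrate the intent with the toy grids from the example in the issue description (the chunk layouts in the comments are the expected outcome as described, a sketch rather than verified output):

```python
import dask.array as da
import xesmf as xe

# 64x64 input grid, 128x128 output grid.
ds_in = xe.util.grid_global(360 / 64, 180 / 64)
ds_out = xe.util.grid_global(360 / 128, 180 / 128)
regridder = xe.Regridder(ds_in, ds_out, "bilinear")

# Input not chunked along y/x: the output defaults to a single chunk per dim,
# reproducing the pre-0.8 behaviour.
ds_in["a"] = (("y", "x"), da.random.random((64, 64), chunks=(64, 64)))
print(regridder(ds_in["a"]).chunks)   # expected: ((128,), (128,))

# Input chunked 32x32: the input chunk size is preserved in the output.
ds_in["b"] = (("y", "x"), da.random.random((64, 64), chunks=(32, 32)))
print(regridder(ds_in["b"]).chunks)   # expected: ((32, 32, 32, 32), (32, 32, 32, 32))
```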

I also updated the tests, docstring and notebook.

@charlesgauthier-udm, so you know.

[^1]: "Not chunked" as in "has only one chunk".
[^2]: "Corresponding" index-wise: the first dimension of the input vs the first of the output. Sometimes this can feel completely arbitrary, but I didn't know any better.