pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Memory explode when sel using plain numpy array under Dask chunk mode #9663

Closed chenyangkang closed 1 month ago

chenyangkang commented 1 month ago

What happened?

I was mapping a bunch of three-dimensional points (time, longitude, latitude) onto a large climate dataset, trying to "annotate" each point with the climate data. Because the dataset is large, I used Dask chunks. But when I query the climate dataset using the `sel` method with plain numpy arrays, memory usage seems to explode (at least linearly, possibly faster) as I increase the number of query points.

1. Loading environmental data:

```py
import xarray as xr

ERA5 = xr.open_mfdataset('../../data/02.Env_data/ERA5/download_ERA5_2023_*_processed.nc',
                         engine='netcdf4',
                         chunks={'time': 1},
                         concat_dim="time", combine="nested",
                         data_vars="minimal", coords="minimal", compat="override")

ERA5['longitude'] = ERA5['longitude'].astype('float32')
ERA5['latitude'] = ERA5['latitude'].astype('float32')
```


2. Query using custom points:

```py
ERA5_features = ERA5[
    ['u10','v10','u100','v100','t2m','tp','cbh','tcc','e','rsn','sd','stl1','cvh','cvl']
].sel(
    time=my_data['time'].values,
    longitude=my_data['longitude'].values,
    latitude=my_data['latitude'].values,
    method='nearest'
)
```


3. Compute:

```py
ERA5_features.load()
```

Interestingly, this memory issue is solved by converting the query variables to `DataArray`s first:

```py
time = xr.DataArray(my_data['time'].values.astype('datetime64[ns]'), dims="points")
longitude = xr.DataArray(my_data['longitude'].values, dims="points")
latitude = xr.DataArray(my_data['latitude'].values, dims="points")
ERA5_features = ERA5[
    ['u10','v10','u100','v100','t2m','tp','cbh','tcc','e','rsn','sd','stl1','cvh','cvl']
].sel(
    time=time,
    longitude=longitude,
    latitude=latitude,
    method='nearest'
)
```

This won't cause any memory issue.

What did you expect to happen?

I would expect memory usage to stay roughly constant, since the data is chunked by Dask.
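If the plain-array query triggers outer (orthogonal) indexing — a plausible reading of the behavior above — the result grows cubically in the number of query points rather than staying constant, which would match the observed blow-up. A rough back-of-envelope sketch, assuming a single float32 variable and three indexed dimensions:

```py
# Hypothetical scaling sketch: result size for n query points per
# dimension (time, longitude, latitude), float32 values.
bytes_per_value = 4

def outer_bytes(n):
    # outer (orthogonal) indexing materializes the full n x n x n cube
    return n ** 3 * bytes_per_value

def pointwise_bytes(n):
    # pointwise (vectorized) indexing returns one value per point
    return n * bytes_per_value

for n in (1_000, 10_000):
    print(n, outer_bytes(n), pointwise_bytes(n))
# At n=1,000 the outer result is already ~4 GB per variable;
# the pointwise result is only ~4 KB.
```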

Minimal Complete Verifiable Example

No response

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.10 (main, Oct 3 2024, 07:29:13) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.119.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.3-development
xarray: 2024.9.0
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.14.1
netCDF4: 1.7.1.post2
pydap: None
h5netcdf: 1.4.0
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2024.10.0
distributed: None
matplotlib: 3.9.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.2
conda: 24.9.2
pytest: None
mypy: None
IPython: 8.27.0
sphinx: None
welcome[bot] commented 1 month ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

max-sixty commented 1 month ago

Can we make an MCVE? (otherwise I'll leave it open for a while in case anyone has immediate ideas)

dcherian commented 1 month ago

You're doing two different things: orthogonal (or outer) indexing vs. vectorized (or pointwise) indexing. See https://tutorial.xarray.dev/intermediate/indexing/advanced-indexing.html to build some understanding of this topic.
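The distinction can be sketched with a tiny synthetic dataset (toy sizes, not the ERA5 files): plain numpy indexers select the outer product of the labels, while `DataArray` indexers sharing a common dim select one value per point.

```py
import numpy as np
import xarray as xr

# Toy stand-in dataset: a 10 x 10 x 10 grid (hypothetical, not ERA5)
ds = xr.Dataset(
    {"t2m": (("time", "longitude", "latitude"), np.zeros((10, 10, 10)))},
    coords={
        "time": np.arange(10),
        "longitude": np.linspace(0, 9, 10),
        "latitude": np.linspace(0, 9, 10),
    },
)

n = 5  # five query points
t = np.arange(n)
lon = np.arange(n).astype(float)
lat = np.arange(n).astype(float)

# Plain numpy arrays: orthogonal (outer) indexing -> n x n x n values
outer = ds.sel(time=t, longitude=lon, latitude=lat, method="nearest")
print(outer["t2m"].shape)  # (5, 5, 5)

# DataArrays with a shared "points" dim: vectorized (pointwise)
# indexing -> n values, one per point
pts = dict(
    time=xr.DataArray(t, dims="points"),
    longitude=xr.DataArray(lon, dims="points"),
    latitude=xr.DataArray(lat, dims="points"),
)
pointwise = ds.sel(**pts, method="nearest")
print(pointwise["t2m"].shape)  # (5,)
```

This is why the workaround in the issue helps: the outer query materializes the full cube of coordinate combinations, while the pointwise query only pulls one value per query point.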