pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Memory explode when sel using plain numpy array under Dask chunk mode #9663

Closed chenyangkang closed 1 month ago

chenyangkang commented 1 month ago

What happened?

I was mapping a bunch of three-dimensional points (time, longitude, latitude) onto a large climate dataset, trying to "annotate" each point with the climate data. Because the dataset is large, I used Dask chunks. But when I query the climate dataset using the `sel` method with plain numpy arrays, memory usage seems to explode (at least linearly, possibly faster) as I increase the number of query points.

1. Loading environmental data:

```py
import xarray as xr

ERA5 = xr.open_mfdataset('../../data/02.Env_data/ERA5/download_ERA5_2023_*_processed.nc',
                         engine='netcdf4',
                         chunks={'time': 1},
                         concat_dim="time", combine="nested",
                         data_vars="minimal", coords="minimal", compat="override")

ERA5['longitude'] = ERA5['longitude'].astype('float32')
ERA5['latitude'] = ERA5['latitude'].astype('float32')
```


2. Query using custom points:

```py
ERA5_features = ERA5[
    ['u10','v10','u100','v100','t2m','tp','cbh','tcc','e','rsn','sd','stl1','cvh','cvl']
].sel(
    time=my_data['time'].values,
    longitude=my_data['longitude'].values,
    latitude=my_data['latitude'].values,
    method='nearest'
)
```


3. Compute:

```py
ERA5_features.load()
```

Interestingly, this memory issue is solved by converting the query variables to `DataArray`s first:

```py
time = xr.DataArray(my_data['time'].values.astype('datetime64[ns]'), dims="points")
longitude = xr.DataArray(my_data['longitude'].values, dims="points")
latitude = xr.DataArray(my_data['latitude'].values, dims="points")
ERA5_features = ERA5[
    ['u10','v10','u100','v100','t2m','tp','cbh','tcc','e','rsn','sd','stl1','cvh','cvl']
].sel(
    time=time,
    longitude=longitude,
    latitude=latitude,
    method='nearest'
)
```

This won't cause any memory issue.

What did you expect to happen?

I would expect memory usage to stay roughly constant, since the data is chunked by Dask.
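If the plain-array query triggers outer (orthogonal) indexing — a plausible reading of the behavior above — the result grows cubically in the number of query points rather than staying constant, which would match the observed blow-up. A rough back-of-envelope sketch, assuming a single float32 variable and three indexed dimensions:

```py
# Hypothetical scaling sketch: result size for n query points per
# dimension (time, longitude, latitude), float32 values.
bytes_per_value = 4

def outer_bytes(n):
    # outer (orthogonal) indexing materializes the full n x n x n cube
    return n ** 3 * bytes_per_value

def pointwise_bytes(n):
    # pointwise (vectorized) indexing returns one value per point
    return n * bytes_per_value

for n in (1_000, 10_000):
    print(n, outer_bytes(n), pointwise_bytes(n))
# At n=1,000 the outer result is already ~4 GB per variable;
# the pointwise result is only ~4 KB.
```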

Minimal Complete Verifiable Example

No response

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.10 (main, Oct 3 2024, 07:29:13) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.119.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.3-development
xarray: 2024.9.0
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.14.1
netCDF4: 1.7.1.post2
pydap: None
h5netcdf: 1.4.0
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2024.10.0
distributed: None
matplotlib: 3.9.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.2
conda: 24.9.2
pytest: None
mypy: None
IPython: 8.27.0
sphinx: None
welcome[bot] commented 1 month ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

max-sixty commented 1 month ago

Can we make an MCVE? (otherwise I'll leave it open for a while in case anyone has immediate ideas)

dcherian commented 1 month ago

You're doing two different things: orthogonal (or outer) indexing vs. vectorized (or pointwise) indexing. See https://tutorial.xarray.dev/intermediate/indexing/advanced-indexing.html to build some understanding of this topic.
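The distinction can be sketched with a tiny synthetic dataset (toy sizes, not the ERA5 files): plain numpy indexers select the outer product of the labels, while `DataArray` indexers sharing a common dim select one value per point.

```py
import numpy as np
import xarray as xr

# Toy stand-in dataset: a 10 x 10 x 10 grid (hypothetical, not ERA5)
ds = xr.Dataset(
    {"t2m": (("time", "longitude", "latitude"), np.zeros((10, 10, 10)))},
    coords={
        "time": np.arange(10),
        "longitude": np.linspace(0, 9, 10),
        "latitude": np.linspace(0, 9, 10),
    },
)

n = 5  # five query points
t = np.arange(n)
lon = np.arange(n).astype(float)
lat = np.arange(n).astype(float)

# Plain numpy arrays: orthogonal (outer) indexing -> n x n x n values
outer = ds.sel(time=t, longitude=lon, latitude=lat, method="nearest")
print(outer["t2m"].shape)  # (5, 5, 5)

# DataArrays with a shared "points" dim: vectorized (pointwise)
# indexing -> n values, one per point
pts = dict(
    time=xr.DataArray(t, dims="points"),
    longitude=xr.DataArray(lon, dims="points"),
    latitude=xr.DataArray(lat, dims="points"),
)
pointwise = ds.sel(**pts, method="nearest")
print(pointwise["t2m"].shape)  # (5,)
```

This is why the workaround in the issue helps: the outer query materializes the full cube of coordinate combinations, while the pointwise query only pulls one value per query point.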