Closed: chenyangkang closed this issue 1 month ago
Can we make an MCVE? (otherwise I'll leave it open for a while in case anyone has immediate ideas)
You're doing two different things: orthogonal (outer) indexing versus vectorized (pointwise) indexing. See https://tutorial.xarray.dev/intermediate/indexing/advanced-indexing.html for background on this topic.
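To make that distinction concrete, here is a minimal sketch on synthetic data (all names and sizes are illustrative, not from the issue), contrasting what plain numpy-array indexers do against DataArray indexers that share a dimension:

```python
import numpy as np
import xarray as xr

# A small synthetic dataset standing in for ERA5.
ds = xr.Dataset(
    {"t2m": (("time", "longitude", "latitude"), np.random.rand(4, 5, 6))},
    coords={
        "time": np.arange(4),
        "longitude": np.linspace(0.0, 4.0, 5),
        "latitude": np.linspace(0.0, 5.0, 6),
    },
)

times = np.array([0, 1, 2])
lons = np.array([0.1, 1.9, 3.2])
lats = np.array([0.4, 2.6, 4.8])

# Orthogonal (outer) indexing: plain arrays select every combination
# of the three label arrays, giving a 3 x 3 x 3 cube here.
outer = ds.sel(time=times, longitude=lons, latitude=lats, method="nearest")
print(outer.t2m.shape)  # (3, 3, 3)

# Vectorized (pointwise) indexing: DataArray indexers sharing a dimension
# select one value per point, giving a length-3 result.
points = ds.sel(
    time=xr.DataArray(times, dims="points"),
    longitude=xr.DataArray(lons, dims="points"),
    latitude=xr.DataArray(lats, dims="points"),
    method="nearest",
)
print(points.t2m.shape)  # (3,)
```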
What happened?
I was mapping a large set of three-dimensional points (time, longitude, latitude) onto a large climate dataset, trying to "annotate" each point with the climate data. Because the dataset is large, I used Dask chunks. But when I query the climate dataset using the sel method with numpy-array indexers, memory use seems to explode, growing at least linearly as I increase the number of query points.
```python
ERA5['longitude'] = ERA5['longitude'].astype('float32')
ERA5['latitude'] = ERA5['latitude'].astype('float32')

ERA5_features = ERA5[
    ['u10', 'v10', 'u100', 'v100', 't2m', 'tp', 'cbh', 'tcc',
     'e', 'rsn', 'sd', 'stl1', 'cvh', 'cvl']
].sel(
    time=my_data['time'].values,
    longitude=my_data['longitude'].values,
    latitude=my_data['latitude'].values,
    method='nearest',
)
```
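To see why this blows up: with plain array indexers, sel performs outer indexing, so the result spans every combination of the three label arrays. A back-of-the-envelope sketch, assuming a hypothetical 100,000 query points (the actual count isn't stated in the issue):

```python
# Rough arithmetic, not from the issue: outer indexing with three
# length-n indexers yields an n x n x n result per variable.
n_points = 100_000        # hypothetical number of query points
n_vars = 14               # variables selected above
bytes_per_value = 4       # float32

cells_per_var = n_points ** 3
total_bytes = cells_per_var * n_vars * bytes_per_value
print(f"{cells_per_var:.1e} cells per variable")            # 1.0e+15
print(f"{total_bytes / 1e15:.0f} PB across all variables")  # 56 PB
```

Even though the data itself stays lazy under Dask, a selection of that shape forces xarray and Dask to materialize enormous index structures and task graphs, which is enough to exhaust memory on its own.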
Interestingly, this memory issue is solved by transforming the query variables first:
This won't cause any memory issue.
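The transformed query isn't shown above, but given the indexing distinction noted earlier in the thread, a transformation with this effect is wrapping the query arrays in DataArray indexers that share a common dimension, which switches sel to pointwise lookup. A sketch under that assumption (the "points" dimension name is illustrative):

```python
# Hypothetical reconstruction, assuming the fix was vectorized indexing;
# the dimension name "points" is chosen for illustration.
import xarray as xr

ERA5_features = ERA5[
    ['u10', 'v10', 'u100', 'v100', 't2m', 'tp', 'cbh', 'tcc',
     'e', 'rsn', 'sd', 'stl1', 'cvh', 'cvl']
].sel(
    time=xr.DataArray(my_data['time'].values, dims='points'),
    longitude=xr.DataArray(my_data['longitude'].values, dims='points'),
    latitude=xr.DataArray(my_data['latitude'].values, dims='points'),
    method='nearest',
)
# One value per query point instead of one per combination of labels.
```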
What did you expect to happen?
I would expect memory use to stay roughly constant, since the data is chunked with Dask.
Minimal Complete Verifiable Example
No response
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment