Open · cgahr opened this issue 1 year ago
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
Note that for computing dask arrays you should use da.compute(), which doesn't have that issue, instead of da.as_numpy().
That said, this seems to be a bug in the implementation of .as_numpy. I'm not sure how to fix that without special-casing PandasMultiIndex, but maybe the answer is to remove the multi-index coordinate (r) instead?
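For illustration, a minimal sketch of the difference (assuming dask is installed; the array, dimensions and chunking below are made up, not taken from the report):

import numpy as np
import xarray as xr

# Hypothetical dask-backed DataArray (requires dask).
da = xr.DataArray(np.arange(6).reshape(2, 3), dims=("x", "y")).chunk({"x": 1})

computed = da.compute()  # evaluates the dask graph; coordinates and indexes are left alone
as_np = da.as_numpy()    # additionally tries to coerce coordinates to numpy arrays

print(type(computed.data))  # <class 'numpy.ndarray'>
print(type(as_np.data))     # <class 'numpy.ndarray'>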
Thanks for the report @cgahr.
as_numpy() has not been refactored during the Xarray index refactor, which indeed may cause some issues, especially with multi-indexes (in the current implementation each coordinate variable is treated independently of the others, whereas we should treat multi-index coordinates together).
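To illustrate (the coordinate names here are made up, and this assumes an xarray version that still accepts assigning a pandas MultiIndex directly to a dimension coordinate):

import numpy as np
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("r", "s"))
da = xr.DataArray(np.arange(4), dims="z", coords={"z": midx})

# "z", "r" and "s" show up as three separate coordinate variables, but all
# three are backed by the same pandas MultiIndex, so they need to be handled
# together rather than converted one by one.
print(da.coords)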
but maybe the answer is to remove the multi-index coordinate (r) instead?
The docstring of DataArray.as_numpy() says "Coerces wrapped data and coordinates into numpy arrays, returning a DataArray", but this is actually not true for indexed coordinates (even for single dimension coordinates):
>>> da.as_numpy()._coords["z"]._data # not a numpy array!
PandasIndexingAdapter(array=Int64Index([0, 1, 2, 3, 4], dtype='int64'), dtype=dtype('int64'))
If we want to actually coerce all the coordinates into numpy arrays, then we should remove their index (if any). When the same underlying data is shared between the coordinates and their index (as is the case for Xarray's PandasIndex and PandasMultiIndex wrappers), changing the coordinate data would make it out-of-sync with the index.
Alternatively, we could fix the multi-index issue reported here and clarify in the documentation that as_numpy() does not coerce indexed coordinates.
Or we could add an option for choosing between these two behaviors.
Note that for computing dask arrays you should use da.compute(), which doesn't have that issue, instead of da.as_numpy().
I wanted to convert the DataArray's underlying data type to numpy. And as far as I understood, the proper way to do this is using as_numpy().
Alternatively, we could fix the multi-index issue reported here and clarify in the documentation that as_numpy() does not coerce indexed coordinates.
For me, it was actually not clear at all that as_numpy() should also coerce the indices. In my mind, it simply converts da.data to a numpy array. And in this case, IMO it should not touch the indices.
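A minimal sketch of that reading of as_numpy(), converting only the wrapped data while leaving coordinates and indexes untouched (da stands for any, possibly dask-backed, DataArray):

import numpy as np

# Coerce only the underlying data to numpy; coordinates and indexes are
# carried over unchanged by copy().
da_np = da.copy(data=np.asarray(da.data))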
And in this case, IMO it should not touch the indices.
I mostly agree, but I'm not sure, considering that:
The behavior that would make the most sense to me in terms of clarity and predictability would be to either convert all the coordinates or not touch them at all. Practically, however, perhaps the best approach is to convert all "pure" array data (i.e., data variables + unindexed coordinates but not the indexed coordinates).
What happened?
I have a DataArray with a MultiIndex. In some cases I use dask, so I call .as_numpy() to compute the array and store it in memory. I would expect that this call does NOT change the MultiIndex. This is the case for .persist(); however, it's not the case for .as_numpy().
In the following MWE the original coordinates are:
After .persist() we get the same result. After .as_numpy() we get:
which is not the same and can lead to issues down the line.
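A rough sketch of this kind of setup (not the original MWE; the coordinate names and values are illustrative assumptions, it requires dask to be installed, and it assumes an xarray version that accepts a pandas MultiIndex as a coordinate):

import numpy as np
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("r", "s"))
da = xr.DataArray(np.arange(4), dims="z", coords={"z": midx}).chunk({"z": 2})

print(da.coords)             # original coordinates, with a MultiIndex on "z"
print(da.persist().coords)   # same coordinates; the MultiIndex is preserved
print(da.as_numpy().coords)  # per the report, the MultiIndex coordinate is changed here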
What did you expect to happen?
No response
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment