pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

`as_numpy` changes MultiIndex #8001

Open cgahr opened 1 year ago

cgahr commented 1 year ago

What happened?

I have a DataArray with a MultiIndex. In some cases, I use dask, so I call .as_numpy() to compute the array and store it in memory. I would expect that this call does NOT change the MultiIndex.

This is the case for .persist(); however, it is not the case for .as_numpy().

In the following MWE the original coordinates are:

Coordinates:
  * z        (z) int64 0 1 2 3 4
  * r        (r) object MultiIndex
  * x        (r) int64 0 0 0 1 1 1
  * y        (r) int64 0 1 2 0 1 2

After .persist() we get the same result. After .as_numpy() we get

Coordinates:
  * z        (z) int64 0 1 2 3 4
  * r        (r) object (0, 0) (0, 1) (0, 2) (1, 0) (1, 1) (1, 2)
  * x        (r) int64 0 0 0 1 1 1
  * y        (r) int64 0 1 2 0 1 2

which is not the same and can lead to issues down the line.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import xarray as xr

da = xr.DataArray(1, dims=('x', 'y', 'z'), coords={'x': [0, 1], 'y': [0, 1, 2], 'z': [0, 1, 2, 3, 4]})
da = da.stack(r=('x', 'y'))

xr.testing.assert_equal(da['r'], da.persist()['r'])   # ok
xr.testing.assert_equal(da['r'], da.as_numpy()['r'])  # err

xr.testing.assert_equal(da, da.persist())   # ok
xr.testing.assert_equal(da, da.as_numpy())  # ok but should err
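As a side note to the MWE, here is a minimal sketch (my example, using explicit numpy data instead of a broadcast scalar) showing that the stacked coordinate is backed by a pandas MultiIndex, which is the object as_numpy() ends up replacing:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Sketch: stack two dimensions and inspect the resulting index.
da = xr.DataArray(
    np.ones((2, 3)),
    dims=("x", "y"),
    coords={"x": [0, 1], "y": [0, 1, 2]},
).stack(r=("x", "y"))

# The "r" coordinate is backed by a pandas MultiIndex with 6 entries.
print(type(da.indexes["r"]))
```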

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.3 | packaged by conda-forge | (main, Apr 6 2023, 08:57:19) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-122.162-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.0
libnetcdf: None

xarray: 2023.6.0
pandas: 1.5.3
numpy: 1.25.0
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.2.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.4.1
distributed: 2023.4.1
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.4.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.7.2
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: 1.4.1
IPython: 8.13.2
sphinx: None

keewis commented 1 year ago

Note that for computing dask arrays you should use da.compute() instead of da.as_numpy(); compute() does not have this issue.

That said, this seems to be a bug in the implementation of .as_numpy. I'm not sure how to fix that without special-casing PandasMultiIndex, but maybe the answer is to remove the multi-index coordinate (r) instead?
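For illustration, removing the multi-index coordinate can already be done explicitly with reset_index; a short sketch (my example, not a proposed fix):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.ones((2, 3)),
    dims=("x", "y"),
    coords={"x": [0, 1], "y": [0, 1, 2]},
).stack(r=("x", "y"))

# reset_index drops the MultiIndex on "r"; x and y survive as plain
# (non-indexed) coordinates along the r dimension.
reset = da.reset_index("r")
print("r" in reset.indexes)  # False
```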

benbovy commented 1 year ago

Thanks for the report @cgahr.

as_numpy() was not refactored during the Xarray index refactor, which indeed may cause some issues, especially with multi-indexes: in the current implementation each coordinate variable is treated independently of the others, whereas multi-index coordinates should be treated together.

but maybe the answer is to remove the multi-index coordinate (r) instead?

The docstring of DataArray.as_numpy() says "Coerces wrapped data and coordinates into numpy arrays, returning a DataArray", but this is actually not true for indexed coordinates (even for single-dimension coordinates):

>>> da.as_numpy()._coords["z"]._data   # not a numpy array!
PandasIndexingAdapter(array=Int64Index([0, 1, 2, 3, 4], dtype='int64'), dtype=dtype('int64'))

If we want to actually coerce all the coordinates into numpy arrays, then we should remove their index (if any). When the same underlying data is shared between the coordinates and their index (as it is the case for Xarray's PandasIndex and PandasMultiIndex wrappers), changing the coordinate data would make it out-of-sync with the index.

Alternatively, we could fix the multi-index issue reported here and clarify in the documentation that as_numpy() does not coerce indexed coordinates.
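In the meantime, one workaround (my suggestion, not an official recommendation) is to convert only the wrapped array data via copy(data=...), which leaves the coordinates, and hence the MultiIndex, untouched:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.ones((2, 3)),
    dims=("x", "y"),
    coords={"x": [0, 1], "y": [0, 1, 2]},
).stack(r=("x", "y"))

# copy(data=...) swaps in new array data without touching coordinates,
# so the MultiIndex on "r" is preserved.
da_np = da.copy(data=np.asarray(da.data))
xr.testing.assert_equal(da["r"], da_np["r"])  # passes: index unchanged
```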

Or we could add an option for choosing between these two behaviors.

cgahr commented 1 year ago

note that for computing dask arrays you should use da.compute() instead of da.as_numpy(), which doesn't have that issue.

I wanted to convert the DataArray's underlying data to numpy, and as far as I understood, the proper way to do this is to use as_numpy().

Alternatively, we could fix the multi-index issue reported here and clarify in the documentation that as_numpy() does not coerce indexed coordinates.

For me, it was actually not clear at all that as_numpy() should also coerce the indexes. In my mind, it simply converts da.data to a numpy array, and in that case, IMO it should not touch the indexes.

benbovy commented 1 year ago

And in this case, IMO it should not touch the indices.

I mostly agree, but I'm not sure.

The behavior that would make the most sense to me in terms of clarity and predictability would be to either convert all the coordinates or not touch them at all. Practically, however, perhaps the best option is to convert all "pure" array data (i.e., data variables and unindexed coordinates, but not the indexed coordinates).
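That last option could be sketched roughly as follows; the helper name and approach are my assumptions, not xarray API: convert the data and any unindexed coordinates, and skip every coordinate that has an index.

```python
import numpy as np
import xarray as xr

def as_numpy_unindexed(da: xr.DataArray) -> xr.DataArray:
    # Hypothetical helper: coerce the data and the unindexed coordinates
    # to numpy, leaving indexed coordinates (and their indexes) alone.
    out = da.copy(data=np.asarray(da.data))
    for name, coord in da.coords.items():
        if name not in da.indexes:  # unindexed coordinate
            out = out.assign_coords({name: (coord.dims, np.asarray(coord.data))})
    return out

da = xr.DataArray(
    np.ones(2),
    dims="x",
    coords={"x": [0, 1], "label": ("x", ["a", "b"])},
)
out = as_numpy_unindexed(da)
print(type(out["label"].variable._data))  # plain numpy array; "x" index kept
```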