pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.62k stars 1.08k forks source link

Can't unstack concatenated DataArrays #7076

Open DWesl opened 2 years ago

DWesl commented 2 years ago

What happened?

I had a collection of DataArrays with a stacked dimension (dimension whose corresponding index is a MultiIndex). I concatenated them into a single DataArray, then tried to unstack the stacked dimension, which failed. Performing the operations in the other order works (unstacking each DataArray, then concatenating the unstacked arrays).

What did you expect to happen?

I expected that concatenating the arrays then unstacking them would produce the same array as unstacking them then concatenating them, but with the possibility of saving the intermediate concatenated-but-still-stacked DataArray for later use as a template.

Minimal Complete Verifiable Example

import pandas as pd
import xarray
index = pd.MultiIndex.from_product([range(3), range(5)])
arr = xarray.DataArray.from_series(pd.Series(range(15), index=index)).stack(index0=["level_0", "level_1"])
arr.unstack("index0")

arr2 = xarray.concat([arr, arr], dim="index2")
arr2.unstack("index0")

MVCE confirmation

Relevant log output

<xarray.DataArray (level_0: 3, level_1: 5)>
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
Coordinates:
  * level_0  (level_0) int64 0 1 2
  * level_1  (level_1) int64 0 1 2 3 4

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.conda/envs/plotting/lib/python3.10/site-packages/xarray/core/dataarray.py", line 2402, in unstack
    ds = self._to_temp_dataset().unstack(dim, fill_value, sparse)
  File "~/.conda/envs/plotting/lib/python3.10/site-packages/xarray/core/dataset.py", line 4618, in unstack
    raise ValueError(
ValueError: cannot unstack dimensions that do not have exactly one multi-index: ('index0',)

Anything else we need to know?

The eventual problem to which I wish to apply the solution has two stacked dimensions rather than one, but that's likely irrelevant.

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1160.76.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 2022.6.0 pandas: 1.4.2 numpy: 1.22.3 scipy: 1.8.0 netCDF4: 1.6.0 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.5.1.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: 3.2.1.post0 bottleneck: 1.3.5 dask: 2022.7.1 distributed: 2022.7.1 matplotlib: 3.5.1 cartopy: 0.20.3 seaborn: 0.12.0 numbagg: None fsspec: 2022.5.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 61.3.1 pip: 22.0.4 conda: 4.14.0 pytest: 7.1.3 IPython: None sphinx: None
benbovy commented 2 years ago

Hi @DWesl,

Your example is working with the main branch, it has been fixed in #6889.

# after #6889
arr2.unstack("index0")
# <xarray.DataArray (index2: 2, level_0: 3, level_1: 5)>
# array([[[ 0,  1,  2,  3,  4],
#         [ 5,  6,  7,  8,  9],
#         [10, 11, 12, 13, 14]],
# 
#        [[ 0,  1,  2,  3,  4],
#         [ 5,  6,  7,  8,  9],
#         [10, 11, 12, 13, 14]]])
# Coordinates:
#   * level_0  (level_0) int64 0 1 2
#   * level_1  (level_1) int64 0 1 2 3 4
# Dimensions without coordinates: index2
DWesl commented 2 years ago

Fix confirmed, thank you.

ACHMartin commented 1 year ago

Hi @DWesl, @benbovy

I got a bug very similar to what was initially report with the following minimal example (xarray version 2022.11.0) with the error:

ValueError: cannot unstack dimensions that do not have exactly one multi-index: ('z',)

It is working fine with my previous version (xarray 0.16.0)

import xarray as xr
ds = xr.Dataset(
    data_vars=dict(
        mydata = ( ['across', 'along'], [[0,1],[2,3]])
    )
)
stacked = ds.stack(z=("across", "along"))
newlist = [None] * stacked.z.size
for ii, zindex in enumerate(stacked.z.data):
    newlist[ii] = stacked.mydata.sel(z=zindex)
newds = xr.concat(newlist, dim='z')
newds['z'] = stacked.z
print('xarray version:' + xr.__version__)
newds.unstack(dim='z')

If I do newds = newds.reset_index('z') just before the last line, it is solved for my minimal example. But it still doesn't work for my real problem and I don't think it is the best way to proceed.

Many thanks for your help, Adrien

benbovy commented 1 year ago

@ACHMartin the issue is when you do newds['z'] = stacked.z. In the last versions of Xarray multi-index levels have each their own (real) coordinates, for consistency and clarity we soon won't support assigning a multi-index to a single coordinate of a Dataset / DataArray like that.

I think that in other places we still do support it with a deprecation notice, but apparently in your example this is not the case. unstack doesn't work because the multi-index(es) and the coordinates of newds are not consistent.

I don't know exactly what is your real problem, but from now on you should avoid implicitly assign a multi-index with xr_obj["my_coord"] = ... or xr_obj.assign(my_coord=...). Instead you should re-create the multi-index, e.g., in your minimal example newds = newds.set_index(z=["across", "along"]).

ACHMartin commented 1 year ago

Many thanks for the explanation and the rapid answer. I am not very familiar with multi-index.

My real problem works with a DataSet with different sort of variables:

ipdb> lmmap
<xarray.Dataset>
Dimensions:        (x_variables: 4, Observables: 2, Antenna: 4, z: 4,
                    Ambiguities: 4, fun_variables: 8, extrema: 2)
Coordinates:
  * x_variables    (x_variables) <U3 'u' 'v' 'c_u' 'c_v'
  * Observables    (Observables) <U6 'sigma0' 'RVL'
  * Antenna        (Antenna) <U4 'Fore' 'MidV' 'MidH' 'Aft'
    fun_variables  (Observables, Antenna) int64 0 1 2 3 4 5 6 7
Dimensions without coordinates: z, Ambiguities, extrema
Data variables: (12/17)
    x              (z, Ambiguities, x_variables) float64 -4.503 7.8 ... -0.4265
    active_mask    (z, Ambiguities, x_variables) int64 0 0 0 0 0 0 ... 0 0 0 0 0
    grad           (z, Ambiguities, x_variables) float64 7.816e-11 ... -1.873...
    fun            (z, Ambiguities, Observables, Antenna) float64 0.001137 .....
    jac            (z, Ambiguities, fun_variables, x_variables) float64 0.359...
    x0             (z, Ambiguities, x_variables) float64 -7.002 10.03 ... 1.219
    ...             ...
    message        (z, Ambiguities) <U42 '`xtol` termination condition is sat...
    success        (z, Ambiguities) bool True True True True ... True True True
    method         (z) <U3 'trf' 'trf' 'trf' 'trf'
    xtol           (z) float64 0.001 0.001 0.001 0.001
    x_scale        (z, x_variables) float64 7.0 7.0 0.5 0.5 ... 7.0 7.0 0.5 0.5
    bounds         (z, extrema, x_variables) int64 -30 -30 -5 -5 ... 30 30 5 5

I don't know where I should apply set_index(z=["across", "along"]) I tried on the full DataSet or a single DataArray but without success. I get the following error:

ValueError: across, along variable(s) do not exist

Which is obvious as across and along variables disappeared. Thank you, I will dig in this direction why it happens.

However, it works perfectly with the minimal example. Many thanks

dcherian commented 1 year ago

the issue is when you do newds['z'] = stacked.z

Are we raising a warning here?

ACHMartin commented 1 year ago

I didn't see any warning for this

ACHMartin commented 1 year ago

Many thanks for the explanation and the rapid answer. I am not very familiar with multi-index.

My real problem works with a DataSet with different sort of variables:

ipdb> lmmap
<xarray.Dataset>
Dimensions:        (x_variables: 4, Observables: 2, Antenna: 4, z: 4,
                    Ambiguities: 4, fun_variables: 8, extrema: 2)
Coordinates:
  * x_variables    (x_variables) <U3 'u' 'v' 'c_u' 'c_v'
  * Observables    (Observables) <U6 'sigma0' 'RVL'
  * Antenna        (Antenna) <U4 'Fore' 'MidV' 'MidH' 'Aft'
    fun_variables  (Observables, Antenna) int64 0 1 2 3 4 5 6 7
Dimensions without coordinates: z, Ambiguities, extrema
Data variables: (12/17)
    x              (z, Ambiguities, x_variables) float64 -4.503 7.8 ... -0.4265
    active_mask    (z, Ambiguities, x_variables) int64 0 0 0 0 0 0 ... 0 0 0 0 0
    grad           (z, Ambiguities, x_variables) float64 7.816e-11 ... -1.873...
    fun            (z, Ambiguities, Observables, Antenna) float64 0.001137 .....
    jac            (z, Ambiguities, fun_variables, x_variables) float64 0.359...
    x0             (z, Ambiguities, x_variables) float64 -7.002 10.03 ... 1.219
    ...             ...
    message        (z, Ambiguities) <U42 '`xtol` termination condition is sat...
    success        (z, Ambiguities) bool True True True True ... True True True
    method         (z) <U3 'trf' 'trf' 'trf' 'trf'
    xtol           (z) float64 0.001 0.001 0.001 0.001
    x_scale        (z, x_variables) float64 7.0 7.0 0.5 0.5 ... 7.0 7.0 0.5 0.5
    bounds         (z, extrema, x_variables) int64 -30 -30 -5 -5 ... 30 30 5 5

I don't know where I should apply set_index(z=["across", "along"]) I tried on the full DataSet or a single DataArray but without success. I get the following error:

ValueError: across, along variable(s) do not exist

Which is obvious as across and along variables disappeared. Thank you, I will dig in this direction why it happens.

However, it works perfectly with the minimal example. Many thanks

Thanks @benbovy, my issue is now resolved, once I have enabled the coords to follow.

ACHMartin commented 1 year ago

Hi @benbovy, my code works for 2 dimensions, but it doesn't with 1D. I think the error comes from .set_index(z=['across']) which delivers a "PandasIndex(Int64Index(" and not the expected "PandasIndex(MultiIndex(" for unstack.

I created a Pandas-MultiIndex "MultiIndex([((0,), (1,))], names=['z', 'across'])" (see code just below), but it is not exactly the same one as the one created by .stack() "PandasIndex(MultiIndex([(0,),(1,)], name='z'))" and I didn't find out how to pass the 'index' to xarray.

import pandas as pd
tuples = [(0,),(1,)],
index = pd.MultiIndex.from_tuples(tuples, names=["z","across"])
index

My modified minimal example is as below:

import xarray as xr
ds = xr.Dataset(
    data_vars=dict(
        mydata = ( ['across', 'along'], [[0,1],[2,3]])
    ),
    coords=dict( # <- changed 
        across=[0,1], # <- changed 
        along=[0,1], # <- changed 
    ),    # <- changed
)
stacked = ds.isel(along=0).stack(z=["across"]) # <- changed
newlist = [None] * stacked.z.size
for ii, zindex in enumerate(stacked.z.data):
    newlist[ii] = stacked.mydata.sel(z=zindex)
newds = xr.concat(newlist, dim='z')
newds = newds.set_index(z=['across']) # <- changed
newds.unstack(dim='z') # Fail with ValueError: cannot unstack dimensions that do not have exactly one multi-index: ('z',)
max-sixty commented 1 month ago

Is the underlying cause of this a dupe of https://github.com/pydata/xarray/issues/7148?

benbovy commented 1 month ago

I don't think it is a duplicate.

I don't know what @ACHMartin's last minimal example is trying to do, but now the best way to assign a pandas multi-index to a Xarray object is like this:

index = pd.MultiIndex.from_tuples(..., names=["across"])
coords = xr.Coordinates.from_pandas_multiindex(index, dim="z")
ds = xr.Dataset(coords=coords)

It looks like ds.set_index(z=["across"]) demotes to a single index in the case when a single item list is passed. I'm not sure what was the old behavior (a while ago... before we switched to calendar versioning) and whether or not it should be considered as a bug, though.

Regarding assignment of a new coordinate to an existing multi-index coordinate (also reported here above), if we still don't raise an error or a warning we should probably do so.