pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

to_unstacked_dataset unable to reconstruct original dimensions #9541

Open aFarchi opened 2 hours ago

aFarchi commented 2 hours ago

What is your issue?

Hello,

I am trying to stack and then unstack a dataset. According to the documentation, the round trip should recover the original dataset, but that is not what I observe.

>>> import xarray as xr
>>> import numpy as np
>>> ds = xr.Dataset(
...     data_vars=dict(
...         var_a=(('sample', 'dim_a'), np.random.randn(2, 1)),
...         var_b=(('sample', 'dim_b'), np.random.randn(2, 4)),
...     ),
... )
>>> ds
<xarray.Dataset> Size: 80B
Dimensions:  (sample: 2, dim_a: 1, dim_b: 4)
Dimensions without coordinates: sample, dim_a, dim_b
Data variables:
    var_a    (sample, dim_a) float64 16B -0.5696 -0.8579
    var_b    (sample, dim_b) float64 64B 0.0585 -1.219 1.702 ... 1.244 0.7397

Stacking the dataset looks correct:

>>> stacked = ds.to_stacked_array('output_feature', sample_dims=('sample',))
>>> stacked
<xarray.DataArray 'var_a' (sample: 2, output_feature: 5)> Size: 80B
array([[-0.56958696,  0.058498  , -1.21899832,  1.70180735, -0.06674016],
       [-0.85787833,  1.86201164, -1.71474761,  1.24400992,  0.73965765]])
Coordinates:
  * output_feature  (output_feature) object 40B MultiIndex
  * variable        (output_feature) <U5 100B 'var_a' 'var_b' ... 'var_b'
  * dim_a           (output_feature) object 40B 0 nan nan nan nan
  * dim_b           (output_feature) object 40B nan 0 1 2 3
Dimensions without coordinates: sample

But unstacking seems incorrect:

>>> stacked.to_unstacked_dataset('output_feature')
<xarray.Dataset> Size: 176B
Dimensions:         (sample: 2, output_feature: 4)
Coordinates:
  * output_feature  (output_feature) object 32B MultiIndex
  * dim_a           (output_feature) object 32B nan nan nan nan
  * dim_b           (output_feature) object 32B 0 1 2 3
Dimensions without coordinates: sample
Data variables:
    var_a           (sample) float64 16B -0.5696 -0.8579
    var_b           (sample, output_feature) float64 64B 0.0585 ... 0.7397

I would expect var_a to have dimensions (sample, dim_a) and var_b to have (sample, dim_b). Instead, var_a has lost dim_a entirely and var_b is still indexed by output_feature.
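To make the mismatch easier to check across versions, here is a minimal sketch that compares the dimensions of each variable before and after the round trip. The helper name `roundtrip_dims` is hypothetical (it is not part of xarray); it only relies on `Dataset.to_stacked_array` and `DataArray.to_unstacked_dataset`, which are the two methods used above.

```python
import numpy as np
import xarray as xr


def roundtrip_dims(ds, new_dim="output_feature", sample_dims=("sample",)):
    """Hypothetical helper: map each variable name to a pair
    (original dims, dims after stack + unstack)."""
    stacked = ds.to_stacked_array(new_dim, sample_dims=sample_dims)
    unstacked = stacked.to_unstacked_dataset(new_dim)
    report = {}
    for name in ds.data_vars:
        # None marks a variable that did not survive the round trip.
        after = unstacked[name].dims if name in unstacked else None
        report[name] = (ds[name].dims, after)
    return report


rng = np.random.default_rng(0)
ds = xr.Dataset(
    data_vars=dict(
        var_a=(("sample", "dim_a"), rng.standard_normal((2, 1))),
        var_b=(("sample", "dim_b"), rng.standard_normal((2, 4))),
    ),
)
for name, (before, after) in roundtrip_dims(ds).items():
    print(name, before, "->", after)
```

If the round trip worked as documented, the two tuples printed for each variable would be identical.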

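The issue seems even worse when dim_a has more than one element, since both variables then keep the full output_feature dimension and are padded with NaN: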

>>> import xarray as xr
>>> import numpy as np
>>> ds = xr.Dataset(
...     data_vars=dict(
...         var_a=(('sample', 'dim_a'), np.random.randn(2, 2)),
...         var_b=(('sample', 'dim_b'), np.random.randn(2, 4)),
...     ),
... )
>>> stacked = ds.to_stacked_array('output_feature', sample_dims=('sample',))
>>> stacked.to_unstacked_dataset('output_feature', level=0)
<xarray.Dataset> Size: 336B
Dimensions:         (output_feature: 6, sample: 2)
Coordinates:
  * output_feature  (output_feature) object 48B MultiIndex
  * dim_a           (output_feature) object 48B 0 1 nan nan nan nan
  * dim_b           (output_feature) object 48B nan nan 0 1 2 3
Dimensions without coordinates: sample
Data variables:
    var_a           (sample, output_feature) float64 96B 0.6215 -1.72 ... nan
    var_b           (sample, output_feature) float64 96B nan nan ... 0.2421

Could it be related to the level argument of to_unstacked_dataset()?
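As a stopgap, the original layout can be rebuilt by slicing the stacked array positionally. The sketch below is only an assumption-laden workaround, not a fix for `to_unstacked_dataset`: the helper `manual_unstack` is hypothetical, it needs the original dataset as a template, and it assumes that each variable has the sample dimensions plus exactly one other dimension and that variables are stacked in `data_vars` order (which matches the coordinates printed above).

```python
import numpy as np
import xarray as xr


def manual_unstack(stacked, template, new_dim="output_feature",
                   sample_dims=("sample",)):
    """Hypothetical workaround: rebuild a dataset shaped like `template`
    by slicing `stacked` positionally along `new_dim`."""
    # Put the stacked dimension last so it can be sliced on the final axis.
    values = stacked.transpose(*sample_dims, new_dim).values
    rebuilt = {}
    offset = 0
    for name, var in template.data_vars.items():
        feature_dims = [d for d in var.dims if d not in sample_dims]
        if len(feature_dims) != 1:
            raise ValueError(f"{name!r}: expected exactly one non-sample dim")
        width = var.sizes[feature_dims[0]]
        # Assumption: variables occupy consecutive blocks in data_vars order.
        block = values[..., offset:offset + width]
        rebuilt[name] = ((*sample_dims, feature_dims[0]), block)
        offset += width
    return xr.Dataset(rebuilt)


rng = np.random.default_rng(0)
ds = xr.Dataset(
    data_vars=dict(
        var_a=(("sample", "dim_a"), rng.standard_normal((2, 2))),
        var_b=(("sample", "dim_b"), rng.standard_normal((2, 4))),
    ),
)
stacked = ds.to_stacked_array("output_feature", sample_dims=("sample",))
recovered = manual_unstack(stacked, ds)
print(recovered)
```

This drops any coordinates attached to the original dataset, so it is only useful for coordinate-free data like the examples in this report.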

Note that I used the latest released version for this test:

>>> xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:07:06) [Clang 17.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 74.1.2
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

welcome[bot] commented 2 hours ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!