pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Dataset.to_dataframe() dimension order is not alphabetically sorted by default #9653

Closed mgunyho closed 1 month ago

mgunyho commented 1 month ago

What happened?

Hi, I noticed that the documentation for Dataset.to_dataframe() says that "by default, dimensions are sorted alphabetically". This is in contrast with DataArray.to_dataframe(), where the order is given by the order of the dimensions in the DataArray, which was discussed in this comment.

However, Dataset.to_dataframe() does not in fact sort the dimensions alphabetically, as this example on current main 8f6e45ba shows:

import xarray as xr
ds = xr.Dataset({
    "foo": xr.DataArray(0, coords=[("y", [1, 2, 3]), ("x", [4, 5, 6])]), 
})
print(ds.to_dataframe()) 

I get

     foo
y x     
1 4    0
  5    0
  6    0
2 4    0
  5    0
  6    0
3 4    0
  5    0
  6    0

What did you expect to happen?

The dimensions in the output should be sorted alphabetically, like this:

     foo
x y     
4 1    0
  2    0
  3    0
5 1    0
  2    0
  3    0
6 1    0
  2    0
  3    0
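
For readers who need the alphabetical layout shown above, it can be requested explicitly instead of relying on the default. The following is a minimal sketch, assuming the `dim_order` parameter of Dataset.to_dataframe() and assuming all dimension names are strings (so that `sorted` can compare them); the variable names `alphabetical` and `index_names` are only illustrative.

```python
import xarray as xr

ds = xr.Dataset({
    "foo": xr.DataArray(0, coords=[("y", [1, 2, 3]), ("x", [4, 5, 6])]),
})

# Sort the dimension names ourselves; this only works when the names
# are mutually comparable (for example, all strings).
alphabetical = sorted(ds.sizes)

# Pass the desired order explicitly rather than relying on the default.
df = ds.to_dataframe(dim_order=alphabetical)
index_names = list(df.index.names)

print(df)
```

Passing the order explicitly keeps the result independent of how the Dataset happened to be constructed.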

Minimal Complete Verifiable Example

See above

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.7 (main, Oct 1 2024, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)]
python-bits: 64
OS: Linux
OS-release: 6.11.3-200.fc40.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.9.1.dev73+g8f6e45ba
pandas: 2.2.3
numpy: 1.26.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 24.0
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

keewis commented 1 month ago

This looks like a documentation bug: we can't really sort non-string names alphabetically, so we should remove that claim instead. PRs welcome!
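
The point about non-string names can be illustrated in plain Python, without xarray. This is only a sketch: the list `dim_names` is a made-up set of hashable names (a string, an integer, and a tuple) standing in for dimension names, and it shows that Python's built-in `sorted` cannot order values of unrelated types.

```python
# Hypothetical dimension names: any hashable is allowed, not just strings.
dim_names = ["time", 0, ("lat", "lon")]

try:
    sorted(dim_names)
    sortable = True
except TypeError as err:
    # Python 3 refuses to compare values of unrelated types.
    sortable = False
    print(f"cannot sort mixed-type names: {err}")

# Names that are all strings sort without any problem.
string_names = sorted(["y", "x"])
print(sortable, string_names)
```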

mgunyho commented 1 month ago

Makes sense, I was also a bit surprised to find this inconsistent behavior discussed in that issue comment.

I suppose the correct wording would be something like "the dimensions are in the order in which they appear in the DataArrays in the dataset"? This seems to be the behavior, based on trying different orders of the dictionary elements in this example:

import xarray as xr

ds = xr.Dataset({
    "foo": xr.DataArray(coords=[("x", [1, 2, 3]), ("y", [1, 2, 3])]),
    "bar": xr.DataArray(coords=[("y", [1, 2, 3]), ("x", [1, 2, 3])]),
    "baz": xr.DataArray(coords=[("x", [1, 2, 3])]),
    "qux": xr.DataArray(coords=[("y", [1, 2, 3])]),
})

print(ds.to_dataframe())

shoyer commented 1 month ago

We used to sort dimension names in Dataset.dims, which in turn determined the order of the DataFrame index levels. This is no longer the case: https://github.com/pydata/xarray/pull/4753

So yes, this is definitely worthy of updating/fixing the documentation!

shoyer commented 1 month ago

> I suppose the correct wording would be something like "the dimensions are in the order in which they appear in the DataArrays in the dataset"? This seems to be the behavior, based on trying different orders of the dictionary elements in this example:

I would say: "Dimensions appear in the same order as in Dataset.sizes (which is also the order of appearance on variables)."
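
The proposed wording can be checked with a small experiment. This is a sketch rather than an authoritative test: the helper `make_dataset` and the variable names are illustrative, and the arrays are filled with zeros only so that the example is self-contained. It compares the DataFrame index level names against the key order of Dataset.sizes for two orderings of the same variables.

```python
import numpy as np
import xarray as xr


def make_dataset(variable_order):
    """Build a Dataset whose variables are inserted in the given order."""
    arrays = {
        # "foo" has dimensions (x, y); "bar" has dimensions (y, x).
        "foo": xr.DataArray(np.zeros((3, 3)), coords=[("x", [1, 2, 3]), ("y", [1, 2, 3])]),
        "bar": xr.DataArray(np.zeros((3, 3)), coords=[("y", [1, 2, 3]), ("x", [1, 2, 3])]),
    }
    return xr.Dataset({name: arrays[name] for name in variable_order})


results = []
for order in (["foo", "bar"], ["bar", "foo"]):
    ds = make_dataset(order)
    df = ds.to_dataframe()
    # Record the index level names next to the order reported by ds.sizes.
    results.append((list(df.index.names), list(ds.sizes)))
    print(order, "->", list(df.index.names), "sizes:", list(ds.sizes))
```

If the two lists printed for each ordering agree, the index levels follow Dataset.sizes rather than an alphabetical sort.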