Open tomalrussell opened 4 years ago
Thanks for the issue @tomalrussell, that's very clear.
I agree; that's a counter-intuitive result.
Confusingly, passing it into xr.Dataset directly does as you expect:
In [7]: xr.Dataset(df_multi)
Out[7]:
<xarray.Dataset>
Dimensions: (dim_0: 3)
Coordinates:
* dim_0 (dim_0) MultiIndex
- test_multi (dim_0) object 'b' 'a' 'c'
Data variables:
test (dim_0) int64 1 2 3
I think this is because we attempt to take the dataframe as a sparse representation and convert it to a dense xarray representation, which is clearer if there's an actual multi-level MultiIndex. Note that one representation has shape (6,), and the other is two-dimensional, with dimensions of size 3 and 2:
In [9]: index_multi = pandas.MultiIndex.from_product(
...: [['b', 'a', 'c'], [1,2]],
...: names=['test_multi', 'test_2']
...: )
...: df_multi = pandas.DataFrame({'test': [1,2,3,4,5,6]}, index=index_multi)
...:
In [10]: df_multi
Out[10]:
test
test_multi test_2
b 1 1
2 2
a 1 3
2 4
c 1 5
2 6
In [11]: xr_multi = xarray.Dataset.from_dataframe(df_multi)
...:
In [12]: xr_multi
Out[12]:
<xarray.Dataset>
Dimensions: (test_2: 2, test_multi: 3)
Coordinates:
* test_multi (test_multi) object 'a' 'b' 'c'
* test_2 (test_2) int64 1 2
Data variables:
test (test_multi, test_2) int64 3 4 1 2 5 6
In [13]: xr.Dataset(df_multi)
Out[13]:
<xarray.Dataset>
Dimensions: (dim_0: 6)
Coordinates:
* dim_0 (dim_0) MultiIndex
- test_multi (dim_0) object 'b' 'b' 'a' 'a' 'c' 'c'
- test_2 (dim_0) int64 1 2 1 2 1 2
Data variables:
test (dim_0) int64 1 2 3 4 5 6
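To make the long-versus-dense distinction concrete without involving xarray, here is a minimal pandas-only sketch. It uses `unstack` as a stand-in for the dense conversion; that is an assumption made for illustration, not a description of how `from_dataframe` is implemented. The row order of the dense form may differ from the long form, so the checks below look values up by label rather than by position.

```python
import pandas as pd

# Same frame as in the session above: 3 labels x 2 sub-labels, in long form.
index_multi = pd.MultiIndex.from_product(
    [["b", "a", "c"], [1, 2]],
    names=["test_multi", "test_2"],
)
df_multi = pd.DataFrame({"test": [1, 2, 3, 4, 5, 6]}, index=index_multi)
long_series = df_multi["test"]

# Long form: one row per (test_multi, test_2) pair.
long_shape = long_series.shape

# Dense form: one axis per index level (illustrative stand-in only).
dense = long_series.unstack("test_2")
dense_shape = dense.shape

# Label-based lookups agree between the two forms, whatever the row order.
value_long = long_series.loc[("b", 1)]
value_dense = dense.loc["b", 1]
print(long_shape, dense_shape, value_long, value_dense)
```

Looking values up by label is the safe way to compare the two forms, since it does not depend on whether either conversion sorted the labels.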
I think there are reasons to do each, and potentially unavoidable sorting with the former. But the methods should be explicit; and (at first glance after a while looking at this) it's currently unclear how to map the method names to the behavior.
Any thoughts from others?
While it might not solve this issue, one thing we could do to make these operations a bit more explicit is to allow for passing a list of dimensions to be on the index & columns. I'm often running to_dataframe, seeing what comes out, and then stacking or transposing as needed.
This might be a bit of a corner case - or misunderstanding of how to use the relevant methods - but it tripped me up when working with arbitrary-dimension dataframes/datasets and setting up a pandas.MultiIndex.from_product then calling to_xarray on the resultant dataframe.
The case here is one-dimensional data (i.e. a single-level MultiIndex) with out-of-order index/coordinates labels.
MCVE Code Sample
Expected Output
I would expect the assertions to pass - either the coordinates labels not to be sorted, or the data to be reordered to match. Similar examples work fine with a simple Index or two-level MultiIndex:
Problem Description
Creating a Dataset from a DataFrame with a single-level MultiIndex (where the labels are not in alphabetical order) results in the index/coordinates labels being sorted, but the data values are not reordered to match.
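The mismatch can be sketched in plain Python, independent of pandas or xarray. This is only an illustration of the symptom described above, using the labels and values from the single-level example in this thread; the variable names are hypothetical and it is not a reproduction of the library's internals.

```python
labels = ["b", "a", "c"]
values = [1, 2, 3]
original = dict(zip(labels, values))

# Permutation that sorts the labels alphabetically.
order = sorted(range(len(labels)), key=labels.__getitem__)
sorted_labels = [labels[i] for i in order]

# Symptom: labels are sorted, but the values keep their original order,
# so each label now points at the wrong value.
mismatched = dict(zip(sorted_labels, values))

# Consistent alternative: apply the same permutation to the values.
reordered_values = [values[i] for i in order]
consistent = dict(zip(sorted_labels, reordered_values))

print(mismatched == original)  # label-to-value mapping is broken
print(consistent == original)  # label-to-value mapping is preserved
```

Either leaving the labels unsorted or permuting the data with the same order keeps the mapping intact, which are the two outcomes the expected-output section asks for.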
Output of xr.show_versions()