pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Calling Dataset.from_dataframe with single-level MultiIndex incorrectly orders data #3798

Open tomalrussell opened 4 years ago

tomalrussell commented 4 years ago

This might be a bit of a corner case, or a misunderstanding of how to use the relevant methods, but it tripped me up when working with arbitrary-dimension dataframes/datasets: I set up a pandas.MultiIndex.from_product and then called to_xarray on the resulting dataframe.

The case here is one-dimensional data (i.e. a single-level MultiIndex) with out-of-order index/coordinate labels.

MCVE Code Sample

import xarray
import pandas

#
# Create a DataFrame with a single-level MultiIndex, where the labels are not in alphabetical
# order
#
index_multi = pandas.MultiIndex.from_product(
    [['b', 'a', 'c']], 
    names=['test_multi']
)
df_multi = pandas.DataFrame({'test': [1,2,3]}, index=index_multi)

print(df_multi)
#             test
# test_multi
# b              1
# a              2
# c              3

# Convert to Dataset
xr_multi = xarray.Dataset.from_dataframe(df_multi)

#
# The index values have been sorted, but the data values have not been matched
#
print(xr_multi)
# <xarray.Dataset>
# Dimensions:     (test_multi: 3)
# Coordinates:
#   * test_multi  (test_multi) object 'a' 'b' 'c'
# Data variables:
#     test        (test_multi) int64 1 2 3

assert xr_multi.test.sel(test_multi='a').data == 2
assert xr_multi.test.sel(test_multi='b').data == 1
assert xr_multi.test.sel(test_multi='c').data == 3

Expected Output

I would expect the assertions to pass: either the coordinate labels should not be sorted, or the data should be reordered to match. Similar examples work fine with a simple Index or a two-level MultiIndex:

#
# For reference, the desired behaviour with a simple Index
#
index_simple = pandas.Index(
    ['b', 'a', 'c'], 
    name='test_simple'
)
df_simple = pandas.DataFrame({'test': [1,2,3]}, index=index_simple)

print(df_simple)
#              test
# test_simple
# b               1
# a               2
# c               3

xr_simple = xarray.Dataset.from_dataframe(df_simple)
print(xr_simple)
# <xarray.Dataset>
# Dimensions:      (test_simple: 3)
# Coordinates:
#   * test_simple  (test_simple) object 'b' 'a' 'c'
# Data variables:
#     test         (test_simple) int64 1 2 3

assert xr_simple.test.sel(test_simple='a').data == 2
assert xr_simple.test.sel(test_simple='b').data == 1
assert xr_simple.test.sel(test_simple='c').data == 3

#
# For reference, the desired behavior with a two-level MultiIndex
#
index_multi2 = pandas.MultiIndex.from_tuples(
    [('b', 'b'), ('a', 'a'), ('c', 'c')], 
    names=['test_multi1', 'test_multi2']
)
df_multi2 = pandas.DataFrame({'test': [1,2,3]}, index=index_multi2)

print(df_multi2)
#                          test
# test_multi1 test_multi2
# b           b               1
# a           a               2
# c           c               3

# Convert to Dataset
xr_multi2 = xarray.Dataset.from_dataframe(df_multi2)

#
# The index values have been sorted, and data is reordered and filled out with nans
#
print(xr_multi2)
# <xarray.Dataset>
# Dimensions:      (test_multi1: 3, test_multi2: 3)
# Coordinates:
#   * test_multi1  (test_multi1) object 'a' 'b' 'c'
#   * test_multi2  (test_multi2) object 'a' 'b' 'c'
# Data variables:
#     test         (test_multi1, test_multi2) float64 2.0 nan nan ... nan nan 3.0

assert xr_multi2.test.sel(test_multi1='a', test_multi2='a').data == 2
assert xr_multi2.test.sel(test_multi1='b', test_multi2='b').data == 1
assert xr_multi2.test.sel(test_multi1='c', test_multi2='c').data == 3

Problem Description

Creating a Dataset from a DataFrame with a single-level MultiIndex (where the labels are not in alphabetical order) results in the index/coordinate labels being sorted, but the data values are not reordered to match.
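
A possible workaround, sketched under the assumption that a single-level MultiIndex can safely be replaced by a plain Index (which is the case that works in the reference example above). The helper name flatten_single_level is hypothetical and not part of pandas or xarray; get_level_values is a regular pandas MultiIndex method.

```python
import pandas
import xarray


def flatten_single_level(df):
    """Hypothetical helper: swap a single-level MultiIndex for a plain Index."""
    if isinstance(df.index, pandas.MultiIndex) and df.index.nlevels == 1:
        df = df.copy()
        # get_level_values(0) returns a plain Index that keeps the level name
        df.index = df.index.get_level_values(0)
    return df


index_multi = pandas.MultiIndex.from_product(
    [['b', 'a', 'c']],
    names=['test_multi']
)
df_multi = pandas.DataFrame({'test': [1, 2, 3]}, index=index_multi)

# Convert only after flattening, so labels and values stay paired
xr_flat = xarray.Dataset.from_dataframe(flatten_single_level(df_multi))

assert xr_flat.test.sel(test_multi='a').item() == 2
assert xr_flat.test.sel(test_multi='b').item() == 1
assert xr_flat.test.sel(test_multi='c').item() == 3
```

This sidesteps the MultiIndex code path rather than fixing it, so it only helps when the index really has one level.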

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.1 | packaged by conda-forge | (default, Jan 29 2020, 14:24:10) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: English_United Kingdom.1252
libhdf5: None
libnetcdf: None

xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.2.0rc3
cartopy: None
seaborn: None
numbagg: None
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: None
pytest: 5.3.5
IPython: None
sphinx: None

max-sixty commented 4 years ago

Thanks for the issue @tomalrussell, that's very clear.

I agree; that's a counter-intuitive result.

Confusingly, passing it into xr.Dataset directly does as you expect:

In [7]: xr.Dataset(df_multi)
Out[7]:
<xarray.Dataset>
Dimensions:     (dim_0: 3)
Coordinates:
  * dim_0       (dim_0) MultiIndex
  - test_multi  (dim_0) object 'b' 'a' 'c'
Data variables:
    test        (dim_0) int64 1 2 3

I think this is because we attempt to take the dataframe as a sparse representation and convert it to a dense xarray representation. That is clearer with an actual multi-level MultiIndex; note that the from_dataframe result has shape (3, 2), while the xr.Dataset(...) result has shape (6,):

In [9]: index_multi = pandas.MultiIndex.from_product(
   ...:     [['b', 'a', 'c'], [1,2]],
   ...:     names=['test_multi', 'test_2']
   ...: )
   ...: df_multi = pandas.DataFrame({'test': [1,2,3,4,5,6]}, index=index_multi)
   ...:

In [10]: df_multi
Out[10]:
                   test
test_multi test_2      
b          1          1
           2          2
a          1          3
           2          4
c          1          5
           2          6

In [11]: xr_multi = xarray.Dataset.from_dataframe(df_multi)
    ...:

In [12]: xr_multi
Out[12]:
<xarray.Dataset>
Dimensions:     (test_2: 2, test_multi: 3)
Coordinates:
  * test_multi  (test_multi) object 'a' 'b' 'c'
  * test_2      (test_2) int64 1 2
Data variables:
    test        (test_multi, test_2) int64 3 4 1 2 5 6

In [13]: xr.Dataset(df_multi)
Out[13]:
<xarray.Dataset>
Dimensions:     (dim_0: 6)
Coordinates:
  * dim_0       (dim_0) MultiIndex
  - test_multi  (dim_0) object 'b' 'b' 'a' 'a' 'c' 'c'
  - test_2      (dim_0) int64 1 2 1 2 1 2
Data variables:
    test        (dim_0) int64 1 2 3 4 5 6
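
The sparse-versus-dense distinction above can be sketched with pandas alone. This is only an analogy, not xarray's internal implementation: the long form keeps one row per label pair in the original order, while unstack produces one axis per index level with sorted labels, which matches the values shown in Out[12].

```python
import pandas

index_multi = pandas.MultiIndex.from_product(
    [['b', 'a', 'c'], [1, 2]],
    names=['test_multi', 'test_2'],
)
df_multi = pandas.DataFrame({'test': [1, 2, 3, 4, 5, 6]}, index=index_multi)

# Sparse/long form: one row per (test_multi, test_2) pair, original order kept
long_shape = df_multi['test'].shape

# Dense/wide form: rows are test_multi, columns are test_2; labels come out sorted
dense = df_multi['test'].unstack('test_2')

print(long_shape)
print(dense.shape)
print(list(dense.index))
```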

I think there are reasons to do each, and sorting is potentially unavoidable with the former. But the methods should be explicit, and (at first glance, after a while looking at this) it's currently unclear how the method names map to the behavior.

Any thoughts from others?

max-sixty commented 4 years ago

While it might not solve this issue, one thing we could do to make these operations a bit more explicit is to allow passing a list of dimensions to go on the index and on the columns. I'm often running to_dataframe, seeing what comes out, and then stacking or transposing as needed.
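
The to_dataframe-then-reshape workflow described above might look like the following sketch. The dataset contents are made up to mirror the earlier example, and the reshaping uses plain pandas unstack and transpose; it does not show the proposed list-of-dimensions argument, which does not exist in this form.

```python
import numpy
import xarray

# Illustrative dataset mirroring the dense result shown earlier in the thread
ds = xarray.Dataset(
    {'test': (('test_multi', 'test_2'), numpy.array([[3, 4], [1, 2], [5, 6]]))},
    coords={'test_multi': ['a', 'b', 'c'], 'test_2': [1, 2]},
)

# Long form: one row per (test_multi, test_2) combination, with a MultiIndex
df_long = ds.to_dataframe()

# Choose explicitly which dimension goes on the columns
df_wide = df_long['test'].unstack('test_2')

# Transpose if the other orientation is wanted
df_wide_t = df_wide.T

print(df_wide)
```

Naming the level in unstack avoids depending on the order in which to_dataframe places the index levels.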