pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Failure in pandas TestDataFrameToXArray.test_to_xarray_index_types #9661

Open shoyer opened 2 hours ago

shoyer commented 2 hours ago

It appears that #9520 may have broken some upstream pandas tests, specifically those that test round-trips with various index types: https://github.com/pandas-dev/pandas/blob/e78ebd3f845c086af1d71c0604701ec49df97228/pandas/tests/generic/test_to_xarray.py#L32

Here's a minimal test case:

import pandas as pd
import numpy as np

cat = pd.Categorical(list("abcd"))
df = pd.DataFrame({"f": cat}, index=cat)
restored = df.to_xarray().to_dataframe()
print(restored.index)  # Index(['a', 'b', 'c', 'd'], dtype='object', name='index')
print(df.index)  # CategoricalIndex(['a', 'b', 'c', 'd'], categories=['a', 'b', 'c', 'd'], ordered=False, dtype='category')

I'm not sure if this is a pandas or xarray issue, but it's one or the other!

(My guess is that most of these tests in pandas should probably live in xarray instead, given that we implement all the conversion logic.)
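As a rough illustration of the kind of round-trip check that could live in either test suite, here is a minimal sketch. The helper `index_is_categorical_after` and the stand-in `lossy_round_trip` are hypothetical names introduced only for this example; the stand-in merely mimics the behaviour reported above (the index coming back with object dtype) so that the snippet runs without xarray installed.

```python
import pandas as pd


def index_is_categorical_after(df, round_trip):
    """Return True if the index is still categorical after round_trip(df)."""
    restored = round_trip(df)
    return isinstance(restored.index.dtype, pd.CategoricalDtype)


def lossy_round_trip(frame):
    # Hypothetical stand-in, not xarray code: it rebuilds the index with
    # plain object dtype, which is the symptom shown in the minimal test case.
    out = frame.copy()
    out.index = pd.Index(list(out.index), dtype=object)
    return out


cat = pd.Categorical(list("abcd"))
df = pd.DataFrame({"f": cat}, index=cat)

# A plain copy keeps the CategoricalIndex; the lossy stand-in does not.
preserved_by_copy = index_is_categorical_after(df, lambda d: d.copy())
preserved_by_lossy = index_is_categorical_after(df, lossy_round_trip)

# With xarray installed, the real check would pass
# `lambda d: d.to_xarray().to_dataframe()` as the round_trip argument.
```

Taking the round trip as a parameter keeps the check independent of which library ends up owning the conversion logic.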

Originally posted by @shoyer in https://github.com/pydata/xarray/issues/9520#issuecomment-2386077534

shoyer commented 2 hours ago

Here's the error message from pandas's TestDataFrameToXArray.test_to_xarray_index_types[string]:

AssertionError: Attributes of DataFrame.iloc[:, 5] (column name="f") are different

Attribute "dtype" are different
[left]:  CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False, categories_dtype=object)
[right]: object
self = <pandas.tests.generic.test_to_xarray.TestDataFrameToXArray object at 0x13d4fa7cbe90>
index_flat = Index(['pandas_0', 'pandas_1', 'pandas_2', 'pandas_3', 'pandas_4', 'pandas_5',
       'pandas_6', 'pandas_7', 'pandas_...pandas_93', 'pandas_94', 'pandas_95',
       'pandas_96', 'pandas_97', 'pandas_98', 'pandas_99'],
      dtype='object')
df = bar       a  b  c    d      e  f          g                         h
foo                                             ....0   True  c 2013-01-03 2013-01-03 00:00:00-05:00
pandas_3  d  4  6  7.0  False  d 2013-01-04 2013-01-04 00:00:00-05:00
using_infer_string = False

    def test_to_xarray_index_types(self, index_flat, df, using_infer_string):
        index = index_flat
        # MultiIndex is tested in test_to_xarray_with_multiindex
        if len(index) == 0:
            pytest.skip("Test doesn't make sense for empty index")

        from xarray import Dataset

        df.index = index[:4]
        df.index.name = "foo"
        df.columns.name = "bar"
        result = df.to_xarray()
        assert result.sizes["foo"] == 4
        assert len(result.coords) == 1
        assert len(result.data_vars) == 8
        tm.assert_almost_equal(list(result.coords.keys()), ["foo"])
        assert isinstance(result, Dataset)

        # idempotency
        # datetimes w/tz are preserved
        # column names are lost
        expected = df.copy()
        expected["f"] = expected["f"].astype(
            object if not using_infer_string else "string[pyarrow_numpy]"
        )
        expected.columns.name = None
>       tm.assert_frame_equal(result.to_dataframe(), expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 5] (column name="f") are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False, categories_dtype=object)
E       [right]: object

tests/generic/test_to_xarray.py:58: AssertionError
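The failure above comes down to the test expecting column "f" to come back as object dtype while the round-tripped frame now keeps it categorical. The following pandas-only sketch reproduces that shape of mismatch in isolation. It does not call xarray at all: `left` is simply a hypothetical stand-in for what `result.to_dataframe()` now returns, and `expected` mirrors the `astype(object)` cast in the test.

```python
import pandas as pd

# Stand-in for the round-tripped frame: column "f" is still categorical.
left = pd.DataFrame({"f": pd.Categorical(list("abcd"))})

# Mirrors the test's expectation: the categorical column cast to object.
expected = left.copy()
expected["f"] = expected["f"].astype(object)

try:
    pd.testing.assert_frame_equal(left, expected)
except AssertionError as err:
    # The values match, but the dtype attribute of column "f" differs.
    dtype_check_failed = True
    message = str(err)
else:
    dtype_check_failed = False
    message = ""

print(dtype_check_failed)
print(message)
```

If preserving the categorical dtype is the intended xarray behaviour, the pandas test's expectation would be the thing to update; if not, the conversion would need to restore the old cast. The sketch takes no position on which.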

shoyer commented 2 hours ago

cc @ilan-gold