vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] `describe_categorical` in interchange columns is a tuple, not a dict #2113

Closed honno closed 2 years ago

honno commented 2 years ago

In the interchange protocol, `describe_categorical` should return a dict (note that the type annotation in the spec's API is faulty), but Vaex returns a tuple:

https://github.com/vaexio/vaex/blob/35c250d585f889272b8ef1096de6fa5462816f52/packages/vaex-core/vaex/dataframe_protocol.py#L443

This prevents interchanging dataframes that have categorical columns, e.g. with pandas (https://github.com/pandas-dev/pandas/pull/46141):

>>> import numpy as np
>>> import vaex
>>> df = vaex.from_items(("foo", np.asarray([4, 2, 1, 3, 3], dtype="int8")))
>>> df = df.categorize("foo")
>>> from pandas.api.exchange import from_dataframe
>>> from_dataframe(df)
.../pandas/core/exchange/from_dataframe.py:184, in categorical_column_to_series(col)
    169 """
    170 Convert a column holding categorical data to a pandas Series.
    171 
   (...)
    180     that keeps the memory alive.
    181 """
    182 categorical = col.describe_categorical
--> 184 if not categorical["is_dictionary"]:
    185     raise NotImplementedError("Non-dictionary categoricals not supported yet")
    187 mapping = categorical["mapping"]
TypeError: tuple indices must be integers or slices, not str
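
For illustration, here is a minimal sketch of why a consumer breaks on the tuple and what a dict-shaped result would look like. The keys `is_ordered`, `is_dictionary` and `mapping` follow the protocol description. The helper `as_categorical_dict`, the assumed tuple order `(is_ordered, is_dictionary, mapping)` and the sample mapping values are hypothetical and are not taken from the Vaex source.

```python
from typing import Any, Dict, Tuple, Union

CategoricalDescription = Dict[str, Any]


def as_categorical_dict(
    desc: Union[CategoricalDescription, Tuple[bool, bool, Any]],
) -> CategoricalDescription:
    """Hypothetical helper: normalise a categorical description to a dict."""
    if isinstance(desc, dict):
        # Already in the shape the protocol describes.
        return desc
    # ASSUMPTION: the tuple is ordered (is_ordered, is_dictionary, mapping).
    is_ordered, is_dictionary, mapping = desc
    return {
        "is_ordered": is_ordered,
        "is_dictionary": is_dictionary,
        "mapping": mapping,
    }


# Illustrative values only, not the real output of Vaex.
tuple_style = (False, True, {0: 1, 1: 2, 2: 3, 3: 4})

# Consumers index by key, which is what fails on a tuple.
try:
    tuple_style["is_dictionary"]
except TypeError as exc:
    error_message = str(exc)

# With a dict, the same lookups succeed.
dict_style = as_categorical_dict(tuple_style)
is_dictionary = dict_style["is_dictionary"]
mapping = dict_style["mapping"]
```

The fix on the producer side is presumably to return the dict directly from `describe_categorical`, so that no such shim is needed in consumers.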

honno commented 2 years ago

I'll submit a PR for this