vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Interchange `Column.dtype` can return `kind` as `int`, not an `IntEnum` #2118

Open honno opened 2 years ago

honno commented 2 years ago

For string columns, vaex's interchange column returns a dtype tuple whose kind (first element) is a plain int, as opposed to an IntEnum (as specified by the interchange protocol).

>>> df1 = vaex.from_dict({"foo": ["bar"]})
>>> interchange_df1 = df1.__dataframe__()
>>> interchange_col1 = interchange_df1.get_column_by_name("foo")
>>> kind1, *_ = interchange_col1.dtype
>>> from enum import IntEnum
>>> isinstance(kind1, IntEnum)
False  # should be True
>>> kind1
21

This seems to be because the following code assigns kind as the int value, rather than the respective enum.

https://github.com/vaexio/vaex/blob/35c250d585f889272b8ef1096de6fa5462816f52/packages/vaex-core/vaex/dataframe_protocol.py#L393-L399
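A rough sketch of the kind of change that would address this, assuming the fix is simply to wrap the raw int in the enum before returning it. The `DtypeKind` class below is a local stand-in using the kind values from the interchange protocol spec, and `normalize_dtype` is a hypothetical helper, not vaex's actual code (the linked lines are not reproduced here). The example tuple's bit width, format string and endianness are illustrative only.

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    # Stand-in for the protocol's kind enum; STRING = 21 and
    # CATEGORICAL = 23 match the values shown in this report.
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


def normalize_dtype(dtype):
    """Hypothetical helper: coerce the kind (first element) to DtypeKind.

    Calling DtypeKind(...) on an existing member returns that member,
    so this is safe for dtypes that are already correct.
    """
    kind, *rest = dtype
    return (DtypeKind(kind), *rest)


# Illustrative dtype tuple with a bare int kind, as in the string case.
fixed = normalize_dtype((21, 8, "u", "="))
print(isinstance(fixed[0], IntEnum), fixed[0].name)
```

A consumer could apply the same coercion defensively on its side until the producer returns the enum member itself.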

This affects interop in practice, as libraries might use the `value` property of an enum, which of course doesn't exist for ints.
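To illustrate the failure mode without needing vaex installed, here is a minimal sketch. `Kind` is a cut-down stand-in for the protocol's enum and `kind_value` is a hypothetical consumer function; neither comes from a real library.

```python
from enum import IntEnum


class Kind(IntEnum):
    # Cut-down stand-in for the protocol's kind enum.
    STRING = 21
    CATEGORICAL = 23


def kind_value(kind):
    """Hypothetical consumer code that expects an enum member."""
    return kind.value


# Works when the producer returns an enum member.
ok = kind_value(Kind.STRING)

# Fails when the producer returns a bare int, even though the
# int compares equal to the enum member.
try:
    kind_value(21)
    failed = False
except AttributeError:
    failed = True

print(ok, failed, 21 == Kind.STRING)
```

Because an `IntEnum` member compares equal to its int value, equality checks on the kind still pass, so the problem only surfaces in consumers that touch enum-only attributes such as `value` or `name`.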

Interestingly, there seems to be a scenario with categorical columns that can also return int as opposed to IntEnum kinds, but I couldn't seem to reproduce it... mind you, I'm pretty unfamiliar with vaex and pyarrow :sweat_smile:

>>> df2 = vaex.from_dict({"baz": [7, 42, 3]})
>>> df2 = df2.categorize("baz")
>>> interchange_df2 = df2.__dataframe__()
>>> interchange_col2 = interchange_df2.get_column_by_name("baz")
>>> kind2, *_ = interchange_col2.dtype
>>> kind2
<_DtypeKind.CATEGORICAL: 23>
...
>>> import pyarrow as pa
>>> table = pa.Table.from_pydict(
...     {"qux": pa.DictionaryArray.from_arrays([0, 1, 0], ["alice", "bob"])}
... )
>>> df3 = vaex.from_arrow_table(table)
>>> interchange_df3 = df3.__dataframe__()
>>> interchange_col3 = interchange_df3.get_column_by_name("qux")
>>> kind3, *_ = interchange_col3.dtype
>>> kind3
<_DtypeKind.CATEGORICAL: 23>