vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Interchange `Column.dtype` returns format strings in NumPy-style, instead of Arrow-style #2139

Open honno opened 2 years ago

honno commented 2 years ago

In the interchange protocol, `Column.dtype` should return an Arrow-style format string as its third element, but vaex returns a NumPy-style one instead:

>>> import numpy as np
>>> import vaex
>>> df = vaex.from_items(("foo", np.asarray([0, 1, 2], dtype="int64")))
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("foo")
>>> interchange_col.dtype
(<_DtypeKind.INT: 0>, 8, '<i8', '|')

This happens with Arrow-backed columns too:

>>> import pyarrow as pa
>>> table = pa.Table.from_pydict({"foo": [0, 1, 2]})
>>> df = vaex.from_arrow_table(table)
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("foo")
>>> interchange_col.dtype
(<_DtypeKind.INT: 0>, 64, '<i8', '=')

It looks like the `.str` attribute of the equivalent NumPy dtype object is currently returned as-is:

https://github.com/vaexio/vaex/blob/35c250d585f889272b8ef1096de6fa5462816f52/packages/vaex-core/vaex/dataframe_protocol.py#L407-L410
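For illustration, here is a minimal sketch of how a NumPy `.str` value such as `'<i8'` could be translated into the corresponding Arrow C Data Interface format character. The helper name `arrow_format_from_numpy_str` and the lookup table are hypothetical and not part of vaex; only fixed-width integer and float types are covered, and byte order is ignored.

```python
# Arrow C Data Interface format characters, keyed by
# (NumPy kind character, item size in bytes).
_ARROW_FORMATS = {
    ("i", 1): "c", ("i", 2): "s", ("i", 4): "i", ("i", 8): "l",  # signed ints
    ("u", 1): "C", ("u", 2): "S", ("u", 4): "I", ("u", 8): "L",  # unsigned ints
    ("f", 2): "e", ("f", 4): "f", ("f", 8): "g",                 # floats
}


def arrow_format_from_numpy_str(dtype_str):
    """Hypothetical helper: map a NumPy ``dtype.str`` to an Arrow format string.

    ``dtype_str`` is expected to look like '<i8': a byte-order character,
    a kind character, then the item size in bytes.
    """
    kind = dtype_str[1]
    size = int(dtype_str[2:])
    try:
        return _ARROW_FORMATS[(kind, size)]
    except KeyError:
        # Strings, datetimes, booleans, etc. are out of scope for this sketch.
        raise NotImplementedError(f"no Arrow format known for {dtype_str!r}")


# The int64 column from the report would then report 'l' rather than '<i8'.
print(arrow_format_from_numpy_str("<i8"))
```

With a mapping like this, the third element of the dtype tuple for the `foo` column above would be `'l'`, the Arrow format string for a signed 64-bit integer.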