vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Passing `n_chunks` to interchange `Column.get_chunks()` erroneously raises #2121

Open honno opened 2 years ago

honno commented 2 years ago

vaex's interchange protocol Column raises when get_chunks() is passed anything other than n_chunks=None (the default):

>>> df = vaex.from_dict({"foo": [42]})
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("foo")
>>> interchange_col.get_chunks()  # i.e. n_chunks=None
[<vaex.dataframe_protocol._VaexColumn at 0x7f7756a4ea30>]
>>> interchange_col.get_chunks(n_chunks=1)
.../vaex/dataframe_protocol.py:541, in _VaexColumn.get_chunks(self, n_chunks)
    538     return iterator
    540 else:
--> 541     raise ValueError(f"Column {self._col.expression} is already chunked.")
ValueError: Column foo is already chunked.
# should return [<vaex.dataframe_protocol._VaexColumn at 0x7f7756a4ea30>] or equivalent

n_chunks=1 should be valid, as it is (always) a multiple of Column.num_chunks():

>>> interchange_col.num_chunks()
1
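
For illustration, the multiple-of rule can be written as a standalone check. This is only a sketch: the helper name `is_valid_n_chunks` is hypothetical and is not part of vaex or the interchange protocol. It assumes only that a requested `n_chunks` has to be a multiple of the column's existing chunk count.

```python
from typing import Optional


def is_valid_n_chunks(n_chunks: Optional[int], num_chunks: int) -> bool:
    """Hypothetical helper: report whether a requested n_chunks is acceptable
    for a column that currently reports num_chunks chunks."""
    if n_chunks is None:
        # None means "keep the existing chunking", which is always acceptable.
        return True
    # Otherwise the request must be a positive multiple of the current count.
    return n_chunks > 0 and n_chunks % num_chunks == 0


print(is_valid_n_chunks(1, 1))     # the case from this report
print(is_valid_n_chunks(None, 1))  # the default
```

Under this rule, the `n_chunks=1` call above would be accepted for a column whose `num_chunks()` is 1.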

It seems this line should be calling num_chunks(); currently it just compares the method object itself to 1:

https://github.com/vaexio/vaex/blob/35c250d585f889272b8ef1096de6fa5462816f52/packages/vaex-core/vaex/dataframe_protocol.py#L532
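
A minimal sketch of the suspected mistake, using a hypothetical stand-in class (`FakeColumn` and its helper methods are made up for illustration and are not vaex code). Comparing a bound method to an integer is always False in Python, so a branch guarded that way can never be taken:

```python
class FakeColumn:
    """Hypothetical stand-in for an interchange column; not vaex code."""

    def num_chunks(self) -> int:
        return 1

    def single_chunk_buggy(self) -> bool:
        # Missing parentheses: compares the bound method object to 1.
        return self.num_chunks == 1

    def single_chunk_fixed(self) -> bool:
        # Calls the method and compares its return value to 1.
        return self.num_chunks() == 1


col = FakeColumn()
print(col.single_chunk_buggy())  # False
print(col.single_chunk_fixed())  # True
```

If the linked line has this shape, adding the call parentheses should let `get_chunks(n_chunks=1)` reach the chunking branch instead of falling through to the `ValueError`.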