vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Interchange `Column.get_buffers()` erroneously raises for some categorical columns constructed with NumPy arrays #2122

Closed honno closed 2 years ago

honno commented 2 years ago

Calling Column.get_buffers() (from vaex's interchange protocol implementation) raises a ValueError for some categorical columns that were originally constructed from NumPy arrays.

>>> import numpy as np
>>> import vaex
>>> df1 = vaex.from_items(("foo", np.asarray([1, 1])))
>>> df1 = df1.categorize("foo")
>>> interchange_df1 = df1.__dataframe__()
>>> interchange_col1 = interchange_df1.get_column_by_name("foo")
>>> interchange_col1.get_buffers()
Traceback (most recent call last)

    File .../vaex/dataframe_protocol.py:565, in _VaexColumn.get_buffers(self)
        544 """
        545 Return a dictionary containing the underlying buffers.
        546 
      (...)
        562                  buffer.
        563 """
        564 buffers = {}
    --> 565 buffers["data"] = self._get_data_buffer()
        566 try:
        567     buffers["validity"] = self._get_validity_buffer()

    File .../vaex/dataframe_protocol.py:602, in _VaexColumn._get_data_buffer(self)
        600 if self._col.values[0] in labels:
        601     for i in self._col.values:
    --> 602         codes[np.where(codes == i)] = np.where(labels == i)
        603 buffer = _VaexBuffer(self._col.values)
        604 dtype = self._dtype_from_vaexdtype(self._col.dtype)

ValueError: shape mismatch: value array of shape (1,0) could not be broadcast to indexing result of shape (2,)

For this example, when reaching

https://github.com/vaexio/vaex/blob/35c250d585f889272b8ef1096de6fa5462816f52/packages/vaex-core/vaex/dataframe_protocol.py#L601-L602

the relevant values are

>>> codes
array([0, 0])
>>> labels
[1]
>>> i
0
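
The failing assignment can be sketched with NumPy alone, using the values above. This is a minimal illustration, not the vaex code path. As an assumption made for portability, `labels` is converted to an array before the comparison so the sketch does not depend on how a given NumPy version treats a scalar condition in `np.where`. Either way, no label matches `i`, so the right-hand side is an empty index tuple that cannot be broadcast into the two selected positions.

```python
import numpy as np

codes = np.array([0, 0])
labels = [1]  # a plain Python list, as in the report
i = 0

# Comparing a list with an int is an ordinary Python comparison:
# it yields a single bool, not an elementwise mask.
assert (labels == i) is False

# Assumption: compare elementwise on an array copy of labels instead.
# No label equals i, so the result is a tuple holding one empty index array.
matches = np.where(np.asarray(labels) == i)
value_shape = np.asarray(matches).shape  # shape of the value being assigned

error_message = None
try:
    # Two positions are selected on the left, but the value has no elements.
    codes[np.where(codes == i)] = matches
except ValueError as exc:
    error_message = str(exc)

print(value_shape)
print(error_message)
```

The selection on the left has shape (2,), while the assigned value is built from an empty match, which is why the assignment cannot succeed whenever `i` is absent from `labels`.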

Column.get_buffers() works fine when the column is constructed from, say, a built-in list, presumably because pyarrow is used internally and so a different branch of the logic is taken.

>>> df2 = vaex.from_dict({"foo": [1]})
>>> df2 = df2.categorize("foo")
>>> interchange_df2 = df2.__dataframe__()
>>> interchange_col2 = interchange_df2.get_column_by_name("foo")
>>> interchange_col2.get_buffers()
{'data': ..., 'validity': None, 'offsets': None}

maartenbreddels commented 2 years ago

Hi Matthew,

Yeah, that seems like a legit bug. I think we should generalize the tests to cover both cases. Do you feel like exposing this by opening a PR with a failing test?

Regards,

Maarten