vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[FEATURE-REQUEST] Support interchanging vaex dataframes with Arrow-backend columns #2134

Closed: honno closed this issue 2 years ago

honno commented 2 years ago

Initialising an interchange protocol buffer (_VaexBuffer) only works for vaex columns with NumPy backends:

https://github.com/vaexio/vaex/blob/35c250d585f889272b8ef1096de6fa5462816f52/packages/vaex-core/vaex/dataframe_protocol.py#L241-L244

_VaexBuffer.__init__() is private API, but this affects interchange with other libraries, because it is called when using the public Column.get_buffers() API:

packages/vaex-core/vaex/dataframe_protocol.py:565: in get_buffers
    buffers["data"] = self._get_data_buffer()
packages/vaex-core/vaex/dataframe_protocol.py:603: in _get_data_buffer
    buffer = _VaexBuffer(self._col.values)

So obviously it'd be nice (if not practically essential?) if vaex supported interchanging Arrow-backend columns too. I just thought to raise this issue as a tracker, as I didn't quite see relevant conversation in https://github.com/vaexio/vaex/pull/1509. cc @maartenbreddels

maartenbreddels commented 2 years ago

Even if the buffer is stored as a NumPy array, the underlying data can still be an Arrow array.

I think it should be possible to do arrow->protocol->arrow without a memory copy. At least that's how we designed the spec, AFAIR. It could be that the implementation is still missing some parts.
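To illustrate what "without a memory copy" means here, below is a minimal sketch using only the standard library. The interchange protocol's buffer object exposes `bufsize` and `ptr`; everything else (`SketchBuffer`, `consume`, the ctypes-backed data) is hypothetical and made up for this example. It is not vaex's `_VaexBuffer` and not how vaex or pyarrow actually implement the exchange.

```python
import ctypes


class SketchBuffer:
    """Hypothetical stand-in for an interchange buffer (not vaex's _VaexBuffer)."""

    def __init__(self, data):
        # Keep a reference so the memory stays alive while consumers use it.
        self._data = data

    @property
    def bufsize(self):
        # Size of the buffer in bytes.
        return ctypes.sizeof(self._data)

    @property
    def ptr(self):
        # Address of the first byte of the buffer.
        return ctypes.addressof(self._data)


def consume(buffer, ctype=ctypes.c_int64):
    """View the producer's memory through its pointer, without copying."""
    n = buffer.bufsize // ctypes.sizeof(ctype)
    return (ctype * n).from_address(buffer.ptr)


# Producer side: five int64 values held in a ctypes array (assumed layout).
source = (ctypes.c_int64 * 5)(0, 1, 2, 3, 4)
buf = SketchBuffer(source)

# Consumer side: rebuild a typed view from ptr + bufsize only.
view = consume(buf)

# A write through the producer is visible to the consumer: same memory.
source[0] = 42
```

Because the consumer only ever sees a pointer and a size, it does not matter to it whether the producer holds the data as a NumPy array or an Arrow array, as long as the producer can hand out a contiguous buffer.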

honno commented 2 years ago

Ah, so you fixed the issue I was alluding to in #2122:

-                buffer = _VaexBuffer(self._col.values)
+                buffer = _VaexBuffer(indices.to_numpy())

Before that change, a test like the following would fail

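import numpy as np


def test_smoke_get_buffers(df_factory):
    # df_factory: pytest fixture that builds the dataframe for a given backend
    x = np.arange(5)
    df = df_factory(x=x)
    df = df.categorize("x")
    interchange_df = df.__dataframe__()
    interchange_col = interchange_df.get_column_by_name("x")
    # Smoke check only: this call used to raise
    interchange_col.get_buffers()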

for the pyarrow (+chunked) dataframe. So I think you're all good? I'll get to forcibly generating Arrow-backend examples for dataframe-interchange-tests.

honno commented 2 years ago

Wrote a regression test: https://github.com/vaexio/vaex/pull/2135