vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

Support Arrow PyCapsule Interface #3568

Open kylebarron opened 1 month ago

kylebarron commented 1 month ago

What is your suggestion?

👋 The Arrow project recently created the Arrow PyCapsule Interface, a new protocol for sharing Arrow data in Python. Among its goals is allowing Arrow data interchange without requiring the use of pyarrow, but I'm also excited about the prospect of an ecosystem that can share data only by the presence of dunder methods, where producer and consumer don't have to have prior knowledge of each other.

This would allow Altair to work out of the box with any Arrow-based object that supports this interface.

I've been working to promote the PyCapsule Interface across the ecosystem, with many libraries having adopted support so far.
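To make the dunder-based exchange concrete, here is a minimal sketch of the producer side. `ArrowBackedFrame` and `supports_arrow_stream` are hypothetical names invented for illustration. The sketch assumes the wrapped object already implements `__arrow_c_stream__` and simply delegates to it.

```python
from typing import Any, Optional


class ArrowBackedFrame:
    """Hypothetical producer that exposes Arrow data via the PyCapsule Interface."""

    def __init__(self, arrow_data: Any) -> None:
        # Assumption: `arrow_data` itself implements `__arrow_c_stream__`.
        self._arrow_data = arrow_data

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        # Delegate to the wrapped object. With a real Arrow producer the return
        # value is a PyCapsule wrapping an ArrowArrayStream.
        return self._arrow_data.__arrow_c_stream__(requested_schema)


def supports_arrow_stream(obj: Any) -> bool:
    """Consumer-side check: the dunder is all a consumer needs to know about `obj`."""
    return hasattr(obj, "__arrow_c_stream__")
```

A consumer with a sufficiently recent pyarrow could then call `pa.table(ArrowBackedFrame(some_table))` without knowing anything else about the class, which is the same call proposed for Altair in this issue.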

Given that altair already has an optional dependency on pyarrow, the easiest implementation would be a simple addition in here: https://github.com/vega/altair/blob/5207768b6e533c0509218376942309d1c7bac22f/altair/utils/data.py#L417-L434

to first call

if hasattr(dfi_df, "__arrow_c_stream__"):
    # pa.table() will automatically check for `__arrow_c_stream__` and call that
    # todo: add pyarrow version check; I forget which version added support for the PyCapsule Interface
    return pa.table(dfi_df)
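The version check mentioned in the todo could be sketched roughly as follows. The helper names `version_tuple` and `pyarrow_at_least` are hypothetical, and the minimum version is deliberately left as a parameter because the thread does not settle which pyarrow release first supports the interface.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def version_tuple(version_string: str, width: int = 3) -> Tuple[int, ...]:
    """Leading numeric components of a version string, zero-padded to `width`."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev4"
        parts.append(int(match.group()))
    parts = parts[:width]
    return tuple(parts + [0] * (width - len(parts)))


def pyarrow_at_least(minimum: str) -> bool:
    """True if pyarrow is installed and at least `minimum` (threshold supplied by caller)."""
    try:
        installed = version("pyarrow")
    except PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)
```

The `hasattr` branch in the snippet above could then be guarded with `pyarrow_at_least(...)`, with the threshold filled in once the right release is confirmed.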

This would also allow you to remove your polars-specific code, because polars implements the Arrow PyCapsule interface (https://github.com/pola-rs/polars/pull/17676) as of Polars 1.3.

Alternatively, this interface would enable you to accept Arrow input data without a pyarrow dependency, if that's attractive.

I figure @MarcoGorelli also has opinions about this given https://github.com/vega/altair/issues/3445. Narwhals also supports PyCapsule Interface export: https://github.com/narwhals-dev/narwhals/pull/786.

Have you considered any alternative solutions?

Altair already supports the DataFrame Interchange Protocol, but that is not a direct replacement for the PyCapsule Interface. The PyCapsule Interface is much easier to implement for Arrow-based libraries and allows zero-copy data exchange with very little overhead. There are many libraries that would implement the PyCapsule Interface without wanting to go through the trouble of implementing the DataFrame Interchange Protocol.

Also relevant is that vegafusion is planning to adopt this, notwithstanding a Rust technical issue https://github.com/vega/vegafusion/pull/501

MarcoGorelli commented 1 month ago

Hi @kylebarron !

I'm certainly interested in doing what I can to facilitate this

> This would also allow you to remove your polars-specific code, because polars implements the Arrow PyCapsule interface (pola-rs/polars#17676) as of Polars 1.3.

Could you show how this would work please? There's only one teeny-tiny Polars-specific piece of code in Altair, and it's not clear to me how the Arrow C Interface would address it, but I might be missing something

kylebarron commented 1 month ago

I'm referring to these lines:

https://github.com/vega/altair/blob/5207768b6e533c0509218376942309d1c7bac22f/altair/utils/data.py#L421-L431

Those may not be solely for Polars, but a primary goal of the PyCapsule Interface is to standardize the method name by which one library exports data to others. So instead of checking for all these possible names, you can use __arrow_c_stream__ under the hood.

So essentially:

-for convert_method_name in ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow"):
-    convert_method = getattr(dfi_df, convert_method_name, None)
-    if callable(convert_method):
-        result = convert_method()
-        if isinstance(result, pa.Table):
-            return result
+if hasattr(dfi_df, "__arrow_c_stream__"):
+    return pa.table(dfi_df)

MarcoGorelli commented 1 month ago

Polars input wouldn't go down those lines anyway, as it would already have been handled in the Narwhals path 😉 (and it wouldn't involve any conversion to pyarrow)

Isn't PyCapsule Interface adoption a bit too recent for it to be used here as the only way to convert to pyarrow? It would cut off support for oldish versions of Ibis / DuckDB for which the current code works fine. But using it if it's available, in addition to the current code but before the interchange protocol, sounds like a good idea 👍

kylebarron commented 1 month ago

> Polars input wouldn't go down those lines anyway, as it would already have been handled in the Narwhals path 😉 (and it wouldn't involve any conversion to pyarrow)

Ah, I hadn't noticed that.

> Isn't PyCapsule Interface adoption a bit too recent for it to be used here as the only way to convert to pyarrow? It would cut off support for oldish versions of Ibis / DuckDB for which the current code works fine. But using it if it's available, in addition to the current code but before the interchange protocol, sounds like a good idea 👍

Yes, I should've been clearer: sometime in the future, once you're OK with the pyarrow version constraint, you can remove those lines of code.

For the time being, I'd suggest it as an addition, not a replacement, to those existing checks.
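Putting the conclusion of the thread together, the conversion order could look like the sketch below: try the PyCapsule Interface first, then the existing method-name checks, then the DataFrame Interchange Protocol. The function name `to_arrow_table` is a placeholder rather than Altair's actual helper, and the final fallback via `pyarrow.interchange.from_dataframe` is an assumption about what follows the linked lines. The pyarrow version check from the todo is omitted here.

```python
from typing import Any

# Method names taken from the existing checks quoted in this thread.
CONVERT_METHOD_NAMES = ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow")


def to_arrow_table(dfi_df: Any) -> Any:
    """Convert `dfi_df` to a pyarrow Table, preferring the PyCapsule Interface."""
    import pyarrow as pa
    import pyarrow.interchange as pi

    # 1. New: use the standardized dunder when the object provides it.
    if hasattr(dfi_df, "__arrow_c_stream__"):
        return pa.table(dfi_df)

    # 2. Existing: try the library-specific conversion method names.
    for convert_method_name in CONVERT_METHOD_NAMES:
        convert_method = getattr(dfi_df, convert_method_name, None)
        if callable(convert_method):
            result = convert_method()
            if isinstance(result, pa.Table):
                return result

    # 3. Fallback (assumed): the DataFrame Interchange Protocol.
    return pi.from_dataframe(dfi_df)
```

Because the `hasattr` check only adds a new first step, objects from older Ibis or DuckDB versions that lack `__arrow_c_stream__` still fall through to the existing checks, which addresses the compatibility concern raised above.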