Open kylebarron opened 1 month ago
Hi @kylebarron !
I'm certainly interested in doing what I can to facilitate this
This would also allow you to remove your polars-specific code, because polars implements the Arrow PyCapsule interface (pola-rs/polars#17676) as of Polars 1.3.
Could you show how this would work please? There's only one teeny-tiny Polars-specific piece of code in Altair, and it's not clear to me how the Arrow C Interface would address it, but I might be missing something
I'm referring to these lines:
Those may not be solely for Polars, but a primary goal of the PyCapsule Interface is to standardize the method name by which one library exports data to others. So instead of checking for all these possible names, you can use __arrow_c_stream__ under the hood.
So essentially:
-for convert_method_name in ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow"):
-    convert_method = getattr(dfi_df, convert_method_name, None)
-    if callable(convert_method):
-        result = convert_method()
-        if isinstance(result, pa.Table):
-            return result
+if hasattr(dfi_df, "__arrow_c_stream__"):
+    return pa.table(dfi_df)
Polars input wouldn't go down those lines anyway, as it would already have been handled in the Narwhals path (and it wouldn't involve any conversion to pyarrow)
Isn't PyCapsule Interface adoption a bit too recent for it to be used here as the only way to convert to pyarrow? It would cut off support for oldish versions of Ibis / DuckDB for which the current code works fine. But using it if it's available, in addition to the current code but before the interchange protocol, sounds like a good idea
Polars input wouldn't go down those lines anyway, as it would already have been handled in the Narwhals path (and it wouldn't involve any conversion to pyarrow)
Ah, I hadn't noticed that.
Isn't PyCapsule Interface adoption a bit too recent for it to be used here as the only way to convert to pyarrow? It would cut off support for oldish versions of Ibis / DuckDB for which the current code works fine. But using it if it's available, in addition to the current code but before the interchange protocol, sounds like a good idea
Yes, I should've been clearer: sometime in the future, when you're OK with the pyarrow version constraint, you can remove those lines of code.
For the time being, I'd suggest it as an addition, not a replacement, to those existing checks.
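The additive ordering discussed above could be sketched roughly as follows. This is only an illustration, not Altair's actual code: `export_paths` and `to_arrow_table` are hypothetical names, trying `__arrow_c_stream__` ahead of the legacy method names is an assumption, and `pa.table()` is assumed to come from a pyarrow version recent enough to accept PyCapsule producers.

```python
from typing import Any, List

# Method names taken from the existing checks quoted earlier in this thread.
LEGACY_METHOD_NAMES = ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow")


def export_paths(obj: Any) -> List[str]:
    """List the conversion paths obj offers, in the order they would be tried.

    Hypothetical helper: PyCapsule first, then the legacy method names,
    then the DataFrame Interchange Protocol as the last resort.
    """
    paths = []
    if hasattr(obj, "__arrow_c_stream__"):
        paths.append("pycapsule")
    for name in LEGACY_METHOD_NAMES:
        if callable(getattr(obj, name, None)):
            paths.append(name)
    if hasattr(obj, "__dataframe__"):
        paths.append("interchange")
    return paths


def to_arrow_table(obj: Any):
    """Convert obj to a pyarrow.Table using the first path that succeeds."""
    import pyarrow as pa  # imported lazily because pyarrow is optional

    for path in export_paths(obj):
        if path == "pycapsule":
            # pa.table() consumes the stream exported by __arrow_c_stream__.
            return pa.table(obj)
        if path == "interchange":
            from pyarrow.interchange import from_dataframe

            return from_dataframe(obj)
        # Legacy path: call the method and accept it only if it yields a Table.
        result = getattr(obj, path)()
        if isinstance(result, pa.Table):
            return result
    raise TypeError(f"Cannot convert {type(obj).__name__} to a pyarrow.Table")
```

Older Ibis / DuckDB objects that only expose one of the legacy methods would still be handled by the loop, while newer producers would take the PyCapsule branch first.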
What is your suggestion?
The Arrow project recently created the Arrow PyCapsule Interface, a new protocol for sharing Arrow data in Python. Among its goals is allowing Arrow data interchange without requiring the use of pyarrow, but I'm also excited about the prospect of an ecosystem that can share data only by the presence of dunder methods, where producer and consumer don't have to have prior knowledge of each other.
This would allow Altair to work out of the box with any Arrow-based object that supports this interface.
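To illustrate the "presence of dunder methods" idea, here is a minimal sketch of a wrapper type that becomes a valid producer simply by forwarding `__arrow_c_stream__` to whatever it wraps. The class name `ArrowStreamWrapper` is hypothetical and is not part of Altair, Narwhals, or pyarrow; only the method name and its optional `requested_schema` argument come from the PyCapsule Interface.

```python
from typing import Any, Optional


class ArrowStreamWrapper:
    """Hypothetical wrapper that re-exports the wrapped object's Arrow stream."""

    def __init__(self, inner: Any) -> None:
        # The only requirement on `inner` is the dunder method itself;
        # the wrapper needs no knowledge of which library produced it.
        if not hasattr(inner, "__arrow_c_stream__"):
            raise TypeError(
                f"{type(inner).__name__} does not implement __arrow_c_stream__"
            )
        self._inner = inner

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        # Forward the request unchanged; the result is the producer's capsule.
        return self._inner.__arrow_c_stream__(requested_schema)
```

A consumer that checks `hasattr(obj, "__arrow_c_stream__")` would treat the wrapper and the wrapped object identically, without importing the library that produced either.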
I've been working to promote the PyCapsule Interface across the ecosystem, with many libraries having adopted support so far.
Given that altair already has an optional dependency on pyarrow, the easiest implementation would be a simple addition in here: https://github.com/vega/altair/blob/5207768b6e533c0509218376942309d1c7bac22f/altair/utils/data.py#L417-L434
to first call
This would also allow you to remove your polars-specific code, because polars implements the Arrow PyCapsule interface (https://github.com/pola-rs/polars/pull/17676) as of Polars 1.3.
Alternatively, this interface would enable you to accept Arrow input data without a pyarrow dependency, if that's attractive.
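As a rough sketch of what the pyarrow-free route could look like at the detection step, the snippet below obtains the stream capsule and checks its name using only the standard library. It is CPython-specific (it relies on `ctypes.pythonapi`), the function name `export_stream_capsule` is hypothetical, and actually reading the stream would still require some Arrow implementation, which is not shown here.

```python
import ctypes
from typing import Any

# Capsule name the Arrow PyCapsule Interface specifies for stream exports.
STREAM_CAPSULE_NAME = b"arrow_array_stream"

# CPython C-API helper: returns 1 if the object is a capsule with this name.
_capsule_is_valid = ctypes.pythonapi.PyCapsule_IsValid
_capsule_is_valid.argtypes = [ctypes.py_object, ctypes.c_char_p]
_capsule_is_valid.restype = ctypes.c_int


def export_stream_capsule(obj: Any) -> object:
    """Return obj's Arrow stream capsule, validating its name (hypothetical helper)."""
    export = getattr(obj, "__arrow_c_stream__", None)
    if not callable(export):
        raise TypeError(
            f"{type(obj).__name__} does not implement __arrow_c_stream__"
        )
    capsule = export()
    if not _capsule_is_valid(capsule, STREAM_CAPSULE_NAME):
        raise TypeError("__arrow_c_stream__ did not return an 'arrow_array_stream' capsule")
    return capsule
```

The returned capsule could then be handed to any Arrow consumer that accepts PyCapsules, which is where a non-pyarrow implementation would take over.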
I figure @MarcoGorelli also has opinions about this given https://github.com/vega/altair/issues/3445. Narwhals also supports PyCapsule Interface export: https://github.com/narwhals-dev/narwhals/pull/786.
Have you considered any alternative solutions?
Altair already supports the DataFrame Interchange Protocol, but that is not a direct replacement for the PyCapsule Interface. The PyCapsule Interface is much easier to implement for Arrow-based libraries and allows zero-copy data exchange with very little overhead. There are many libraries that would implement the PyCapsule Interface without wanting to go through the trouble of implementing the DataFrame Interchange Protocol.
Also relevant is that vegafusion is planning to adopt this, notwithstanding a Rust technical issue https://github.com/vega/vegafusion/pull/501