vega / vegafusion

Serverside scaling for Vega and Altair visualizations
https://vegafusion.io
BSD 3-Clause "New" or "Revised" License

Support ingesting objects that support the Arrow PyCapsule API #498

Open jonmmease opened 1 month ago

jonmmease commented 1 month ago

We could support ingesting objects that implement the Arrow PyCapsule API.

Compared to the current support for the DataFrame Interchange Protocol, accepting objects that implement the Arrow PyCapsule API wouldn't require pyarrow, and wouldn't require converting to a pyarrow Table on the Python side.

I think we could use @kylebarron's new pyo3-arrow crate for this (since it doesn't require the pyarrow dependency). In fact, I think we could drop pyarrow as a hard dependency using this approach, since pyarrow itself supports the PyCapsule API.
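
As a rough illustration of why no pyarrow import is needed on the Python side: the PyCapsule API is duck-typed, so a consumer only has to look for the `__arrow_c_stream__` or `__arrow_c_array__` methods on the object. The sketch below is not VegaFusion code; `arrow_pycapsule_kind` and `FakeTable` are hypothetical names made up for the example.

```python
from typing import Any, Optional


def arrow_pycapsule_kind(obj: Any) -> Optional[str]:
    """Report which Arrow PyCapsule export method `obj` offers, if any."""
    # Tables and data frames normally export a stream of record batches.
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    # A single record batch or array exports one array plus its schema.
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    return None


class FakeTable:
    """Hypothetical stand-in; a real producer would return a PyCapsule here."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


print(arrow_pycapsule_kind(FakeTable()))  # stream
print(arrow_pycapsule_kind([1, 2, 3]))    # None
```

Checking for the method rather than for a specific library or version means any producer that adopts the protocol would be picked up without further changes.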


cc @MarcoGorelli based on comment in https://github.com/vega/altair/pull/3452#issuecomment-2205819070

> I've caught up with Polars devs, and they're on board with using Altair in polars.DataFrame.plot if the plots can be done directly without going via pandas

For VegaFusion (which powers Vega-Altair's optional "vegafusion" data transformer) to support polars without pyarrow, I think we'll need polars to support the PyCapsule API, as discussed in https://github.com/pola-rs/polars/issues/12530. That would let operations like Vega-Altair's histogram binning and aggregation run in the Python kernel rather than in the browser.

kylebarron commented 1 month ago

> I think we could drop pyarrow as a hard dependency using this approach, since pyarrow itself supports the PyCapsule API.

Yes, though it does require the user to have a relatively recent version of pyarrow.

Let me know if I can help with pyo3-arrow at all! I've only published a version supporting arrow version 52, so you'll have to upgrade before you can use pyo3-arrow. I figure you only care about importing data, not exporting data?

jonmmease commented 1 month ago

Thanks for chiming in @kylebarron

> I've only published a version supporting arrow version 52

Yeah, I need to update DataFusion and Arrow soon anyway.

> I figure you only care about importing data, not exporting data?

That's correct.

> Yes, though it does require the user to have a relatively recent version of pyarrow.

Thanks for the call out, that's a good point.

kylebarron commented 1 month ago

I started a PR for polars pycapsule export here: https://github.com/pola-rs/polars/pull/17676

kylebarron commented 1 month ago

If you pointed me to where the arrow ingest happens, I could probably make a PR for this if you'd like

jonmmease commented 1 month ago

Thanks for the offer!

Here is where the pyarrow tables are imported:

https://github.com/vega/vegafusion/blob/007bd44188676de7259bc02a61693b3dc7586072/vegafusion-common/src/data/table.rs#L270-L286

This is invoked from the PyO3 Rust code in:

https://github.com/vega/vegafusion/blob/007bd44188676de7259bc02a61693b3dc7586072/vegafusion-python-embed/src/lib.rs#L189-L196

I'm imagining there would be a VegaFusionDataset.from_arrow_pycapsule or something, then another branch in process_inline_datasets that would check for the PyCapsule interface and use this API. Definitely happy for you to do the whole PR, but even if you only implement this from_arrow_pycapsule method, that would be really helpful and I can do the rest of the routing later.
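
To make the routing idea concrete, here is a minimal Python sketch of the kind of branch described above. It is not the actual process_inline_datasets logic, which lives in the PyO3 Rust code: `choose_ingest_path` and the two demo classes are hypothetical, and trying the PyCapsule branch before the DataFrame Interchange Protocol fallback is an assumption about ordering.

```python
from typing import Any


def choose_ingest_path(dataset: Any) -> str:
    """Pick an ingestion route for one inline dataset (hypothetical helper)."""
    # New branch: objects exporting an Arrow C stream need no pyarrow Table.
    if hasattr(dataset, "__arrow_c_stream__"):
        return "arrow_pycapsule"
    # Fallback: objects implementing the DataFrame Interchange Protocol.
    if hasattr(dataset, "__dataframe__"):
        return "dataframe_interchange"
    raise TypeError(f"Unsupported inline dataset type: {type(dataset).__name__}")


class CapsuleOnly:
    """Demo object exposing only the PyCapsule stream method."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


class InterchangeOnly:
    """Demo object exposing only the interchange protocol method."""

    def __dataframe__(self, allow_copy=True):
        raise NotImplementedError("illustration only")


print(choose_ingest_path(CapsuleOnly()))      # arrow_pycapsule
print(choose_ingest_path(InterchangeOnly()))  # dataframe_interchange
```

An object that implements both protocols would take the PyCapsule route under this ordering, which matches the goal of avoiding the pyarrow conversion where possible.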

If this is blocked by updating arrow-rs, I can ping this thread once that's done.

kylebarron commented 1 month ago

> If this is blocked by updating arrow-rs, I can ping this thread once that's done

I think that's primarily a question of whether you're ok vendoring the relevant PyCapsule code (on top of arrow-rs' FFI code). It's a relatively small amount of code (polars pr for reference), and then you don't have to add a dependency on pyo3-arrow if you don't want.