narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License
468 stars 81 forks

[Enh]: Construct DataFrame from Arrow PyCapsule object #1158

Open jonmmease opened 5 days ago

jonmmease commented 5 days ago

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

This request is aimed at using Narwhals to remove the pandas/pyarrow dependencies from VegaFusion 2.0.

Please describe the purpose of the new feature or describe the problem to solve.

The flow I'm aiming for with VegaFusion 2.0 is to use Narwhals for basic column projection and schema inspection, then use the Arrow PyCapsule API to pass the result to Rust. In some cases, the Rust logic will return a new Arrow result in PyCapsule form, and it would be great to be able to use Narwhals to wrap that result using the same backend as the input.

Suggest a solution if possible.

I was picturing a constructor in the same family as from_dict that accepts an Arrow PyCapsule object:

nw.from_arrow_capsule(cap, native_namespace=nw.get_native_namespace(input_df))

cc @kylebarron for all things Arrow PyCapsule 😄

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

kylebarron commented 5 days ago

I think there can be a limited use case for passing around raw capsules, but the more general approach is to export an object from your Rust code that has an __arrow_c_stream__ dunder method, which could then be imported into Narwhals using its existing PyCapsule Interface support. This also reduces the user's reliance on Narwhals and lets them use any Arrow-compatible library of their choosing.

In my own libraries, when I control both sides of the connection, I sometimes do have a from_arrow_capsule method. This can be useful when I want to ensure the user only has one version of arro3.core in their environment, and when I'm using arro3 as the transport to the user's library of choice.

jonmmease commented 5 days ago

> which then could be imported into narwhals using its existing PyCapsule Interface support

Is import already possible in Narwhals? I was under the impression that it was currently only supported on export.

A from_arrow method like the one you have in arro3 (with an additional namespace argument) would work. But since Narwhals already supports wrapping pyarrow, that name seemed like it could be confusing for end users. Then again, it could be a powerful way to convert between libraries.

kylebarron commented 5 days ago

Oh maybe it's only for export? I'm not up to date.

MarcoGorelli commented 2 days ago

This is definitely in scope, thanks for the request! I'll try to put something together soon-ish and we can figure out the details.