jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.06k stars 222 forks source link

Added buffer interoperability with arrow-rs #1437

Closed tustvold closed 1 year ago

tustvold commented 1 year ago

As part of #1429 we want to provide an interoperability story between arrow2 and arrow-rs.

The original proposal involved porting arrow-rs and arrow2 to have a common base array representation. This was to preserve the original spirit of @jorgecarleitao 's proposal in https://github.com/apache/arrow-rs/issues/1176#issuecomment-1430883886. However, doing this in an incremental fashion whilst not introducing performance regressions or major breaking changes is complicated and extremely time consuming.

Taking a step-back, all we really want is a reasonably fast way to convert between array representations, to facilitate interoperability and potentially incremental migration of codebases. Whilst perhaps less "pure", simply providing a safe API to convert between ArrayData and Box<dyn arrow2::Array> is likely sufficient.

The major things this would change are:

However, it would allow us to provide an interoperability story in a matter of days instead of weeks/months.

In this vein, this PR adds zero-copy conversion between the buffer representations, as this is all that is really necessary to permit this. The rest of the conversion logic is fairly mechanical, I already have it mostly implemented but wanted to get feedback first.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02 :tada:

Comparison is base (db87f71) 83.76% compared to head (a952e10) 83.78%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1437 +/- ## ========================================== + Coverage 83.76% 83.78% +0.02% ========================================== Files 375 376 +1 Lines 41024 41074 +50 ========================================== + Hits 34364 34415 +51 + Misses 6660 6659 -1 ``` | [Impacted Files](https://codecov.io/gh/jorgecarleitao/arrow2/pull/1437?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | | |---|---|---| | [src/types/native.rs](https://codecov.io/gh/jorgecarleitao/arrow2/pull/1437?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL3R5cGVzL25hdGl2ZS5ycw==) | `68.46% <ø> (ø)` | | | [src/bitmap/immutable.rs](https://codecov.io/gh/jorgecarleitao/arrow2/pull/1437?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2JpdG1hcC9pbW11dGFibGUucnM=) | `88.42% <100.00%> (+1.21%)` | :arrow_up: | | [src/buffer/immutable.rs](https://codecov.io/gh/jorgecarleitao/arrow2/pull/1437?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2J1ZmZlci9pbW11dGFibGUucnM=) | `87.38% <100.00%> (+0.72%)` | :arrow_up: | | [src/buffer/mod.rs](https://codecov.io/gh/jorgecarleitao/arrow2/pull/1437?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2J1ZmZlci9tb2QucnM=) | `100.00% <100.00%> (ø)` | | | [src/ffi/array.rs](https://codecov.io/gh/jorgecarleitao/arrow2/pull/1437?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2ZmaS9hcnJheS5ycw==) | `83.12% <100.00%> (ø)` | | ... and [4 files with indirect coverage changes](https://codecov.io/gh/jorgecarleitao/arrow2/pull/1437/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

alamb commented 1 year ago

Is there anything about this API that would preclude an (eventual) unification of the underlying buffer types? If not, it then seems quite reasonable to me to introduce an (optional) migration path and then work on the unifying buffer types to get Vec convertibility over time / as resources allow

tustvold commented 1 year ago

Is there anything about this API that would preclude an (eventual) unification of the underlying buffer types

No, although if this approach is given the green light, it is unclear that such a unification would be worth the fairly significant effort, I certainly would not be intending to undertake it.

tustvold commented 1 year ago

Integration test failure does not appear to be related to this PR

ritchie46 commented 1 year ago

I do think it is a regression if we cannot get back to Vec anymore, In polars we convert back sometimes. Could we make this a feature gate? I could feature gate that behavior out in polars as well.

Whilst perhaps less "pure", simply providing a safe API to convert between ArrayData and Box is likely sufficient.

Couldn't we already do this with arrow FFI spec? What are the pro's and cons against this route? As we would still need to compile both libraries if we convert between the two.

tustvold commented 1 year ago

if we cannot get back to Vec anymore

It's only you can't go back to Vec from an array created initially by the other library and then converted, i.e. the conversion loses the ability to go back to a vec. Arrow-rs arrays created from vec can still be converted back, and the same for arrow2

ffi

The conversion is safe and ergonomic, ffi is neither 😅

This approach should also be marginally faster as it doesn't need to marshal back and forth from the c data layout (which may need to recompute null buffers)

still need to compile both

You only need to compile an extremely small part of arrow-rs, it won't register in the compile times at all

alamb commented 1 year ago

It's only you can't go back to Vec from an array created initially by the other library and then converted, i.e. the conversion loses the ability to go back to a vec. Arrow-rs arrays created from vec can still be converted back, and the same for arrow2

Maybe we can add a test demonstrating going back/forth to vec (and when it doesn't work) as a way to document the limitiation?

ritchie46 commented 1 year ago

It's only you can't go back to Vec from an array created initially by the other library and then converted, i.e. the conversion loses the ability to go back to a vec. Arrow-rs arrays created from vec can still be converted back, and the same for arrow2

Right, I misunderstood that part. In that case this looks great! :+1:

alamb commented 1 year ago

My only minor concern is that because arrow-buffer bumps major version every 2 weeks, we need to update this repo every 2 weeks, but this is only a procedural issue as the crate is not changing much.

We might be able to publish new versions of arrow2 with minor (e.g. 0.16.1) with just version updates if that turns out to be an issue. I think bumping dependents is semantically compatible