Closed tustvold closed 1 year ago
Patch coverage: 100.00% and project coverage change: +0.02 :tada:
Comparison is base (db87f71) 83.76% compared to head (a952e10) 83.78%.
Is there anything about this API that would preclude an (eventual) unification of the underlying buffer types? If not, it seems quite reasonable to me to introduce an (optional) migration path now and then work on unifying the buffer types to get Vec convertibility over time, as resources allow.
> Is there anything about this API that would preclude an (eventual) unification of the underlying buffer types?
No, although if this approach is given the green light, it is unclear that such a unification would be worth the fairly significant effort. I certainly do not intend to undertake it.
Integration test failure does not appear to be related to this PR
I do think it is a regression if we cannot get back to Vec anymore. In polars we convert back sometimes. Could we make this a feature gate? I could feature gate that behavior out in polars as well.
> Whilst perhaps less "pure", simply providing a safe API to convert between ArrayData and Box<dyn arrow2::Array> is likely sufficient.
Couldn't we already do this with the Arrow FFI spec? What are the pros and cons of this route compared with FFI? We would still need to compile both libraries if we convert between the two.
> if we cannot get back to Vec anymore
It's only that you can't go back to Vec from an array that was initially created by the other library and then converted, i.e. the conversion loses the ability to go back to a Vec. Arrow-rs arrays created from a Vec can still be converted back, and the same goes for arrow2.
> ffi
The conversion is safe and ergonomic, ffi is neither 😅
This approach should also be marginally faster, as it doesn't need to marshal to and from the C data layout (which may need to recompute null buffers).
> still need to compile both
You only need to compile an extremely small part of arrow-rs; it won't register in the compile times at all.
> It's only that you can't go back to Vec from an array that was initially created by the other library and then converted, i.e. the conversion loses the ability to go back to a Vec. Arrow-rs arrays created from a Vec can still be converted back, and the same goes for arrow2.
Maybe we can add a test demonstrating going back and forth to Vec (and when it doesn't work) as a way to document the limitation?
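For illustration, the kind of test suggested here could look roughly like the sketch below. It is a std-only model, not the arrow2 or arrow-rs API: `Buffer`, `OtherBuffer`, and `into_vec` are hypothetical names, and the assumption is simply that a buffer remembers whether its allocation came from a local Vec or from the other library.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the *other* library's buffer type.
#[derive(Clone)]
struct OtherBuffer(Arc<Vec<i32>>);

/// Hypothetical stand-in for this library's buffer type.
enum Buffer {
    /// Built from a `Vec` by this library, so the `Vec` can be recovered.
    Native(Vec<i32>),
    /// Wraps an allocation owned by the other library; opaque to us.
    Foreign(OtherBuffer),
}

impl Buffer {
    /// Read access works the same way for both kinds of allocation.
    fn as_slice(&self) -> &[i32] {
        match self {
            Buffer::Native(v) => v.as_slice(),
            Buffer::Foreign(o) => o.0.as_slice(),
        }
    }

    /// Returns the backing `Vec` without copying, or hands the buffer back
    /// unchanged when the allocation is foreign.
    fn into_vec(self) -> Result<Vec<i32>, Buffer> {
        match self {
            Buffer::Native(v) => Ok(v),
            foreign => Err(foreign),
        }
    }
}

impl From<Vec<i32>> for Buffer {
    fn from(v: Vec<i32>) -> Self {
        Buffer::Native(v)
    }
}

impl From<OtherBuffer> for Buffer {
    /// Only the shared handle is moved; no elements are copied.
    fn from(o: OtherBuffer) -> Self {
        Buffer::Foreign(o)
    }
}

#[test]
fn vec_round_trip_and_its_limitation() {
    // A buffer created from a Vec can be converted back.
    let native = Buffer::from(vec![1, 2, 3]);
    assert_eq!(native.into_vec().ok(), Some(vec![1, 2, 3]));

    // A buffer converted from the other library is readable but cannot
    // be turned back into a Vec.
    let converted = Buffer::from(OtherBuffer(Arc::new(vec![4, 5, 6])));
    assert_eq!(converted.as_slice(), &[4, 5, 6]);
    assert!(converted.into_vec().is_err());
}
```

Returning the buffer in the `Err` case is a design choice of this sketch: the caller keeps a usable buffer and can fall back to copying if a Vec is really needed.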
> It's only that you can't go back to Vec from an array that was initially created by the other library and then converted, i.e. the conversion loses the ability to go back to a Vec. Arrow-rs arrays created from a Vec can still be converted back, and the same goes for arrow2.
Right, I misunderstood that part. In that case this looks great! :+1:
My only minor concern is that because arrow-buffer bumps major version every 2 weeks, we need to update this repo every 2 weeks, but this is only a procedural issue as the crate is not changing much.
We might be able to publish new minor versions of arrow2 (e.g. 0.16.1) containing just the version updates, if that turns out to be an issue. I think bumping dependents is semantically compatible.
As part of #1429 we want to provide an interoperability story between arrow2 and arrow-rs.
The original proposal involved porting arrow-rs and arrow2 to have a common base array representation. This was to preserve the original spirit of @jorgecarleitao's proposal in https://github.com/apache/arrow-rs/issues/1176#issuecomment-1430883886. However, doing this in an incremental fashion whilst not introducing performance regressions or major breaking changes is complicated and extremely time-consuming.
Taking a step back, all we really want is a reasonably fast way to convert between array representations, to facilitate interoperability and potentially incremental migration of codebases. Whilst perhaps less "pure", simply providing a safe API to convert between ArrayData and Box<dyn arrow2::Array> is likely sufficient.
The major things this would change are:
- Converted arrays could not be converted back to Vec, as they would be opaque allocations
However, it would allow us to provide an interoperability story in a matter of days instead of weeks/months.
In this vein, this PR adds zero-copy conversion between the buffer representations, as this is all that is really necessary to permit this. The rest of the conversion logic is fairly mechanical; I already have it mostly implemented but wanted to get feedback first.
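To make "zero-copy" concrete, here is a rough std-only sketch of the idea, not the code in this PR. `BufferA`, `BufferB`, and the `Allocation` trait are hypothetical stand-ins for the two libraries' buffer types and for an opaque allocation owner. The assumption is that a conversion only has to move a reference-counted handle and the view window, leaving the bytes where they are.

```rust
use std::sync::Arc;

/// Hypothetical trait for anything that keeps a byte allocation alive
/// and can expose it; models an "opaque allocation" owner.
trait Allocation: Send + Sync {
    fn bytes(&self) -> &[u8];
}

impl Allocation for Vec<u8> {
    fn bytes(&self) -> &[u8] {
        self.as_slice()
    }
}

/// Hypothetical buffer of "library A": a shared `Vec` plus a view window.
#[derive(Clone)]
struct BufferA {
    data: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl BufferA {
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        BufferA { data: Arc::new(data), offset: 0, len }
    }

    /// Narrows the view without touching the allocation.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        BufferA { data: Arc::clone(&self.data), offset: self.offset + offset, len }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data.as_slice()[self.offset..self.offset + self.len]
    }
}

/// Hypothetical buffer of "library B": an opaque owner plus a view window.
struct BufferB {
    owner: Arc<dyn Allocation>,
    offset: usize,
    len: usize,
}

impl BufferB {
    fn as_slice(&self) -> &[u8] {
        &self.owner.bytes()[self.offset..self.offset + self.len]
    }
}

impl From<BufferA> for BufferB {
    fn from(a: BufferA) -> Self {
        // Only the reference-counted handle and the window are moved;
        // the bytes themselves are never copied.
        BufferB { owner: a.data, offset: a.offset, len: a.len }
    }
}

#[test]
fn conversion_is_zero_copy() {
    let a = BufferA::new(vec![1, 2, 3, 4, 5]).slice(1, 3);
    let ptr = a.as_slice().as_ptr();
    let b = BufferB::from(a);
    assert_eq!(b.as_slice(), &[2, 3, 4]);
    // Same address before and after: the data was not copied.
    assert_eq!(b.as_slice().as_ptr(), ptr);
}
```

In this model `BufferB` only sees an `Arc<dyn Allocation>`, which is why a converted buffer cannot be turned back into a Vec: the concrete owner type is erased at the conversion boundary.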