jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.07k stars 223 forks source link

Implement into_inner for Utf8Array and BinaryArray for reusing Buffer allocations #1491

Closed hzuo closed 1 year ago

hzuo commented 1 year ago

Many kernels operate directly on the underlying buffer, creating a new allocation for a new buffer with updated values, and re-use the other components of the array like the offsets, bitmap, etc.

In many cases, the the output of the kernel is transient, and the allocation for the buffer can be reused - as we run the kernel over a big set of arrays, we can reuse the same scratch buffer (similar to the scratch buffers used in the io modules).

Exposing into_inner allows applications to re-use the scratch allocation instead of calling out to the allocator for every array.

PrimitiveArrays and BooleanArrays already have an into_inner which deconstruct themselves into their constituent components - here we just bring Utf8Arrays and BinaryArrays in line with the other array types.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage has no change and project coverage change: -0.03 :warning:

Comparison is base (38a2e90) 83.66% compared to head (702eaba) 83.63%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1491 +/- ## ========================================== - Coverage 83.66% 83.63% -0.03% ========================================== Files 389 389 Lines 41874 41892 +18 ========================================== + Hits 35032 35035 +3 - Misses 6842 6857 +15 ``` | [Impacted Files](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1491?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | | |---|---|---| | [src/array/binary/mod.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1491?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2JpbmFyeS9tb2QucnM=) | `86.63% <0.00%> (-3.75%)` | :arrow_down: | | [src/array/utf8/mod.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1491?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L3V0ZjgvbW9kLnJz) | `79.62% <0.00%> (-2.75%)` | :arrow_down: | ... and [4 files with indirect coverage changes](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1491/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

sundy-li commented 1 year ago

LGTM. BTW, The Buffer is copy free, so it's free to call array.values().clone()

ritchie46 commented 1 year ago

Yeap, seems good. :+1: