jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Add deserialization of Bytes -> Decimal #1534

Open jaychia opened 1 year ago

jaychia commented 1 year ago

Arrow2 already supports Parquet FixedLenByteArray -> Decimal conversion.

This PR adds support for Parquet (variable-length) ByteArray -> Decimal conversion, reusing most of the logic from the FixedLenByteArray conversion.

ritchie46 commented 1 year ago

> This PR adds support for Parquet (variable-length) ByteArray

I don't understand. Why would decimal be encoded in variable length binary?

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage has no change and project coverage change: -0.05% :warning:

Comparison is base (87ab844) 83.02% compared to head (ab04856) 82.98%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1534      +/-   ##
==========================================
- Coverage   83.02%   82.98%   -0.05%
==========================================
  Files         391      391
  Lines       42786    42814      +28
==========================================
+ Hits        35523    35529       +6
- Misses       7263     7285      +22
```

| [Files Changed](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1534?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | |
|---|---|---|
| [src/io/parquet/read/deserialize/simple.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1534?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2lvL3BhcnF1ZXQvcmVhZC9kZXNlcmlhbGl6ZS9zaW1wbGUucnM=) | `82.73% <0.00%> (-3.54%)` | :arrow_down: |
| [src/io/parquet/read/schema/convert.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1534?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2lvL3BhcnF1ZXQvcmVhZC9zY2hlbWEvY29udmVydC5ycw==) | `93.73% <0.00%> (-0.49%)` | :arrow_down: |

... and [6 files with indirect coverage changes](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1534/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)


jaychia commented 1 year ago

Hi @ritchie46, apologies for the late reply!

According to the Parquet spec, decimals can be encoded as `int32`, `int64`, `fixed_len_byte_array`, or `binary`.

See: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

> `binary`: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
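To make the `binary` case concrete, here is a minimal sketch of the conversion, not the code in this PR. It relies on the spec's rule that the unscaled value is stored as big-endian two's complement, and it assumes a 128-bit target, so inputs longer than 16 bytes are rejected. The function name `decimal_bytes_to_i128` is hypothetical.

```rust
/// Decode a Parquet decimal's unscaled value from big-endian,
/// two's-complement bytes of variable length (1 to 16 bytes).
///
/// Hypothetical helper for illustration only. Returns `None` when the
/// slice is empty or does not fit in an `i128`.
fn decimal_bytes_to_i128(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign-extend: pad the high-order bytes with 0xFF when the most
    // significant bit of the input is set (negative), otherwise with 0x00.
    let fill: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    // Right-align the input so its last byte stays the least significant one.
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}
```

For example, the value 123.45 with scale 2 has the unscaled value 12345, which the minimal encoding stores as the two bytes `[0x30, 0x39]`. The same function works unchanged on a `fixed_len_byte_array` value, since that encoding differs only in having a fixed width.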

ariesdevil commented 1 year ago

This also needs to be implemented for nested types: https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested.rs