jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Add SchemaInferenceOptions options to infer_schema and option to configure int96 inference #1533

Closed jaychia closed 1 year ago

jaychia commented 1 year ago

This PR addresses part 2 of #1527

It solves the problem of configuring arrow2's Parquet schema inference to infer Timestamp fields from Parquet Int96 fields differently based on user input.

  1. Adds a new SchemaInferenceOptions struct, which allows configuring how schema inference on Parquet files is performed
  2. Adds an int96_coerce_to_timeunit option to configure which time unit Parquet int96 fields are given when they are inferred as Arrow Timestamps
  3. Adds *_with_options variants of the infer_schema and parquet_to_arrow_schema APIs that take in the options
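
To illustrate why the target time unit matters, here is a minimal, self-contained sketch. It does not use arrow2 itself: the `TimeUnit` and `SchemaInferenceOptions` types below are stand-ins that mirror the names in this PR, the nanosecond default is an assumption, and `int96_to_timestamp` is a hypothetical helper, not part of the library. It relies only on the Parquet int96 timestamp layout (a Julian day number plus nanoseconds within that day) and shows that a date far in the future overflows an `i64` nanosecond timestamp but fits in a coarser unit.

```rust
// Stand-in types for illustration only; they mirror the names in this PR
// but are not copied from arrow2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SchemaInferenceOptions {
    /// Time unit used for Timestamp fields inferred from int96 columns.
    pub int96_coerce_to_timeunit: TimeUnit,
}

impl Default for SchemaInferenceOptions {
    // Assumption: nanoseconds as the default unit.
    fn default() -> Self {
        Self {
            int96_coerce_to_timeunit: TimeUnit::Nanosecond,
        }
    }
}

/// Julian day number of 1970-01-01, the Unix epoch.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;

/// Hypothetical helper: converts the two parts of an int96 timestamp
/// (Julian day, nanoseconds within the day) into a Unix timestamp in the
/// configured unit. Returns `None` if the value does not fit in an `i64`.
pub fn int96_to_timestamp(
    julian_day: u32,
    nanos_of_day: u64,
    options: &SchemaInferenceOptions,
) -> Option<i64> {
    let days = julian_day as i64 - JULIAN_DAY_OF_EPOCH;
    let seconds = days * SECONDS_PER_DAY;

    // (units per second, nanoseconds per unit)
    let (units_per_second, nanos_per_unit): (i64, u64) =
        match options.int96_coerce_to_timeunit {
            TimeUnit::Second => (1, 1_000_000_000),
            TimeUnit::Millisecond => (1_000, 1_000_000),
            TimeUnit::Microsecond => (1_000_000, 1_000),
            TimeUnit::Nanosecond => (1_000_000_000, 1),
        };

    // Sub-day part, truncated to the target unit.
    let sub_day = (nanos_of_day / nanos_per_unit) as i64;

    seconds
        .checked_mul(units_per_second)?
        .checked_add(sub_day)
}
```

With these stand-ins, a value one day and 1.5 seconds after the epoch converts in any unit, while a date around the year 3000 returns `None` under the nanosecond default and succeeds once the options request microseconds.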
codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 95.23% and project coverage change: +0.02% :tada:

Comparison is base (87ab844) 83.02% compared to head (4e4279f) 83.05%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1533      +/-   ##
==========================================
+ Coverage   83.02%   83.05%   +0.02%
==========================================
  Files         391      391
  Lines       42786    42866      +80
==========================================
+ Hits        35523    35602      +79
- Misses       7263     7264       +1
```

| [Files Changed](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1533?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | |
|---|---|---|
| [src/io/parquet/read/schema/convert.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1533?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2lvL3BhcnF1ZXQvcmVhZC9zY2hlbWEvY29udmVydC5ycw==) | `94.68% <94.62%> (+0.47%)` | :arrow_up: |
| [src/io/parquet/read/schema/mod.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1533?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2lvL3BhcnF1ZXQvcmVhZC9zY2hlbWEvbW9kLnJz) | `100.00% <100.00%> (ø)` | |

... and [5 files with indirect coverage changes](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1533/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)


jaychia commented 1 year ago

Hi @ritchie46 and @sundy-li

Here's a follow-up PR to https://github.com/jorgecarleitao/arrow2/pull/1532

(see PR description for more details)

sundy-li commented 1 year ago

BTW, int96 seems to be deprecated in parquet, it's not a stable feature. https://issues.apache.org/jira/browse/PARQUET-323

jaychia commented 1 year ago

> BTW, int96 seems to be deprecated in parquet, it's not a stable feature. https://issues.apache.org/jira/browse/PARQUET-323

Indeed, but it is still widely used and supported by many systems for backwards-compatibility reasons.

Unfortunately, because Parquet is a long-lived format and many enterprises use old versions of data frameworks, these deprecated features tend to live on long after their deprecation :)