jorgecarleitao / parquet2

Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

Added functionality to optionally write bloom filters #218

Closed: ozgrakkurt closed 1 year ago

ozgrakkurt commented 1 year ago

closes https://github.com/jorgecarleitao/parquet2/issues/213

ozgrakkurt commented 1 year ago

@jorgecarleitao I think it is all done except for how to get the filters themselves. Can you give guidance on how I can construct the filter and pass it to `write_column_chunk`?

ozgrakkurt commented 1 year ago

Should be done now. Downstream code can optionally pass the filter in `RowGroupIter` along with the columns if the `bloom_filter` feature is enabled.

This is a breaking change
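
For readers following along, here is a rough sketch of what "passing the filter along with the columns" could look like. It is not the code in this PR: `ColumnPages`, `ColumnWithFilter`, and `write_row_group` are hypothetical names, the filter is treated as opaque bytes, and placing the filter right after its column is only for illustration. In a real crate the optional field could presumably be gated behind the feature flag.

```rust
/// Hypothetical stand-in for one column's already-encoded pages.
struct ColumnPages(Vec<u8>);

/// Hypothetical iterator item: a column plus an optional serialized bloom filter.
struct ColumnWithFilter {
    pages: ColumnPages,
    bloom_filter: Option<Vec<u8>>,
}

/// Writes every column to `out`, followed by its filter when one was supplied.
/// Returns, per column, the byte offset at which its filter starts (if any).
fn write_row_group<I>(out: &mut Vec<u8>, columns: I) -> Vec<Option<usize>>
where
    I: Iterator<Item = ColumnWithFilter>,
{
    columns
        .map(|item| {
            out.extend_from_slice(&item.pages.0);
            // Columns without a filter simply yield `None` here.
            item.bloom_filter.map(|filter| {
                let offset = out.len();
                out.extend_from_slice(&filter);
                offset
            })
        })
        .collect()
}
```

Callers that do not care about bloom filters would pass `bloom_filter: None` for every column; because the item type changes, existing callers have to be updated, which is what makes this kind of change breaking.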

ozgrakkurt commented 1 year ago

One comment resolved but not sure about the other one

ozgrakkurt commented 1 year ago

@jorgecarleitao how can I write tests for this?

ozgrakkurt commented 1 year ago

@jorgecarleitao should be fixed

jorgecarleitao commented 1 year ago

I think this just needs a test - would it be possible to perform a round-trip of a file with a written bloom filter to confirm that the filter we wrote is the filter we read?
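
A minimal sketch of the shape such a round-trip check could take, using only the standard library. Everything here is an assumption for illustration: the toy filter is not Parquet's split-block bloom filter, the length-prefixed framing is made up, and none of these functions are parquet2's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::{self, Cursor, Read, Write};

/// Maps a value and a seed to a bit position in a bitset of `len_bytes` bytes.
fn bit_index(len_bytes: usize, value: &str, seed: u64) -> usize {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    value.hash(&mut hasher);
    (hasher.finish() % (len_bytes as u64 * 8)) as usize
}

/// Toy bloom filter insert: sets three bits per value.
fn insert(bitset: &mut [u8], value: &str) {
    for seed in 0u64..3 {
        let bit = bit_index(bitset.len(), value, seed);
        bitset[bit / 8] |= 1u8 << (bit % 8);
    }
}

/// Returns false only if the value was definitely never inserted.
fn might_contain(bitset: &[u8], value: &str) -> bool {
    (0u64..3).all(|seed| {
        let bit = bit_index(bitset.len(), value, seed);
        bitset[bit / 8] & (1u8 << (bit % 8)) != 0
    })
}

/// Hypothetical framing: little-endian u32 length, then the raw bitset.
fn write_filter<W: Write>(writer: &mut W, bitset: &[u8]) -> io::Result<()> {
    writer.write_all(&(bitset.len() as u32).to_le_bytes())?;
    writer.write_all(bitset)
}

/// Reads back a filter written by `write_filter`.
fn read_filter<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    reader.read_exact(&mut len)?;
    let mut bitset = vec![0u8; u32::from_le_bytes(len) as usize];
    reader.read_exact(&mut bitset)?;
    Ok(bitset)
}

/// Builds a filter, writes it to an in-memory "file", and reads it back.
/// Returns (written, read) so a test can compare the two.
fn round_trip(values: &[&str]) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut written = vec![0u8; 32];
    for value in values {
        insert(&mut written, value);
    }
    let mut file = Vec::new();
    write_filter(&mut file, &written)?;
    let read = read_filter(&mut Cursor::new(file))?;
    Ok((written, read))
}
```

A test would then assert that the two bitsets returned by `round_trip` are byte-for-byte equal and that every inserted value is still reported as possibly present in the filter that was read back.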

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 93.52% and project coverage change: +0.35 :tada:

Comparison is base (ed0e1ff) 85.05% compared to head (e75da23) 85.40%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #218      +/-   ##
==========================================
+ Coverage   85.05%   85.40%   +0.35%
==========================================
  Files          86       87       +1
  Lines        8289     8415     +126
==========================================
+ Hits         7050     7187     +137
+ Misses       1239     1228      -11
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/bloom\_filter/mod.rs | `100.00% <ø> (ø)` | |
| src/error.rs | `20.51% <ø> (ø)` | |
| src/read/column/stream.rs | `0.00% <0.00%> (ø)` | |
| src/write/mod.rs | `75.00% <ø> (ø)` | |
| src/bloom\_filter/read.rs | `77.77% <78.57%> (+77.77%)` | :arrow_up: |
| src/write/row\_group.rs | `94.01% <88.88%> (-0.51%)` | :arrow_down: |
| src/write/column\_chunk.rs | `90.95% <97.50%> (+1.38%)` | :arrow_up: |
| src/bloom\_filter/write.rs | `100.00% <100.00%> (ø)` | |
| src/write/statistics.rs | `93.38% <100.00%> (+0.94%)` | :arrow_up: |

... and 1 file with indirect coverage changes


ozgrakkurt commented 1 year ago

Hey @jorgecarleitao, I wrote tests and also implemented an async method to read bloom filters.

ozgrakkurt commented 1 year ago

@jorgecarleitao can you check this when you have time?

ozgrakkurt commented 1 year ago

nvm