jorgecarleitao / parquet2

Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

Enable setting `selected_rows` at runtime. #205

Closed RinChanNOWWW closed 1 year ago

RinChanNOWWW commented 1 year ago

It is useful to let users change `selected_rows` during iteration.

For example:

// Prefetch one column and apply predicates to it to get a bitmap.
let bitmap = pre_fetch_and_filter(pre);
let bitmap = Arc::new(Mutex::new(bitmap));
// Use this bitmap to iterate the remaining column(s) and select rows;
let pages = PageReader::new_with_page_meta(
        reader,
        reader_meta,
        pages_filter,
        scratch,
        max_header_size,
    )
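    .map(move |page| {
        page.map(|mut page| {
            // Restrict this page to the rows selected by the shared bitmap.
            page.select_rows(use_bitmap(bitmap.clone()));
            // Return the page so the iterator keeps yielding pages, not `()`.
            page
        })
    });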

let array_iter = column_iter_to_arrays(pages, ...);
// ...
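The `use_bitmap` helper above is not shown in this thread. As a rough sketch of what such a helper might do, the function below collapses a per-row boolean mask into runs of selected rows. The `Interval` struct here is a local, hypothetical stand-in with `start` and `length` fields, and `bitmap_to_intervals` is an illustrative name, not part of the library's API.

```rust
/// Hypothetical stand-in for an interval of selected rows
/// (defined locally for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval {
    pub start: usize,
    pub length: usize,
}

/// Collapse a per-row boolean mask into maximal runs of selected rows.
pub fn bitmap_to_intervals(mask: &[bool]) -> Vec<Interval> {
    let mut intervals = Vec::new();
    // Start index of the run currently being scanned, if any.
    let mut run_start: Option<usize> = None;

    for (i, &selected) in mask.iter().enumerate() {
        match (selected, run_start) {
            // A new run of selected rows begins here.
            (true, None) => run_start = Some(i),
            // The current run ends just before this row.
            (false, Some(start)) => {
                intervals.push(Interval { start, length: i - start });
                run_start = None;
            }
            // Either still inside a run, or still outside one.
            _ => {}
        }
    }

    // Close a run that extends to the last row.
    if let Some(start) = run_start {
        intervals.push(Interval { start, length: mask.len() - start });
    }
    intervals
}
```

With a mask such as `[false, true, true, false, true]`, this yields two intervals: rows 1 to 2 and row 4.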
RinChanNOWWW commented 1 year ago

I also have a question: how can we select rows of a nested type, for example a Struct?

When I set selected_rows on the columns of a Struct array, I get this error:

Decoding Int32 "Plain"-encoded required , index-filtered parquet pages.

In my use case, I can guarantee that all columns in the Struct have the same length.

cc @jorgecarleitao

This happens because the nested decoder does not support selected_rows:

// Nested Decoder
fn build_state(
      &self,
      page: &'a DataPage,
      dict: Option<&'a Self::Dictionary>,
  ) -> Result<Self::State> {
      let is_optional =
          page.descriptor.primitive_type.field_info.repetition == Repetition::Optional;
      let is_filtered = page.selected_rows().is_some();

      match (page.encoding(), dict, is_optional, is_filtered) {
          (Encoding::PlainDictionary | Encoding::RleDictionary, Some(dict), false, false) => {
              ValuesDictionary::try_new(page, dict).map(State::RequiredDictionary)
          }
          (Encoding::PlainDictionary | Encoding::RleDictionary, Some(dict), true, false) => {
              ValuesDictionary::try_new(page, dict).map(State::OptionalDictionary)
          }
          (Encoding::Plain, _, true, false) => Values::try_new::<P>(page).map(State::Optional),
          (Encoding::Plain, _, false, false) => Values::try_new::<P>(page).map(State::Required),
          _ => Err(utils::not_implemented(page)),
      }
  }

How about enabling selected_rows for nested types and asserting in the StructIterator that every column has the same length?
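A minimal sketch of the proposed length check, assuming the struct iterator can see how many rows each child column produced. `check_equal_lengths` is a hypothetical helper written for illustration. It is not existing code in the StructIterator.

```rust
/// Check that every child column of a struct produced the same number of rows.
/// Returns the common length, or a message naming the first mismatching child.
pub fn check_equal_lengths(lengths: &[usize]) -> Result<usize, String> {
    let mut iter = lengths.iter().copied();

    // A struct with no children trivially has zero rows.
    let first = match iter.next() {
        Some(len) => len,
        None => return Ok(0),
    };

    // `enumerate` starts at 0 for the second child, hence `i + 1`.
    for (i, len) in iter.enumerate() {
        if len != first {
            return Err(format!(
                "child {} has {} rows, expected {}",
                i + 1,
                len,
                first
            ));
        }
    }
    Ok(first)
}
```

Returning an error rather than panicking lets the caller surface a mismatch as a normal decoding failure. Whether an assertion or an error fits better is a design choice for the maintainers.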

jorgecarleitao commented 1 year ago

Ohh, that is correct - yes, we should add support for that in nested types also.

codecov-commenter commented 1 year ago

Codecov Report

Base: 85.12% // Head: 85.09% // Decreases project coverage by -0.03% :warning:

Coverage data is based on head (c35aecd) compared to base (06f0675). Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #205      +/-   ##
==========================================
- Coverage   85.12%   85.09%   -0.04%
==========================================
  Files          86       86
  Lines        8289     8292       +3
==========================================
  Hits         7056     7056
- Misses       1233     1236       +3
```

| [Impacted Files](https://codecov.io/gh/jorgecarleitao/parquet2/pull/205?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | |
|---|---|---|
| [src/page/mod.rs](https://codecov.io/gh/jorgecarleitao/parquet2/pull/205/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL3BhZ2UvbW9kLnJz) | `74.24% <0.00%> (-0.86%)` | :arrow_down: |
