jorgecarleitao / parquet2

Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

Deserialisation Error for Nested Types #134

Open sundeepks opened 2 years ago

sundeepks commented 2 years ago

Hi, I am hitting an error while deserialising a parquet file with nested types. Is there an implementation for the following code snippet (taken from the examples section)?

The code below executes when page.descriptor.max_rep_level > 0. Is there a primitive_nested implementation for byte arrays?


    _ => match page.dictionary_page() {
        None => match physical_type {
            PhysicalType::Int64 => Ok(primitive_nested::page_to_array::<i64>(page)?),
            _ => {
                todo!()
            }
        },
        Some(_) => match physical_type {
            PhysicalType::Int64 => Ok(primitive_nested::page_dict_to_array::<i64>(page)?),
            _ => {
                todo!()
            }
        },
    },

jorgecarleitao commented 2 years ago

Hey!

I know of two implementations: one in arrow2 and one under tests/.

The general idea is:

  1. split the page buffer into rep, def, and values

  2. attach three decoders: one for rep, one for def, and one for values. The rep and def decoders should be HybridRleDecoder; the values decoder should match whatever encoding the page uses (the nested logic is independent of the primitive type). Something like:

    let (rep_levels, def_levels, _) = split_buffer(page);
    
    let max_rep_level = page.descriptor.max_rep_level;
    let max_def_level = page.descriptor.max_def_level;
    
    let reps =
        HybridRleDecoder::new(rep_levels, get_bit_width(max_rep_level), page.num_values());
    let defs =
        HybridRleDecoder::new(def_levels, get_bit_width(max_def_level), page.num_values());
    
    let iter = reps.zip(defs);

    (see https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L271)

  3. advance the iterators and reconstruct the nested type according to the Dremel logic. This depends on how the specific format stores nested types (e.g. Vec<Vec<i32>> vs Vec<i32> + offsets). See e.g. https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L391 for how arrow2 does it.
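
To make step 3 concrete, here is a minimal sketch of the reconstruction for one specific shape: a required list of optional items, i.e. max_rep_level = 1 and max_def_level = 2. It assumes the levels have already been decoded into plain slices. `assemble_lists` is a hypothetical helper written for illustration, not a function from parquet2 or arrow2, and the meaning given to each def level is an assumption that holds only for this schema shape. The function is generic over the value type to illustrate that the nested logic does not depend on the primitive type, so the same code would apply to byte arrays.

```rust
/// Hypothetical helper (not part of parquet2 or arrow2).
/// Rebuilds `Vec<Vec<Option<T>>>` from already-decoded levels and values.
/// Assumes max_rep_level = 1 and max_def_level = 2, so that:
///   def == 0 -> the list is empty (no item, no value stored)
///   def == 1 -> the item is null (no value stored)
///   def == 2 -> the item is present (consumes one value)
fn assemble_lists<T, I>(reps: &[u32], defs: &[u32], values: I) -> Vec<Vec<Option<T>>>
where
    I: IntoIterator<Item = T>,
{
    let mut values = values.into_iter();
    let mut lists: Vec<Vec<Option<T>>> = Vec::new();

    for (&rep, &def) in reps.iter().zip(defs) {
        // rep == 0 marks the start of a new top-level list;
        // rep == 1 continues the current one.
        if rep == 0 {
            lists.push(Vec::new());
        }
        let current = lists
            .last_mut()
            .expect("the first repetition level must be 0");

        match def {
            0 => {} // empty list: nothing to push
            1 => current.push(None),
            _ => current.push(Some(
                values.next().expect("fewer values than the def levels imply"),
            )),
        }
    }
    lists
}
```

Calling it with reps `[0, 1, 0, 0]`, defs `[2, 1, 0, 2]` and values `[0, 10]` yields three lists: one holding `Some(0)` and `None`, one empty, and one holding `Some(10)`. A format that stores offsets instead of nested vectors would follow the same loop but push into a flat buffer and record an offset whenever rep is 0.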

One important thing to remember is that the length of the rep and def iterators (page.num_values) is not the number of values in the values iterator. For example:

    # [[0, None], [], [10]]
    reps, defs = list(
        zip(
            *[
                (0, 2),  # 0
                (1, 1),  # None
                (0, 0),  # [] (empty list)
                (0, 2),  # 10
            ]
        )
    )

The values in this case contain 2 entries (0 and 10), while the rep and def levels contain 4 entries each.
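
The mismatch between the two lengths can be checked directly: a value is stored only for entries whose definition level equals the maximum definition level. A small sketch, assuming the def levels are already decoded into a slice (`count_stored_values` is a hypothetical name used for illustration, not a parquet2 function):

```rust
/// Hypothetical helper: counts how many entries of the values iterator
/// a page should contain, given its decoded definition levels.
/// Only entries with def == max_def_level carry a stored value.
fn count_stored_values(defs: &[u32], max_def_level: u32) -> usize {
    defs.iter().filter(|&&def| def == max_def_level).count()
}
```

For the def levels `[2, 1, 0, 2]` with a maximum definition level of 2, this returns 2, even though the levels have 4 entries.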

sundeepks commented 2 years ago

Hey, thanks for the response. I was referring to the one in the tests: https://github.com/jorgecarleitao/parquet2/blob/fa6fa3ca3848c29d8efa80fbf42ee6a5a58cb077/tests/it/read/mod.rs . Would it be possible to complete the todo placeholder you have in the tests, or to share some reference code so that I can complete the todo part myself?