Open sundeepks opened 2 years ago
Hey!
I know of 2: one in arrow2 and one under tests/.
The general idea is:
split the page buffer in rep, def, values
attach 3 decoders, one for rep, one for def, one for values - the rep and def should be HybridRleDecoder; the values should be whatever encoding is being used for that (the nested logic is independent of the primitive type). Something like:
let (rep_levels, def_levels, _) = split_buffer(page);
let max_rep_level = page.descriptor.max_rep_level;
let max_def_level = page.descriptor.max_def_level;
let reps =
    HybridRleDecoder::new(rep_levels, get_bit_width(max_rep_level), page.num_values());
let defs =
    HybridRleDecoder::new(def_levels, get_bit_width(max_def_level), page.num_values());
let iter = reps.zip(defs);
advance the iterators and reconstruct the nested type according to the Dremel logic. This depends on how the specific format stores nested types (e.g. Vec<Vec<i32>> vs Vec<i32> + offsets). See e.g. https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/nested_utils.rs#L391 for how arrow2 does it.
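As a rough illustration of that last step, here is a minimal sketch for one specific shape only: a required list of optional i32 (max_rep_level = 1, max_def_level = 2), rebuilt as Vec<Vec<Option<i32>>>. The function name reconstruct_lists is hypothetical and not part of parquet2 or arrow2, and it assumes the level decoders have already been turned into plain u32 iterators. Other schemas need different level handling.

```rust
/// Rebuilds a list column from (rep, def) pairs and the decoded values.
///
/// Assumed schema (an assumption of this sketch, not a general rule):
/// a required list of optional items, so
///   def == 0 -> the list is empty,
///   def == 1 -> the item is null,
///   def == 2 -> the item is present and consumes one entry from `values`.
/// rep == 0 starts a new list; rep == 1 continues the current one.
fn reconstruct_lists(
    levels: impl Iterator<Item = (u32, u32)>,
    mut values: impl Iterator<Item = i32>,
) -> Vec<Vec<Option<i32>>> {
    let mut lists: Vec<Vec<Option<i32>>> = Vec::new();
    for (rep, def) in levels {
        if rep == 0 {
            // A repetition level of 0 marks the start of a new top-level list.
            lists.push(Vec::new());
        }
        // Panics on malformed input where the first rep level is not 0.
        let current = lists.last_mut().expect("first rep level must be 0");
        match def {
            0 => {} // empty list: nothing to push
            1 => current.push(None),
            // Only fully defined slots pull from the values iterator.
            _ => current.push(Some(values.next().expect("ran out of values"))),
        }
    }
    lists
}
```

With reps = [0, 1, 0, 0], defs = [2, 1, 0, 2] and values = [0, 10], this yields [[Some(0), None], [], [Some(10)]]. A format that stores Vec<i32> + offsets would push into a flat buffer and record an offset whenever rep == 0 instead.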
One important thing to remember is that the length of the rep and def iterators (page.num_values) is not the number of values in the values iterator. For example:
# [[0, None], [], [10]]
reps, defs = list(
    zip(
        *[
            (0, 2),  # 0
            (1, 1),  # None
            (0, 0),  # [] (empty list, no value stored)
            (0, 2),  # 10
        ]
    )
)
The values in this case contain 2 entries (0 and 10), while the rep and def levels contain 4 entries each.
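A quick way to check this relationship: in Parquet a value is only stored for slots whose definition level equals the maximum definition level, so the expected length of the values iterator can be counted from the def levels. The helper below is a hypothetical sketch (the name num_stored_values is not from parquet2) and assumes the def levels are already decoded into a slice.

```rust
/// Counts how many entries the values iterator is expected to yield.
/// Only slots with def == max_def_level carry an actual value;
/// nulls and empty lists occupy a level slot but store nothing.
fn num_stored_values(def_levels: &[u32], max_def_level: u32) -> usize {
    def_levels.iter().filter(|&&d| d == max_def_level).count()
}
```

For the def levels [2, 1, 0, 2] with max_def_level = 2, this returns 2, matching the two stored values 0 and 10.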
Hey, thanks for the response. I was referring to the one in the tests: https://github.com/jorgecarleitao/parquet2/blob/fa6fa3ca3848c29d8efa80fbf42ee6a5a58cb077/tests/it/read/mod.rs. Would it be possible to complete the todo placeholder you have in the tests, or to point me to any reference code so I can complete the todo part myself?
Hi, I am facing an error while deserialising a parquet file with nested types. Do we have an implementation for the following code snippet (taken from the examples section)?
The code below executes when page.descriptor.max_rep_level > 0. Do we have a primitive_nested implementation for byte arrays?