sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Fix reading PLAIN_DICTIONARY field #13

Closed sadikovi closed 7 years ago

sadikovi commented 7 years ago

This PR fixes reading columns with PLAIN_DICTIONARY encoding.

Before the patch, reading of such columns would fail with unsupported encoding and/or would return empty buffer. Example of column that returns empty buffer:

column type: INT32
column path: "id"
encodings: RLE PLAIN_DICTIONARY PLAIN
file path: N/A
file offset: 55
num of values: 2
total compressed size (in bytes): 51
total uncompressed size (in bytes): 47
data page offset: 27
index page offset: N/A
dictionary page offset: 4

Patch removes seen_num_values increment for dictionary pages, because they are usually followed by data page that contains "values"/indices. Also fixed encoding reassignment.

Updated unittest to reflect these changes + tested manually.

Fixes #10.

sadikovi commented 7 years ago

@sunchao please review. Thanks!

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.3%) to 84.123% when pulling 6c7a316c9fb5041ad149bef0085e8bcd7a38afc1 on sadikovi:plain-dictionary-fix into ab570a6285d67d178c7d2bee03b512d00e887cbf on sunchao:master.

sadikovi commented 7 years ago

@sunchao updated code, so now I just reassign encoding like you suggested, though I have to use mut encoding in match statement for page. Is this okay?

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.2%) to 84.093% when pulling b076be172eb006c2cff63494895b5dae9f402e6e on sadikovi:plain-dictionary-fix into ab570a6285d67d178c7d2bee03b512d00e887cbf on sunchao:master.

sunchao commented 7 years ago

@sadikovi Yes I think it should be OK: the page is consumed upon usage so this modification won't be an issue.