sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks

Benchmark record_reader_10k_collect fails with Int96 error display conversion #201

Closed: sadikovi closed this issue 5 years ago

sadikovi commented 5 years ago

When running benchmark record_reader_10k_collect, it fails with the following error:

running 1 test
test record_reader_10k_collect             ... FAILED

failures:

---- record_reader_10k_collect stdout ----
thread 'main' panicked at 'Expected non-negative milliseconds when converting Int96, found -210866803200000', src/record/api.rs:504:7
note: Run with `RUST_BACKTRACE=1` for a backtrace.

failures:
    record_reader_10k_collect

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 2 filtered out

This relates to the earlier work on Int96-to-timestamp conversion. We should either work around it in the benchmark or prioritise fixing the Int96 conversion itself.

sadikovi commented 5 years ago

@sunchao do you think we should update the benchmark or finally fix this Int96 conversion? Thanks!

sadikovi commented 5 years ago

I was also thinking about changing the internal Group field to use a simple list without field-name strings. Here are the benchmark results:

Current master branch:

running 1 test
test record_reader_stock_simulated_collect ... bench: 356,025,412 ns/iter (+/- 42,326,086) = 3 MB/s

api::Row::Group field without String components (only fields):

running 1 test
test record_reader_stock_simulated_collect ... bench: 132,889,130 ns/iter (+/- 17,200,142) = 9 MB/s

Shall we make the change? It does change the public API a little, but it speeds up reading, and we can still print field names by passing a schema pointer.

Let me know what you think. Thanks!

sunchao commented 5 years ago

> @sunchao do you think we should update the benchmark or finally fix this Int96 conversion? Thanks!

I'm inclined to update the benchmark for now, and address the INT96 conversion later. It seems non-trivial to fix the issue.

> Shall we make the change? It does change the public API a little, but it speeds up reading, and we can still print field names by passing a schema pointer.

Yes, I think it is a good idea. I also tried to do this before. It doesn't make much sense for each Row to carry the field names. Like you suggested, perhaps we can supply the field names via other means so the API won't change much.

sadikovi commented 5 years ago

I am curious how the benchmark was working previously. It sounds like a regression introduced with this Int96 conversion. Anyway, I am going to simply discard the column for now.

sunchao commented 5 years ago

Interesting. It looks like this benchmark started to fail with this commit, so previously it did not fail but just returned a garbage result?

sadikovi commented 5 years ago

I reverted this commit and tried reading the file, and it panicked (I checked that the code is the first version we wrote).

$ git stash
Saved working directory and index state WIP on patch-reader-benchmark: 91704de patch benchmark
$ git revert 02402983b9f6d15a22e7cab5d647c683f844dff3
[patch-reader-benchmark 837ba27] Revert "update int96 code (#184)"
 1 file changed, 9 insertions(+), 29 deletions(-)
$ cargo run --bin parquet-read ./data/10k-v2.parquet 10
   Compiling parquet v0.4.2 (/parquet-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 14.60s
     Running `target/debug/parquet-read ./data/10k-v2.parquet 10`
thread 'main' panicked at 'attempt to subtract with overflow', src/record/api.rs:497:28
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Actually it fails to read the file, but the benchmark runs! Ha, interesting.

sadikovi commented 5 years ago

Anyway, it does not matter that much.