sadikovi closed this issue 5 years ago
@sunchao do you think we should update the benchmark or finally fix this Int96 conversion? Thanks!
I was also thinking about updating the internal Group field to use a simple list without strings. Here are the benchmark results:
Current master branch:
running 1 test
test record_reader_stock_simulated_collect ... bench: 356,025,412 ns/iter (+/- 42,326,086) = 3 MB/s
api::Row::Group field without String components (only fields):
running 1 test
test record_reader_stock_simulated_collect ... bench: 132,889,130 ns/iter (+/- 17,200,142) = 9 MB/s
Shall we make the change? It does change the public API a little bit, but it speeds up reading, and we can still print field names by passing a schema pointer.
Let me know what you think. Thanks!
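To make the proposal concrete, here is a minimal sketch of the idea using hypothetical simplified types (the real `Row`/`Field` types in parquet-rs are richer): instead of each `Group` owning a `Vec<(String, Field)>`, rows hold only a `Vec<Field>`, and field names are resolved through a shared schema pointer, so the per-row `String` allocations disappear.

```rust
use std::sync::Arc;

// Hypothetical simplified value type; the real parquet-rs `Field` enum has
// many more variants.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Long(i64),
    Str(String),
}

// Names live once, in the schema, instead of being copied into every row.
struct Schema {
    names: Vec<String>,
}

// Proposed layout: a row carries values only, plus a shared schema pointer.
struct Row {
    schema: Arc<Schema>,
    fields: Vec<Field>,
}

impl Row {
    // Name lookup goes through the shared schema rather than per-row strings.
    fn get(&self, name: &str) -> Option<&Field> {
        let idx = self.schema.names.iter().position(|n| n == name)?;
        self.fields.get(idx)
    }
}

fn main() {
    let schema = Arc::new(Schema {
        names: vec!["id".into(), "symbol".into()],
    });
    let row = Row {
        schema: Arc::clone(&schema),
        fields: vec![Field::Long(42), Field::Str("AAPL".into())],
    };
    assert_eq!(row.get("id"), Some(&Field::Long(42)));
    assert_eq!(row.get("missing"), None);
    println!("ok");
}
```

The saving in the benchmark comes from not cloning one `String` per field per row; printing with names is still possible because every row can reach the schema through the `Arc`.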
> @sunchao do you think we should update the benchmark or finally fix this Int96 conversion? Thanks!
I'm inclined to update the benchmark for now and address the INT96 conversion later. It seems non-trivial to fix the issue.
> Shall we make the change? It does change public API a little bit, but speeds up reading and we can print field names by passing schema pointer.
Yes, I think it is a good idea. I also tried to do this before. It doesn't make much sense for each Row to carry the field names. Like you suggested, perhaps we can supply the field names some other way so the API won't change much.
I am curious how the benchmark was working previously. It sounds like a regression from this Int96 conversion. Anyway, I am going to simply discard the column for now.
Interesting. It looks like this benchmark started to fail with this commit, so previously it did not fail but just returned garbage results?
I reverted that commit and tried reading the file, and it panicked (I checked that the reverted code is the first version we wrote).
$ git stash
Saved working directory and index state WIP on patch-reader-benchmark: 91704de patch benchmark
$ git revert 02402983b9f6d15a22e7cab5d647c683f844dff3
[patch-reader-benchmark 837ba27] Revert "update int96 code (#184)"
1 file changed, 9 insertions(+), 29 deletions(-)
$ cargo run --bin parquet-read ./data/10k-v2.parquet 10
Compiling parquet v0.4.2 (/parquet-rs)
Finished dev [unoptimized + debuginfo] target(s) in 14.60s
Running `target/debug/parquet-read ./data/10k-v2.parquet 10`
thread 'main' panicked at 'attempt to subtract with overflow', src/record/api.rs:497:28
note: Run with `RUST_BACKTRACE=1` for a backtrace.
Actually, it fails to read the file, but the benchmark runs! Ha, interesting. Anyway, it does not matter that much.
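The panic above (`attempt to subtract with overflow`) is what an unsigned subtraction underflow looks like in a debug build. An Int96 timestamp stores nanoseconds within a day plus a Julian day number, and converting to Unix time subtracts the epoch's Julian day (2,440,588); for a day number below that (pre-1970 dates, or garbage data such as all zeros), the unsigned subtraction underflows. This is a hedged sketch of that failure mode and a signed-arithmetic fix, not the actual parquet-rs conversion code:

```rust
// Julian day number of the Unix epoch, 1970-01-01.
const JULIAN_DAY_OF_EPOCH: u64 = 2_440_588;
const MILLIS_PER_DAY: i64 = 86_400_000;
const NANOS_PER_MILLI: u64 = 1_000_000;

// A version like this panics in debug builds whenever
// day < JULIAN_DAY_OF_EPOCH, because `day - JULIAN_DAY_OF_EPOCH`
// underflows in u64 arithmetic:
//
//     (day - JULIAN_DAY_OF_EPOCH) * 86_400_000 + nanos / 1_000_000
//
// Doing the subtraction in signed arithmetic instead makes pre-epoch
// (or garbage) day numbers produce negative millis rather than a panic.
fn int96_to_millis(day: u64, nanos: u64) -> i64 {
    let days_since_epoch = day as i64 - JULIAN_DAY_OF_EPOCH as i64;
    days_since_epoch * MILLIS_PER_DAY + (nanos / NANOS_PER_MILLI) as i64
}

fn main() {
    // One day after the epoch: exactly 86,400,000 ms.
    assert_eq!(int96_to_millis(2_440_589, 0), MILLIS_PER_DAY);
    // Day 0 would panic with unsigned subtraction; here it is just negative.
    assert!(int96_to_millis(0, 0) < 0);
    println!("ok");
}
```

Note that in release builds the unsigned version would silently wrap instead of panicking, which would also explain a benchmark that "runs" while producing garbage values.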
When running the benchmark record_reader_10k_collect, it fails with an error related to the work we did for the Int96 timestamp conversion. I think we should solve it somehow, or prioritise work on the Int96 conversion.