Closed gnieto closed 6 years ago
Thanks @gnieto . My concern is that this makes the API more verbose and more confusing. For instance, user of RowIter
now needs to know what is the type for PageReader
and RowGroupReader
, which they shouldn't (not sure if associated types help here). Making RowGroupReader
sized also bring some restrictions.
The concern about dynamic dispatching is valid though - is it possible that we can have some micro-benchmark to show the actual benefit before making the decision?
Hi, sorry for the late reply.
I've written a benchmark (which, to be honest, I do not know if it's representative) to check how the dynamic dispatch affects.
The benchmark is the following one:
#[bench]
fn bench_get_column_reader(bench: &mut Bencher) {
use ::parquet::file::reader::FileReader;
let file = ::std::fs::File::open("data/10k-v2.parquet").unwrap();
let fr = ::parquet::file::reader::SerializedFileReader::new(file).unwrap();
let metadata = fr.metadata();
let group_amount = metadata.num_row_groups();
bench.iter(|| {
fr.get_row_group(0).unwrap();
})
}
#[bench]
fn bench_get_column_reader_rand(bench: &mut Bencher) {
use ::parquet::file::reader::FileReader;
let file = ::std::fs::File::open("data/10k-v2.parquet").unwrap();
let fr = ::parquet::file::reader::SerializedFileReader::new(file).unwrap();
let metadata = fr.metadata();
let group_amount = metadata.num_row_groups();
bench.iter(|| {
let rg = ::rand::random::<usize>() % group_amount;
fr.get_row_group(rg).unwrap();
})
}
I've ran the benchmarks 3 times.
On master, the results are the following:
test bench_get_column_reader ... bench: 267 ns/iter (+/- 1)
test bench_get_column_reader ... bench: 274 ns/iter (+/- 8)
test bench_get_column_reader ... bench: 267 ns/iter (+/- 2)
test bench_get_column_reader_rand ... bench: 293 ns/iter (+/- 5)
test bench_get_column_reader_rand ... bench: 297 ns/iter (+/- 7)
test bench_get_column_reader_rand ... bench: 292 ns/iter (+/- 3)
On this branch:
test bench_get_column_reader ... bench: 251 ns/iter (+/- 3)
test bench_get_column_reader ... bench: 252 ns/iter (+/- 4)
test bench_get_column_reader ... bench: 253 ns/iter (+/- 1)
test bench_get_column_reader_rand ... bench: 274 ns/iter (+/- 3)
test bench_get_column_reader_rand ... bench: 275 ns/iter (+/- 19)
test bench_get_column_reader_rand ... bench: 274 ns/iter (+/- 13)
I got the feeling that this is bound by IO, but it seems a little bit better w/o dynamic dispatch.
Even that, I fully agree that the API gets worse and I doubt that the change is worth. But, on top of https://github.com/sunchao/parquet-rs/pull/135/files, maybe PageReader
and/or RowGroupReader
can be removed and make SerializedPageReader
and SerializedRowGroupReader
generic to Read
.
But I'm happy to close this PRs if you think that the API gets too much complicated.
@gnieto thanks for doing this! yeah I'm not fully convinced that the perf gain is worthing changing the API... I do think though that we should replace hardcoded File
with something more generic such as ParquetReader
so that in future we can support more types of data sources such as HDFS. Is it possible to do that without making the API change?
I was thinking if it is worth making this PR a part of overall performance improvement work that we may need to do and discussing directions.
Yes we definitely need to address the perf issue. From the benchmark Iām not sure if this is the bottleneck though. Maybe we can keep this in mind and come back later if necessary, after we have more findings?
On Wed, Aug 8, 2018 at 1:09 AM Ivan notifications@github.com wrote:
I was thinking if it is worth making this PR a part of overall performance improvement work that we may need to do and discussing directions.
ā You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/136#issuecomment-411323848, or mute the thread https://github.com/notifications/unsubscribe-auth/AAe7N2obVSJRYCCXkm29va6SHqVo_F5Bks5uOpyigaJpZM4VlCu9 .
@sunchao Ok, let me close this and rework the previous PR. I will try to minimize the changes, but I think that I won't be able to avoid adding some generic parameters over Read
.
I've also parametrized FileReader with PagerReader and RowGroupReader, which avoids boxing and some dynamic dispatching.
This is an intermediate step to make FileReader generic on any
Read
. More context on: https://github.com/sunchao/parquet-rs/pull/135