sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Make FileReader generic to Page and Row reader #136

Closed gnieto closed 6 years ago

gnieto commented 6 years ago

I've also parametrized FileReader with PagerReader and RowGroupReader, which avoids boxing and some dynamic dispatching.

This is an intermediate step to make FileReader generic on any Read. More context on: https://github.com/sunchao/parquet-rs/pull/135

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 568


Totals Coverage Status
Change from base Build 564: 0.0%
Covered Lines: 11349
Relevant Lines: 11890

šŸ’› - Coveralls
sunchao commented 6 years ago

Thanks @gnieto . My concern is that this makes the API more verbose and more confusing. For instance, user of RowIter now needs to know what is the type for PageReader and RowGroupReader, which they shouldn't (not sure if associated types help here). Making RowGroupReader sized also bring some restrictions.

The concern about dynamic dispatching is valid though - is it possible that we can have some micro-benchmark to show the actual benefit before making the decision?

gnieto commented 6 years ago

Hi, sorry for the late reply.

I've written a benchmark (which, to be honest, I do not know if it's representative) to check how the dynamic dispatch affects.

The benchmark is the following one:

#[bench]
fn bench_get_column_reader(bench: &mut Bencher) {
  use ::parquet::file::reader::FileReader;

  let file = ::std::fs::File::open("data/10k-v2.parquet").unwrap();
  let fr = ::parquet::file::reader::SerializedFileReader::new(file).unwrap();

  let metadata = fr.metadata();
  let group_amount = metadata.num_row_groups();

  bench.iter(|| {
    fr.get_row_group(0).unwrap();
  })
}

#[bench]
fn bench_get_column_reader_rand(bench: &mut Bencher) {
  use ::parquet::file::reader::FileReader;

  let file = ::std::fs::File::open("data/10k-v2.parquet").unwrap();
  let fr = ::parquet::file::reader::SerializedFileReader::new(file).unwrap();

  let metadata = fr.metadata();
  let group_amount = metadata.num_row_groups();

  bench.iter(|| {
    let rg = ::rand::random::<usize>() % group_amount;
    fr.get_row_group(rg).unwrap();
  })
}

I've ran the benchmarks 3 times.

On master, the results are the following:

test bench_get_column_reader      ... bench:         267 ns/iter (+/- 1)
test bench_get_column_reader      ... bench:         274 ns/iter (+/- 8)
test bench_get_column_reader      ... bench:         267 ns/iter (+/- 2)

test bench_get_column_reader_rand ... bench:         293 ns/iter (+/- 5)
test bench_get_column_reader_rand ... bench:         297 ns/iter (+/- 7)
test bench_get_column_reader_rand ... bench:         292 ns/iter (+/- 3)

On this branch:

test bench_get_column_reader      ... bench:         251 ns/iter (+/- 3)
test bench_get_column_reader      ... bench:         252 ns/iter (+/- 4)
test bench_get_column_reader      ... bench:         253 ns/iter (+/- 1)

test bench_get_column_reader_rand ... bench:         274 ns/iter (+/- 3)
test bench_get_column_reader_rand ... bench:         275 ns/iter (+/- 19)
test bench_get_column_reader_rand ... bench:         274 ns/iter (+/- 13)

I got the feeling that this is bound by IO, but it seems a little bit better w/o dynamic dispatch.


Even that, I fully agree that the API gets worse and I doubt that the change is worth. But, on top of https://github.com/sunchao/parquet-rs/pull/135/files, maybe PageReader and/or RowGroupReader can be removed and make SerializedPageReader and SerializedRowGroupReader generic to Read.

But I'm happy to close this PRs if you think that the API gets too much complicated.

sunchao commented 6 years ago

@gnieto thanks for doing this! yeah I'm not fully convinced that the perf gain is worthing changing the API... I do think though that we should replace hardcoded File with something more generic such as ParquetReader so that in future we can support more types of data sources such as HDFS. Is it possible to do that without making the API change?

sadikovi commented 6 years ago

I was thinking if it is worth making this PR a part of overall performance improvement work that we may need to do and discussing directions.

sunchao commented 6 years ago

Yes we definitely need to address the perf issue. From the benchmark Iā€™m not sure if this is the bottleneck though. Maybe we can keep this in mind and come back later if necessary, after we have more findings?

On Wed, Aug 8, 2018 at 1:09 AM Ivan notifications@github.com wrote:

I was thinking if it is worth making this PR a part of overall performance improvement work that we may need to do and discussing directions.

ā€” You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/pull/136#issuecomment-411323848, or mute the thread https://github.com/notifications/unsubscribe-auth/AAe7N2obVSJRYCCXkm29va6SHqVo_F5Bks5uOpyigaJpZM4VlCu9 .

gnieto commented 6 years ago

@sunchao Ok, let me close this and rework the previous PR. I will try to minimize the changes, but I think that I won't be able to avoid adding some generic parameters over Read.