sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Read is much slower than `pyarrow.parquet.read_table` #140

Open · asdetrefle opened this issue 6 years ago

asdetrefle commented 6 years ago

I know it's still a project in progress, but when testing it with some 40 files, I found it much slower than the normal way we read them from Python. Is that normal?

Each of my files has 21 columns, all doubles, and contains between 10k and 40k rows; around 600k rows in total.

The code I used to test is just:


```rust
use std::time::Instant;

let t1 = Instant::now();
// Materialize every row from the file into a vector.
let d = reader.get_row_iter(None).unwrap()
    .collect::<Vec<_>>();
let dt = Instant::now() - t1;
```

It takes around 3 seconds for all 40 files, but from Python it takes around 0.3 seconds. I use:

```python
import pyarrow.parquet as pq  # the assumed `pq` alias

t = pq.read_table(file).to_pandas()
```

Am I missing something?

sadikovi commented 6 years ago

Did you build the library with the release profile?

sadikovi commented 6 years ago

Also, I am not sure the test is comparable: in Rust you are collecting into a vector, while in Python you just get back a pandas DataFrame.

asdetrefle commented 6 years ago

Yes, with release. I agree it is not directly comparable, but I just wanted an idea of how much time it costs to load everything into memory and store it in a collection.

For example, one file with 40k rows took around 300 milliseconds, which works out to roughly 7.5 microseconds per row. That is slow in absolute terms and is unlikely to be caused by vector allocation alone.

sadikovi commented 6 years ago

Unfortunately, I do not have any figures for record assembly time to compare against right now. We will look into this performance issue, thanks.

asdetrefle commented 6 years ago

Thanks. I actually updated to the 1.30 nightly and retested this morning; it only takes 1.7 seconds now, which is quite strange. I think I was previously using 1.28.

sunchao commented 6 years ago

Thanks for reporting this @asdetrefle! Is there any way you can share the files? It would be great if we could reproduce this on our end and start investigating.

asdetrefle commented 6 years ago

I'll try to reproduce it on my personal PC and share the data later.

sadikovi commented 6 years ago

Yes, record assembly is very slow. I wrote a benchmark against our 10k test file; here are the results:

file: `parquet-rs/benches/record.rs`, run with `cargo bench bench_record_reader`:

```rust
#![feature(test)]
extern crate parquet;
extern crate test;

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;
use std::path::Path;
use test::Bencher;

#[bench]
fn bench_record_reader(bench: &mut Bencher) {
  let path = Path::new("/parquet-rs/data/10k-v2.parquet");
  let file = File::open(&path).unwrap();
  let len = file.metadata().unwrap().len();
  let parquet_reader = SerializedFileReader::new(file).unwrap();

  // Report throughput as bytes of the file per iteration.
  bench.bytes = len;
  bench.iter(|| {
    let iter = parquet_reader.get_row_iter(None).unwrap();
    let _ = iter.collect::<Vec<_>>();
  })
}
```

```
test bench_record_reader ... bench:  45,327,329 ns/iter (+/- 7,139,954) = 14 MB/s
test bench_record_reader ... bench:  47,012,510 ns/iter (+/- 1,836,735) = 14 MB/s
```

It is indeed very slow. Increasing the batch size to 4096 does not improve the time, but I think we should do it anyway. There are some minor improvements available around the vectors in the record tree, but not much. It looks like about half of the time is spent in the `read_batch` method of the column reader.

Most likely I introduced a regression when refactoring the column reader.

@asdetrefle @sunchao Could you verify the numbers by running the benchmark above?

sunchao commented 6 years ago

Thanks @sadikovi for doing this. I ran the benchmark on my machine and it shows a similar result:

```
running 1 test
test bench_record_reader ... bench:  31,935,280 ns/iter (+/- 2,408,977) = 20 MB/s
```

Can you file a PR to add this benchmark to the code base? We also need to find where most of the time is spent, maybe using a flamegraph.

asdetrefle commented 6 years ago

Thanks! Mine is similar:

```
running 1 test
test bench_record_reader ... bench:  19,153,415 ns/iter (+/- 203,102) = 34 MB/s
```

sunchao commented 6 years ago

I did some profiling on the benchmark and generated this flamegraph:

[screenshot: flamegraph of the full benchmark run]

Around 23% is spent in the bench harness itself; we should refine the bench to exclude this.

45.65% is spent in the `RowIter::next()` method, which breaks down as:

[screenshot: flamegraph zoomed into `RowIter::next()`]

It seems the majority of the time is spent in allocation/deallocation.

The flamegraph link is here.

sadikovi commented 6 years ago

Nice, thanks! Are there any action items we can extract from the profiling? Addressing the creation of structs/values in the record readers, or maybe improving how we fetch values from the encoders?

sadikovi commented 6 years ago

How difficult is it to generate flamegraphs? It would be great to profile decoders with it as well!

sunchao commented 6 years ago

@sadikovi

> Nice, thanks! Are there any action items we can extract from the profiling? Addressing the creation of structs/values in the record readers, or maybe improving how we fetch values from the encoders?

For detailed action items, I think we'll need to improve memory management, since most of the cost seems to come from there. I'll look at the code in detail to see whether there is any low-hanging fruit we can grab.

> How difficult is it to generate flamegraphs? It would be great to profile decoders with it as well!

First you'll need a Linux system such as Ubuntu; then you can follow the instructions I put in #35 to run the commands. Feel free to comment there if you get stuck. I can help run the profiling on the decoders as well.

asdetrefle commented 6 years ago

There is a method `Reader::read()` in the module `record/readers.rs`. I noticed that if I change

```rust
let mut fields = Vec::new();
```

to

```rust
let mut fields = Vec::with_capacity(readers.len());
```

it improves my use case by 25%, though it didn't change much for the 10k test file when I ran `cargo bench`. I assume this is because the test file has fewer columns than my files.

Furthermore, the allocation on the next line, `String::from(reader.field_name())`, costs at least 30% of the time in both cases. I tested this by changing it to `String::new()`, which needs no allocation.

Do we really need a copy of the field names in each `Row`? That seems very inefficient to me.

After these two changes, `cargo bench` shows a read speed of 50 MB/s for me, but my use case is still 3-4 times slower than Python.
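
For context, here is a minimal, self-contained sketch of the two changes, using hypothetical stand-ins for the `record/readers.rs` internals (the `ColumnState` type and its methods are invented for illustration):

```rust
// Hypothetical stand-ins for the record-reader internals; only the
// allocation pattern matters here, not the exact types.
struct ColumnState {
    name: String,
    next: f64,
}

impl ColumnState {
    fn field_name(&self) -> &str { &self.name }
    fn read_value(&mut self) -> f64 { self.next }
}

fn assemble_row(readers: &mut Vec<ColumnState>) -> Vec<(String, f64)> {
    // Change 1: the field count is known up front, so pre-allocate
    // instead of letting Vec::new() grow and reallocate per push.
    let mut fields = Vec::with_capacity(readers.len());
    for reader in readers.iter_mut() {
        // Change 2 (under discussion): this copies the field name once
        // per field per row, which is the ~30% allocation cost above.
        let name = String::from(reader.field_name());
        fields.push((name, reader.read_value()));
    }
    fields
}
```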

sadikovi commented 6 years ago

Yes, I found the same thing when writing the benchmark, but the improvements from `Vec::with_capacity` were minor when I ran it, as were the ones from switching between `String` and `&str`; 20-30% combined at most (still a good reduction!), though this could be my dev environment. It looks like a significant chunk of time is spent on the actual value extraction and collection, which is what @sunchao pointed out in his comment.

What I found after stripping all of that out is that the column reader itself is quite slow, taking about 40-50% of the total time.

But you are absolutely right: we should not be copying the field names for each row, and we should have a more compact representation, though one could argue the names are part of the struct. At the time it was easier to read/parse `Row`s that way, so I coded it like that. It would be good to revisit that part of the code and come up with a lighter version of `Row`.
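
One possible shape for such a lighter `Row`, sketched here as an assumption rather than a concrete proposal: store the field names once per schema and share them across rows, so each row carries only its values.

```rust
use std::rc::Rc;

// Illustrative types only, not the parquet-rs API.
struct Schema {
    field_names: Vec<String>,
}

struct LightRow {
    schema: Rc<Schema>, // shared by all rows, never copied per row
    values: Vec<f64>,
}

impl LightRow {
    // Name-based access still works; it just resolves through the schema.
    fn get(&self, name: &str) -> Option<f64> {
        let idx = self.schema.field_names.iter().position(|n| n == name)?;
        self.values.get(idx).cloned()
    }
}
```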

asdetrefle commented 6 years ago

Should we implement `Column` and `Table` as the Arrow C++ implementation does? IMHO a key feature of Parquet is its column-oriented storage, and I don't know whether collecting a `RowIter` can theoretically achieve good performance for this format.

I thought `Column` and `Table` would live in the `arrow` crate, but they are not implemented yet. I'm happy to help.
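
To make the proposal concrete, here is a rough sketch of what such types could look like; these shapes are purely illustrative assumptions, loosely modeled on the Arrow C++ concepts, not an existing API in this crate or in `arrow`:

```rust
// Illustrative only: a columnar Table/Column pair in the spirit of the
// Arrow C++ types discussed above.
enum Column {
    Double { name: String, values: Vec<f64> },
    Int64 { name: String, values: Vec<i64> },
}

impl Column {
    fn name(&self) -> &str {
        match self {
            Column::Double { name, .. } | Column::Int64 { name, .. } => name,
        }
    }
}

struct Table {
    num_rows: usize,
    columns: Vec<Column>,
}

impl Table {
    // Lookup by name; reads stay column-oriented instead of being
    // reassembled into rows.
    fn column(&self, name: &str) -> Option<&Column> {
        self.columns.iter().find(|c| c.name() == name)
    }
}
```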

sadikovi commented 6 years ago

Yes, Arrow support for reads is one of the items we plan to implement in the future. Since it takes time for the community to build all of the functionality we need, we chose to go with our custom record assembly and iterators first and integrate Arrow later. Unfortunately, we read data row-wise, so we have to build rows instead of working with columns (for columns we have the low-level column reader API).

I think it is a good idea to try implementing similar table and column concepts in this crate directly. If you want, you can scope the work, post a high-level overview of how it is going to work, and experiment.
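
For reference, the low-level column reader API mentioned above can be driven roughly as follows to pull one DOUBLE column with no record assembly at all. This is a sketch from memory of the API shape at the time, so treat the exact method names and signatures as assumptions:

```rust
extern crate parquet;

use std::fs::File;

use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, RowGroupReader, SerializedFileReader};

// Read every value of column `col` (assumed to be a required DOUBLE)
// across all row groups, batch by batch.
fn read_double_column(path: &str, col: usize) -> Vec<f64> {
    let file = File::open(path).unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let mut out = Vec::new();
    let mut batch = vec![0.0f64; 4096];

    for i in 0..reader.num_row_groups() {
        let rg = reader.get_row_group(i).unwrap();
        match rg.get_column_reader(col).unwrap() {
            ColumnReader::DoubleColumnReader(mut r) => loop {
                // A required column needs no def/rep level buffers; the
                // result is (values_read, levels_read).
                let (n, _) = r.read_batch(4096, None, None, &mut batch).unwrap();
                if n == 0 {
                    break;
                }
                out.extend_from_slice(&batch[..n]);
            },
            _ => panic!("column {} is not a DOUBLE column", col),
        }
    }
    out
}
```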

asdetrefle commented 6 years ago

Let me just go back to the original issue. Some more testing has given me more ideas.

Actually, the time spent on value extraction is minimal in my test case. (I am now using one of my files, which has 40k rows and 21 columns; it has more columns than the 10k bench file, which makes the difference more significant.) Some numbers:

- The total time for `get_row_iter().collect()` is ~80ms.
- The cumulative time for pushing fields, in `record/readers.rs` lines 326-328, is ~67ms.
- The cumulative time for `column.read_next().unwrap()`, in `record/readers.rs` line 341, is ~27ms.
- The cumulative time for the low-level value extraction `read_batch`, in `record/triplet.rs` lines 305-309, is ~7ms!

Notably, the method `read_batch` only got called 260 times, while its parent method `read_next` got called ~900k times (using 4096 as the batch size; 40k rows × 21 columns ≈ 840k values).

We can also see that a significant chunk of time is spent pushing to the `fields` vector; that line is likewise called #rows × #cols times.

So it's quite clear to me that there is no problem with the low-level column reader, but we need to optimize the design of the triplet reader.
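
To illustrate why those call counts matter, here is a minimal sketch of the buffering pattern under discussion, with invented types rather than the actual parquet-rs triplet internals: `read_batch` refills a buffer once per 4096 values, while `read_next` runs once per value, so any per-call overhead in it is multiplied by #rows × #cols.

```rust
// Invented types: a buffered per-column value iterator in the spirit of
// the triplet reader.
struct BufferedColumn {
    source: Vec<f64>, // stand-in for the underlying column chunk
    consumed: usize,  // values already pulled out of `source`
    buf: Vec<f64>,    // current batch, refilled by read_batch
    pos: usize,       // cursor into `buf`
}

impl BufferedColumn {
    // Called once per value (#rows x #cols times in total); it usually
    // just bumps an index, so per-call overhead here dominates.
    fn read_next(&mut self) -> Option<f64> {
        if self.pos == self.buf.len() && !self.read_batch(4096) {
            return None;
        }
        let v = self.buf[self.pos];
        self.pos += 1;
        Some(v)
    }

    // Called once per batch (~260 times in the test above).
    fn read_batch(&mut self, batch_size: usize) -> bool {
        let end = (self.consumed + batch_size).min(self.source.len());
        self.buf = self.source[self.consumed..end].to_vec();
        self.consumed = end;
        self.pos = 0;
        !self.buf.is_empty()
    }
}
```

With this shape, the hot path is the index bump and bounds check in `read_next`, not the batched I/O in `read_batch`, which matches the numbers above.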

sadikovi commented 6 years ago

Thanks for the numbers! I got a different time distribution when testing on my machine, for some reason. Interesting that `read_next` is only supposed to increment an index for most of those calls.

Everything needs performance improvements; I quite like the idea of having a `Table` similar to Arrow's.

It would be good to add your benchmarks, so we can run them afterwards and see the difference. We might also want to start scoping the task and try to fix the problem. Cheers!


asdetrefle commented 6 years ago

I was able to reproduce the file using simulation; see the pull request.

asdetrefle commented 6 years ago

I got some strange benchmark results for the new bench data:

```
test record_reader_10k_collect   ... bench:  17,170,066 ns/iter (+/- 99,153) = 38 MB/s
test record_reader_stock_collect ... bench: 118,736,233 ns/iter (+/- 1,352,017) = 10 MB/s
```

I will try adding some low-level benchmarks before opening a PR.