sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Fix read_batch method in column reader #48

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR fixes the read_batch method, which had certain limitations/problems that I found while implementing a column vector and reading triplets.

Several changes were made:

I also updated the comments for the method.

I could not figure out how to add a test for this - I tested manually on a set of files I have that includes different combinations of values/def/rep levels.

Here is the output:

Before:

$ cargo run --bin read2
   Compiling parquet v0.1.0 (file:///parquet-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 6.26 secs
     Running `target/debug/read2`
version: 1
num of rows: 7
created by: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
message spark_schema {
  REQUIRED INT32 a;
  OPTIONAL group b {
    OPTIONAL INT32 _1;
    OPTIONAL BOOLEAN _2;
  }
}
Reading batch of records: 3
* Found 3 columns
* max def level: 0, max rep level: 0
- Read batch, values_read: 3, levels_read: 0
Read values: 3, values: [1, 2, 3], def: [0, 0, 0], rep: [0, 0, 0]
* max def level: 2, max rep level: 0
thread 'main' panicked at 'values.len() must be at least 4', src/column/reader.rs:194:7
note: Run with `RUST_BACKTRACE=1` for a backtrace.

If we debug further, this is what actually happens (because of the loop's terminating condition):

* max def level: 2, max rep level: 0
values_read: 0, values_to_read: 2, values.len(): 3
values_read: 2, values_to_read: 2, values.len(): 3
thread 'main' panicked at 'values.len() must be at least 4', src/column/reader.rs:195:7
note: Run with `RUST_BACKTRACE=1` for a backtrace.

After

$ cargo run --bin read2
   Compiling parquet v0.1.0 (file:///parquet-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 4.90 secs
     Running `target/debug/read2`
version: 1
num of rows: 7
created by: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
message spark_schema {
  REQUIRED INT32 a;
  OPTIONAL group b {
    OPTIONAL INT32 _1;
    OPTIONAL BOOLEAN _2;
  }
}
Reading batch of records: 3
* Found 3 columns
* max def level: 0, max rep level: 0
- Read batch, values_read: 3, levels_read: 0
Read values: 3, values: [1, 2, 3], def: [0, 0, 0], rep: [0, 0, 0]
* max def level: 2, max rep level: 0
- Read batch, values_read: 2, levels_read: 3
- Read batch, values_read: 0, levels_read: 0
Read values: 2, values: [1, 2, 0], def: [2, 1, 2], rep: [0, 0, 0]
* max def level: 2, max rep level: 0

With this patch, it should be okay.
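The post-patch behaviour in the output above can be sketched with a self-contained simplification. This is not the actual parquet-rs code - the page representation, names, and types are illustrative - but it shows why a null entry produces a definition level without a value, and how the read loop must bound itself by both the batch size and the output slices:

```rust
/// Sketch only: `page` holds (def_level, value) pairs for an OPTIONAL
/// column, where `None` marks a null (a level is decoded, but no value).
/// Returns (values_read, levels_read), bounded by `batch_size` and by
/// the lengths of both output slices so neither can overflow.
fn read_batch_sketch(
    page: &[(i16, Option<i32>)],
    batch_size: usize,
    values: &mut [i32],
    def_levels: &mut [i16],
) -> (usize, usize) {
    let mut values_read = 0;
    let mut levels_read = 0;
    while levels_read < batch_size
        && levels_read < def_levels.len()
        && values_read < values.len()
    {
        let Some(&(def, value)) = page.get(levels_read) else { break };
        def_levels[levels_read] = def;
        levels_read += 1;
        if let Some(v) = value {
            values[values_read] = v;
            values_read += 1;
        }
    }
    (values_read, levels_read)
}
```

For the page `[(2, Some(1)), (1, None), (2, Some(2))]` with batch size 3 this yields `values_read: 2, levels_read: 3`, matching the "After" output above; the pre-patch loop, which only checked `values_read < batch_size`, would keep scanning for a third value and run past the end of the slice.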

sadikovi commented 6 years ago

@sunchao Can you review this PR? Thanks!

Tests failed with some Thrift compilation error. All parquet-rs tests pass on my machine. Could you re-trigger the tests on Travis? It might be an intermittent problem with Thrift.

sunchao commented 6 years ago

Thanks @sadikovi for the PR. It seems you are addressing multiple issues in this PR. For my understanding, can you provide some simple examples to demonstrate the issues?

Currently, read_batch has some constraints on the input parameters. First, values, def_levels and rep_levels must have enough space to hold all the decoded values/levels. Secondly, you cannot have def_levels be None but rep_levels be Some, or vice versa. It seems you are trying to loosen these constraints, is that right?

sadikovi commented 6 years ago

Yes, you are right. I found that there are several related issues, and it is difficult to fix one of them without fixing the others - so I tried to refactor the whole method and make it work.

I found some inconsistencies, the major one being the current terminating condition, values_read < batch_size. What can happen is that we keep buffering definition levels until we reach values_read == batch_size, but because we never check the limit of the definition-levels slice, we can overflow it.

But overall you are right. I removed some of the constraints because I need that flexibility to implement an iterator that yields triplets (value, def level and rep level).

Now, with these changes, you can pass either definition levels or repetition levels, the buffers can be of different sizes (we handle that correctly), and the terminating conditions are updated so that we do not overflow the definition-level or repetition-level slices.
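One way to picture the relaxed contract is that every read is capped by the remaining room in each buffer the caller actually supplied. A minimal sketch, with illustrative names rather than the real parquet-rs signature (`None` stands for a level buffer the caller did not pass):

```rust
/// Sketch: how many more entries can be read this iteration without
/// exceeding the requested batch size or overflowing any supplied
/// buffer. Illustrative only, not the actual parquet-rs code.
fn iteration_cap(
    batch_size: usize,
    levels_read: usize,
    values_read: usize,
    values_len: usize,
    def_levels_len: Option<usize>,
    rep_levels_len: Option<usize>,
) -> usize {
    let mut cap = batch_size.saturating_sub(levels_read);
    // The values buffer always participates.
    cap = cap.min(values_len.saturating_sub(values_read));
    // Level buffers participate only when supplied.
    if let Some(n) = def_levels_len {
        cap = cap.min(n.saturating_sub(levels_read));
    }
    if let Some(n) = rep_levels_len {
        cap = cap.min(n.saturating_sub(levels_read));
    }
    cap
}
```

When the cap reaches zero the loop stops, regardless of which buffer ran out first - which is exactly what the old values_read < batch_size condition failed to do.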

I also updated the comments for the method to make it clear what we expect and what is going on.

sunchao commented 6 years ago

Hmm.. OK. I'm still not sure when the terminating condition will cause an error. Is that the issue you showed in the example output? It seems to be caused by the values slice not having enough space.

It would be great if you could add a few tests covering the cases that failed before (e.g., def_levels and rep_levels having different lengths) but succeed now. You could use make_pages to generate some sample data.

sadikovi commented 6 years ago

Yes, that is the output. As you can see, we incorrectly buffer records before the patch. The vector length is 3 and the batch size is 3, so we should read 3 values or levels and return, but the code fails. After the patch it correctly reads in batches of 3. There were a couple of other issues as well, for which I have not recorded output.

The main reason for updating this method is that it allows me to correctly do spacing when reading values and levels.
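As a rough illustration of what spacing means here (a sketch, not the PR's actual triplet iterator): the decoder hands back densely packed non-null values plus a definition level per entry, and spacing re-expands the values so that nulls occupy their logical slots.

```rust
/// Sketch of "spacing": given the column's max definition level, the
/// decoded def levels, and the densely packed non-null values, place
/// each value at its logical slot and `None` where the entry is null.
/// Illustrative only.
fn space_values(max_def_level: i16, def_levels: &[i16], values: &[i32]) -> Vec<Option<i32>> {
    let mut next = 0; // index of the next dense value
    def_levels
        .iter()
        .map(|&d| {
            if d == max_def_level {
                let v = values[next];
                next += 1;
                Some(v)
            } else {
                None // a lower def level means the entry is null at some depth
            }
        })
        .collect()
}
```

With the example from the PR description - def levels [2, 1, 2] and dense values [1, 2] under max def level 2 - spacing yields [Some(1), None, Some(2)].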

Yes, I will add tests in the next couple of days. Thanks.

sunchao commented 6 years ago

I see. Yes, I'm fine with changing the method to support more flexible use cases. I'll take a look at the current patch soon, and I'm looking forward to some test cases. :)

sunchao commented 6 years ago

@sadikovi I've merged the Thrift fix. You need to rebase the branch to trigger CI again.

sadikovi commented 6 years ago

@sunchao Thanks for fixing this! I will rebase shortly and start adding tests!

sadikovi commented 6 years ago

Found another issue - it might potentially be in one of the decodings; investigating. Update: it was an issue in the test, will update asap.

sadikovi commented 6 years ago

@sunchao I added the tests. Can you review? Thanks!

coveralls commented 6 years ago


Coverage increased (+0.1%) to 92.449% when pulling 2711dcf254ae66e654f7faff58b9457e2359766e on sadikovi:fix-read-batch-method into 4c07f35fc913c92cb5a71516f4cdf6f8782df014 on sunchao:master.

sadikovi commented 6 years ago

@sunchao I updated the code. Could you have a look again? Thanks!