Closed: sadikovi closed this 6 years ago
@sunchao Can you review this PR? Thanks!
Tests failed with some Thrift compilation error. All parquet-rs tests pass on my machine. Could you re-trigger the tests on Travis? It might be an intermittent problem with Thrift.
Thanks @sadikovi for the PR. Seems you are addressing multiple issues in this PR: for my understanding, can you provide some simple examples to demonstrate the issues?
Currently, the `read_batch` method has some constraints on the input parameters. First, the `values`, `def_levels` and `rep_levels` buffers must have enough space to hold all the decoded values / levels. Secondly, you cannot have `def_levels` be `None` but `rep_levels` be `Some`, and vice versa.
It seems you are trying to loosen these constraints, is that right?
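To make the old contract concrete, here is a toy sketch of the checks it implies. This is a hypothetical simplification, not the actual parquet-rs implementation; the buffer-length parameters stand in for real slices:

```rust
// Toy illustration of the *old* read_batch contract: def_levels and
// rep_levels must be both Some or both None, and every buffer passed in
// must be large enough to hold the whole batch.
fn check_old_contract(
    batch_size: usize,
    values_len: usize,
    def_levels_len: Option<usize>, // Some(len) if a buffer was passed
    rep_levels_len: Option<usize>,
) -> Result<(), String> {
    if def_levels_len.is_some() != rep_levels_len.is_some() {
        return Err("def_levels and rep_levels must both be Some or both be None".into());
    }
    let buffers = [
        ("values", Some(values_len)),
        ("def_levels", def_levels_len),
        ("rep_levels", rep_levels_len),
    ];
    for (name, len) in buffers {
        if let Some(len) = len {
            if len < batch_size {
                return Err(format!("{} buffer too small for batch", name));
            }
        }
    }
    Ok(())
}

fn main() {
    // Mixed Some/None is rejected under the old constraints.
    assert!(check_old_contract(3, 3, Some(3), None).is_err());
    // An undersized buffer is rejected as well.
    assert!(check_old_contract(4, 4, Some(3), Some(3)).is_err());
    // Matching buffers of sufficient size are accepted.
    assert!(check_old_contract(3, 3, Some(3), Some(3)).is_ok());
    println!("ok");
}
```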
Yes, you are right. I found that there are several related issues, and it is difficult to fix one of them without fixing the others - so I tried to refactor the whole method and make it work.
I found some inconsistencies, the major one being the current terminating condition, `values_read < batch_size`. What can happen is that we buffer definition levels until we reach `values_read == batch_size`. But because we do not check the limit of definition levels, we can overflow the vector slice that holds them.
But overall you are right. I just removed some of the constraints, because I need that to implement an iterator that returns triplets (value, def level and rep level).
Now, with these changes, you can pass either definition levels or repetition levels, buffers can be of different sizes (we handle that correctly), and the terminating conditions are updated so that we do not overflow the definition or repetition level slices.
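The relaxed batching logic described above can be sketched as follows. This is a hypothetical simplification, not the actual parquet-rs code: the idea is that each read is capped by the remaining capacity of every buffer that was provided, not just by `batch_size`:

```rust
// Hypothetical sketch: compute how many items can safely be read, bounding
// the batch by the remaining capacity of every provided buffer so that
// neither the values slice nor the level slices overflow.
fn effective_batch(
    batch_size: usize,
    values_remaining: usize,
    def_remaining: Option<usize>, // None when no def_levels buffer is passed
    rep_remaining: Option<usize>, // None when no rep_levels buffer is passed
) -> usize {
    let mut n = batch_size.min(values_remaining);
    if let Some(d) = def_remaining {
        n = n.min(d);
    }
    if let Some(r) = rep_remaining {
        n = n.min(r);
    }
    n
}

fn main() {
    // Buffers of length 3 with batch size 3: exactly 3 items are read.
    assert_eq!(effective_batch(3, 3, Some(3), None), 3);
    // A shorter def_levels slice now bounds the read instead of overflowing.
    assert_eq!(effective_batch(3, 3, Some(2), None), 2);
    println!("ok");
}
```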
I also updated comments for the method to make it clear what we expect and what is going on.
Hmm.. OK. I am still not sure when the terminating condition will cause an error. Is that the issue you showed in the example output? It seems to be caused by the `values` slice not having enough space.
It would be great if you can add a few tests to cover the cases that failed before (e.g., `def_levels` and `rep_levels` have different lengths) but succeed now. You could use `make_pages` to generate some sample data.
Yes, that is the output. As you can see, we incorrectly buffer records before the patch. Vector length is 3 and batch size is 3. We should be reading 3 values or levels and returning, but the code fails. After the patch it correctly reads in batches of 3. There were a couple of other issues for which I have not recorded output.
The main reason for updating this method is that it allows me to correctly do spacing when reading values and levels.
Yes, I will add tests in the next couple of days. Thanks.
I see. Yes, I'm fine with changing the method to support more flexible use cases. I'll take a look at the current patch soon, and I'm looking forward to some test cases. :)
@sadikovi I've merged the Thrift fix. You need to rebase the branch to trigger CI again.
@sunchao Thanks for fixing this! I will rebase shortly and start adding tests!
Found another issue, potentially in one of the decodings - investigating. It turned out to be an issue in a test; will update asap.
@sunchao I added the tests. Can you review? Thanks!
@sunchao I updated the code. Could you have a look again? Thanks!
This PR fixes the `read_batch` method, which had certain limitations/problems that I found when implementing column vector and reading triplets. Several changes were made:
- The read limit is now bounded by both `values` and `levels`. Previously it was just `values`, which, IMHO, is wrong in the case when there are fewer values than definition levels, so it will trigger an out-of-bound error.
- The end-of-reads condition is now `values_read == 0 && levels_read == 0`.
Also updated comments for the method.
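As a hypothetical illustration of the new return condition (this toy reader is not the parquet-rs API), a caller can drain a column by looping until a call reads neither values nor levels:

```rust
// Toy reader: read_batch returns (values_read, levels_read), and (0, 0)
// signals that the column is exhausted.
struct ToyReader {
    data: Vec<i32>,
    pos: usize,
}

impl ToyReader {
    fn read_batch(&mut self, batch_size: usize, values: &mut [i32]) -> (usize, usize) {
        let n = batch_size.min(values.len()).min(self.data.len() - self.pos);
        values[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        (n, n) // pretend one level per value, for simplicity
    }
}

fn main() {
    let mut reader = ToyReader { data: vec![1, 2, 3, 4, 5], pos: 0 };
    let mut buf = [0i32; 2];
    let mut total = 0;
    loop {
        let (values_read, levels_read) = reader.read_batch(2, &mut buf);
        // New condition: stop only when *nothing* was read.
        if values_read == 0 && levels_read == 0 {
            break;
        }
        total += values_read;
    }
    assert_eq!(total, 5);
    println!("ok");
}
```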
I could not figure out how to add a test for this - I tested manually on a set of files I have that includes different combinations of values/def/rep levels.
Here is the output:
Before:
If we debug further, this is what actually happens (because of the loop terminating condition):
After:
With this patch, it should be okay.