nicwaller closed this issue 4 years ago
This is similar to the issue I described here: https://github.com/ironSource/parquetjs/issues/60 (check the last comment). I also included a local fix that I tested that may work. Waiting for feedback.
Thanks for reaching out about this and providing a repro case! I think this problem is caused by concurrent calls to appendRow, which are currently not supported. The best workaround for now is to ensure appendRow is not called concurrently, i.e. ensure that the previous call to appendRow has returned (using await) before issuing a new one.
Please also see the comments in https://github.com/ironSource/parquetjs/issues/60#issuecomment-641975079 and https://github.com/ironSource/parquetjs/pull/105#issuecomment-641991636
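For callers that cannot easily restructure their code around a single awaiting loop, one possible way to apply the workaround above is to route every call through a single promise chain. This is only a sketch: `serializeAppends` is a hypothetical helper, not part of parquetjs, and it assumes nothing about the writer beyond an `appendRow(row)` method that returns a promise.

```javascript
// Hypothetical helper (not part of parquetjs): funnels every appendRow call
// through one promise chain so that at most one call is in flight at a time.
function serializeAppends(writer) {
  // `tail` always points at the most recently queued append.
  let tail = Promise.resolve();

  return function appendRowSerialized(row) {
    // Start this append only after the previous one has settled.
    const result = tail.then(() => writer.appendRow(row));
    // Keep the chain usable even if this append rejects;
    // the caller still observes the error through `result`.
    tail = result.catch(() => {});
    return result;
  };
}

// Usage sketch (the emitter and event name are assumptions for illustration):
//   const append = serializeAppends(writer);
//   emitter.on('line', (line) => append({ key: line }));
```

The queue grows without bound if rows arrive faster than they can be written, so this does not replace proper backpressure on the input stream.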
Thanks @asmuth, your evaluation was correct. Even though I was using await, my use of event emitters meant that multiple invocations of appendRow could still be in flight at the same time. Since appendRow is not reentrant, I worked around the problem by using the for await construct instead:
for await (const line of lines) {
    await writer.appendRow({ key: line });
}
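To illustrate why await inside an event handler does not prevent overlap, here is a small self-contained sketch. It does not use parquetjs at all: `makeFakeWriter` is a hypothetical stand-in whose `appendRow` only counts how many calls are in flight at once, and the `'line'` event name is likewise an assumption.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a writer with an async, non-reentrant appendRow.
// It only records how many calls overlap; it is not the parquetjs writer.
function makeFakeWriter() {
  const stats = { inFlight: 0, maxInFlight: 0, rows: 0 };
  return {
    stats,
    async appendRow(row) {
      stats.inFlight += 1;
      stats.maxInFlight = Math.max(stats.maxInFlight, stats.inFlight);
      await new Promise((resolve) => setImmediate(resolve)); // simulate async I/O
      stats.rows += 1;
      stats.inFlight -= 1;
    },
  };
}

// Event-emitter style: each handler awaits its own call, but emit() does not
// wait for the promise an async handler returns, so the calls overlap.
// (Assumes a non-empty `lines` array, otherwise the promise never resolves.)
function viaEvents(lines) {
  const writer = makeFakeWriter();
  const emitter = new EventEmitter();
  return new Promise((resolve) => {
    let remaining = lines.length;
    emitter.on('line', async (line) => {
      await writer.appendRow({ key: line });
      remaining -= 1;
      if (remaining === 0) resolve(writer.stats);
    });
    for (const line of lines) emitter.emit('line', line);
  });
}

// Loop style: the next appendRow starts only after the previous one resolved.
async function viaForAwait(lines) {
  const writer = makeFakeWriter();
  for await (const line of lines) {
    await writer.appendRow({ key: line });
  }
  return writer.stats;
}
```

Comparing `maxInFlight` from `viaEvents` and `viaForAwait` for the same input shows the difference: the event-driven version starts every call before any of them finishes, while the loop never has more than one call in flight.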
I encountered a very strange bug using this library to generate parquet files that resulted in the output files containing duplicate rows and file sizes being massively inflated. Sometimes my output file in parquet format was 50x larger than my text input file!
I can reproduce this bug with a simple test case: read lines of text from a text file and add them to a parquet file. But I can only reproduce this when two conditions are met simultaneously:
Here's the program output:
Some additional info from parquet-tools.
And here's the test case.