ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Issue with decodeRunRepeated #127

Open aizard-trackinsight opened 3 years ago

aizard-trackinsight commented 3 years ago

Hi,

I think there is an issue here: https://github.com/ironSource/parquetjs/blob/07fb2fd8fc03bf2b57243531eaf91f2d60f5e460/lib/codec/rle.js#L114

I worked on a Parquet file where decodeRunRepeated was supposed to convert the buffer [18, 1] into the repeated value 274, but it yielded 19 instead.

[18, 1] is supposed to be interpreted as 18 * 2^(8*0) + 1 * 2^(8*1) = 18 + 256 = 274, which would lead to something like this:

value += (cursor.buffer[cursor.offset] << 8*i)

The current code yields the correct result if there is only one byte needed: [18, 0] yields 18 which is expected.

The issue is only visible if the Parquet file has repeated values of 256 or more, since those values need more than one byte to be encoded and the current code decodes them incorrectly.
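To make the expected arithmetic concrete, here is a minimal sketch of a little-endian decode loop. The helper name decodeLittleEndian and its byteCount parameter are hypothetical and not part of parquetjs; the cursor shape ({ buffer, offset }) is assumed from the snippet above. It multiplies by a power of two rather than using <<, which is a deliberate swap: in JavaScript, shifting a byte of 128 or more left by 24 bits produces a negative 32-bit integer.

```javascript
// Hypothetical helper (not the library's function): reads `byteCount` bytes
// starting at cursor.offset and combines them as a little-endian unsigned int.
function decodeLittleEndian(cursor, byteCount) {
  let value = 0;
  for (let i = 0; i < byteCount; i++) {
    // Byte i contributes buffer[offset] * 2^(8 * i).
    value += cursor.buffer[cursor.offset] * 2 ** (8 * i);
    cursor.offset += 1; // advance past the byte just consumed
  }
  return value;
}

const exampleCursor = { buffer: Buffer.from([18, 1]), offset: 0 };
console.log(decodeLittleEndian(exampleCursor, 2)); // 274
```

With a single significant byte, as in [18, 0], the same function returns 18, which matches what the current code already produces.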

I think value << 8 without an assignment has no effect. There might be a similar problem in the encoding function, but I haven't used it so far:

https://github.com/ironSource/parquetjs/blob/07fb2fd8fc03bf2b57243531eaf91f2d60f5e460/lib/codec/rle.js#L26
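On the remark that a bare value << 8 has no effect: the sketch below is a reconstruction of the behaviour described in this issue, not a copy of lib/codec/rle.js, and both function names are hypothetical. It also suggests that simply assigning the shift (value <<= 8) would not be enough, because shifting the accumulator before adding each byte reads the bytes in big-endian order.

```javascript
// Reconstruction (assumption) of a loop whose shift result is discarded.
function decodeWithDiscardedShift(bytes) {
  let value = 0;
  for (const byte of bytes) {
    value << 8; // computed and thrown away; `value` is unchanged
    value += byte;
  }
  return value;
}

// Same loop with the shift assigned: the result changes, but the first
// byte becomes the most significant one (big-endian), not the least.
function decodeWithAssignedShift(bytes) {
  let value = 0;
  for (const byte of bytes) {
    value <<= 8;
    value += byte;
  }
  return value;
}

console.log(decodeWithDiscardedShift([18, 1])); // 19  (18 + 1)
console.log(decodeWithAssignedShift([18, 1]));  // 4609 (18 * 256 + 1)
```

Neither result is the expected 274, which is why shifting each byte by 8*i, as proposed above, looks like the appropriate fix.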