Closed (bmerrihort-sennen closed 3 months ago)
This is interesting: it looks like this only happens on the last day of the year. Would you be able to upload another parquet file with a row number column?
For some reason, regenerating the parquet file seems to fix the issue: timeline-cleaned.sp.v2.parquet.zip
I'm looking into this. Thanks for the clear repro!
I believe it is related to readBitPacked failing when bitwidth is 17. Will keep you posted.
@bmerrihort-sennen I just published v0.9.10, which includes a fix to readBitPack when the bitwidth is 17 or greater. The issue was a signed >> vs unsigned >>> shift operator.
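For anyone reading along, here is a minimal sketch of how a signed shift can corrupt a bit-unpacked value. It is not hyparquet's actual code: the accumulator value, the shift amounts, and the names below are made up for illustration. It only relies on the fact that JavaScript bitwise operators work on signed 32-bit integers, so `>>` copies the sign bit into the vacated high bits while `>>>` fills them with zeros.

```javascript
// Hypothetical example values, not taken from hyparquet.
const MASK_17 = (1 << 17) - 1 // 0x1ffff, mask for a 17-bit value

// A 32-bit accumulator whose top bit is set. Bitwise operators treat it
// as a signed int32, so it is negative.
const acc = 0x80000000 | 0

// Drop 8 already-consumed low bits from the accumulator.
const afterSigned = acc >> 8 // sign-extends: the top 8 bits become 1s
const afterUnsigned = acc >>> 8 // zero-fills the top 8 bits

// Read a 17-bit value starting at bit 15 of what is left.
const valueSigned = (afterSigned >>> 15) & MASK_17
const valueUnsigned = (afterUnsigned >>> 15) & MASK_17

console.log(valueSigned, valueUnsigned)
```

The two results differ because the sign-extended 1s from `>>` land inside the 17-bit window, while `>>>` leaves only the single bit that was really set in the input.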
I tested with your example file, but please confirm if that fixes it for you. Thanks again for the report!
Hi, I've uploaded a parquet file (link) that contains some timestamped data, where trip_time and fall_time are the timestamp fields. The values in each column should be unique, which I've confirmed by looking at the data in a VS Code parquet viewer extension, and also by parsing it in Python using pyarrow. However, when I parse the file using hyparquet, there's a big block of rows that have trip_time: 2022-12-31T05:33:14.127Z and fall_time: 2022-12-31T05:33:19.167Z, even though these values only occur once in the data. These times do occur in the data, but only on one record:
I used this code to test:
Excerpt of the output:
This is using hyparquet 0.9.9.