hyparam / hyparquet

parquet file parser for javascript
MIT License

Timestamps being incorrectly parsed sometimes #13

Closed: bmerrihort-sennen closed this issue 3 months ago

bmerrihort-sennen commented 3 months ago

Hi, I've uploaded a parquet file (link) that contains some timestamped data; trip_time and fall_time are the timestamp fields. The values in each column should be unique, which I've confirmed by viewing the data in a VS Code parquet viewer extension and by parsing it in Python with pyarrow. However, when I parse the file with hyparquet, a large block of rows comes back with trip_time: 2022-12-31T05:33:14.127Z and fall_time: 2022-12-31T05:33:19.167Z, even though each of these values occurs only once in the data.

These times do occur in the data, but only on one record: [image]

I used this code to test:

import * as fs from "fs"
import { parquetRead } from "hyparquet"

async function main() {
    // Read the whole file, then copy it into a standalone ArrayBuffer to pass to hyparquet
    const buffer = fs.readFileSync("timeline-cleaned.sp.parquet")
    const arrayBuffer = new Uint8Array(buffer).buffer
    const columnsToRead = ["trip_time", "fall_time", "iec_category_id"]
    await parquetRead({
        file: arrayBuffer,
        columns: columnsToRead,
        onComplete: data => {
            // Print trip_time, fall_time and iec_category_id for every row
            for (const record of data) {
                console.log(`${record[0].toISOString()} ${record[1].toISOString()} ${record[2]}`)
            }
            console.log(`read ${data.length} rows`)
        }
    })
}

main()

Excerpt of the output:

2022-04-11T20:31:57.423Z 2022-04-11T20:58:44.547Z 102020101
2022-12-31T03:53:03.250Z 2022-12-31T03:53:08.277Z 1010101
2022-12-31T05:33:14.127Z 2022-12-31T05:33:19.167Z 102020101
2022-12-31T05:33:14.127Z 2022-12-31T05:33:19.167Z 30201
2022-12-31T05:33:14.127Z 2022-12-31T05:33:19.167Z 102020101
2022-12-31T05:33:14.127Z 2022-12-31T05:33:19.167Z 1010101
2022-12-31T05:33:14.127Z 2022-12-31T05:33:19.167Z 102020101
2022-12-31T05:33:14.127Z 2022-12-31T05:33:19.167Z 1010101
2022-12-31T05:33:14.127Z 2022-12-31T05:33:19.167Z 102020101
2022-12-31T05:33:14.127Z 2022-12-31T05:33:19.167Z 30201
[continues]

This is using hyparquet 0.9.9.

park-brian commented 3 months ago

This is interesting, it looks like this only happens on the last day of the year. Would you be able to upload another parquet file with a row number column?

park-brian commented 3 months ago

For some reason, regenerating the parquet file seems to fix the issue: timeline-cleaned.sp.v2.parquet.zip

platypii commented 3 months ago

I'm looking into this. Thanks for the clear repro!

I believe it is related to readBitPacked failing when bitwidth is 17. Will keep you posted.
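For readers unfamiliar with bit packing, here is a minimal sketch of what a bit width of 17 means in practice. It assumes the least-significant-bit-first layout that Parquet uses for bit-packed runs in its RLE/bit-packing hybrid encoding. The helpers packBits and unpackBitsSlow are hypothetical names written for this illustration and are not part of hyparquet.

```javascript
// Hypothetical helpers for illustration only; not hyparquet's API.

// Pack unsigned integers into bytes, bitWidth bits per value, LSB first.
function packBits(values, bitWidth) {
    const bytes = new Uint8Array(Math.ceil(values.length * bitWidth / 8))
    let bitPos = 0
    for (const value of values) {
        for (let i = 0; i < bitWidth; i++, bitPos++) {
            if ((value >>> i) & 1) bytes[bitPos >> 3] |= 1 << (bitPos & 7)
        }
    }
    return bytes
}

// Reference decoder that reads one bit at a time (slow, but easy to check).
function unpackBitsSlow(bytes, bitWidth, count) {
    const out = []
    for (let n = 0; n < count; n++) {
        let value = 0
        for (let i = 0; i < bitWidth; i++) {
            const bitPos = n * bitWidth + i
            const bit = (bytes[bitPos >> 3] >> (bitPos & 7)) & 1
            value += bit * 2 ** i // arithmetic instead of shifts, so no sign issues
        }
        out.push(value)
    }
    return out
}

const values = [0, 1, 65536, 131071, 64, 99999, 2, 3]
const packed = packBits(values, 17)
const roundTrip = unpackBitsSlow(packed, 17, values.length)
console.log(packed.length, roundTrip)
```

At 17 bits per value, no value fits in two bytes, so each one spans at least three bytes of the packed stream. A faster decoder that buffers whole bytes in an integer therefore has to hold more pending bits than it does at smaller widths.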

platypii commented 3 months ago

@bmerrihort-sennen I just published v0.9.10, which includes a fix to readBitPacked for bit widths of 17 or greater. The issue was a signed >> shift operator where an unsigned >>> was needed.

I tested with your example file, but please confirm if that fixes it for you. Thanks again for the report!
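To illustrate how a signed shift can corrupt bit-packed values, here is a minimal sketch of a decoder that buffers bytes in a 32-bit integer. It is a hypothetical reconstruction based only on the description in this thread, not hyparquet's source; the function unpackBits and its unsignedShift flag are assumptions made for the demo, and the sketch only handles bit widths up to 24.

```javascript
// Hypothetical decoder for illustration only; not hyparquet's implementation.
// Buffers bytes in a 32-bit accumulator and drops consumed bytes by shifting right.
function unpackBits(bytes, bitWidth, count, unsignedShift) {
    const mask = (1 << bitWidth) - 1
    const out = []
    let data = 0   // accumulator of buffered bits
    let left = 0   // number of bits loaded into the accumulator
    let right = 0  // number of low bits already consumed
    let offset = 0
    while (out.length < count) {
        if (right > 8) {
            // Drop a fully consumed low byte. With >>, a set bit 31 is
            // copied into the vacated high bits instead of zeros.
            data = unsignedShift ? data >>> 8 : data >> 8
            left -= 8
            right -= 8
        } else if (left - right < bitWidth) {
            // Not enough buffered bits: load the next byte above the current ones
            data |= bytes[offset++] << left
            left += 8
        } else {
            out.push((data >> right) & mask)
            right += bitWidth
        }
    }
    return out
}

// Sixteen 17-bit values, each equal to 64 (only bit 6 set), packed LSB first
const bytes = new Uint8Array(34)
for (let k = 0; k < 16; k++) {
    const bit = 17 * k + 6
    bytes[bit >> 3] |= 1 << (bit & 7)
}

const good = unpackBits(bytes, 17, 16, true)  // unsigned shift
const bad = unpackBits(bytes, 17, 16, false)  // signed shift
console.log(good)
console.log(bad)
```

In this sketch, widths up to 16 never buffer more than 24 bits, so bit 31 of the accumulator stays clear and both shift operators behave the same. At width 17 a fourth byte can land in bits 24 to 31; if its top bit is set, the accumulator becomes negative, the signed shift fills the high bits with ones, and later values decode incorrectly.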