segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

integer overflow #366

Closed thorfour closed 1 year ago

thorfour commented 1 year ago

I recently ran into the following panic when attempting to open a large parquet file.

panic: runtime error: index out of range [-425984]

goroutine 1 [running]:
github.com/segmentio/parquet-go.(*fileRowGroup).init(0x1401ebc6000, 0x14000182360, 0x10282d040?, {0x14000147880, 0xd, 0x1027adee0?}, 0x14002ec0000)
        /Users/thor/go/src/github.com/polarsignals/frostdb/vendor/github.com/segmentio/parquet-go/file.go:381 +0x4b0
github.com/segmentio/parquet-go.OpenFile({0x102828d98?, 0x1400012a050}, 0x37926e50, {0x0, 0x0, 0x0})
        /Users/thor/go/src/github.com/polarsignals/frostdb/vendor/github.com/segmentio/parquet-go/file.go:100 +0x760

This is caused by the writer casting a length to a narrower integer type without checking for overflow. https://github.com/segmentio/parquet-go/blob/771b3e358a03c412947573a15fe7af1e1eb3751a/writer.go#L640

Looking through the code a bit more, I noticed that this problem exists in a lot of places.

ex: https://github.com/segmentio/parquet-go/blob/771b3e358a03c412947573a15fe7af1e1eb3751a/writer.go#L606

https://github.com/segmentio/parquet-go/blob/771b3e358a03c412947573a15fe7af1e1eb3751a/writer.go#L333

While most of these are likely much harder to hit since they involve wider integers, the possibility still exists to hit this bug in other parts of the code.

I suggest we at least convert the Ordinal value to a uint16, since a negative value has no meaning, and add some casting guards in the writer code to catch overflow at write time. Looking further, it may be a good idea to convert all of these ints into their uint counterparts, since there doesn't seem to be any value in being able to represent negative numbers, from what I can tell.

achille-roussel commented 1 year ago

Hello @thorfour, thanks for reporting the issue!

This is caused by the writer casting a length to a narrower integer type without checking for overflow.

The parquet format defines the ordinal index of row groups as a signed 16-bit integer, which effectively enforces a limit of 32,767 row groups per file. For compatibility with other parquet libraries, I believe we should avoid changing it to an unsigned type.

That being said, the parquet writer should enforce this limit and error if it is asked to produce more than 32K row groups in a single file instead of generating an invalid parquet file and triggering a panic.

As for other places, conversions to larger integer types should be valid (e.g. int32 => int64), but I agree we could add more checks for conversions to smaller types to ensure that we detect overflows as early as possible.

Let me know what you think!

thorfour commented 1 year ago

Thank you! I didn't know where the formal spec definition was located. Seems wild that they would choose a signed integer, but I agree that we definitely shouldn't deviate (I had assumed they only defined the bit width).

Yeah, then I think we should just update the writer to throw an error if an overflow is going to occur.