segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 102 forks

Truncated column index for binary columns is incorrect #495

Closed ty-sentio-xyz closed 1 year ago

ty-sentio-xyz commented 1 year ago

The current implementation truncates the max value in the column index of a binary column to a length specified by config (16 bytes by default): https://github.com/segmentio/parquet-go/blob/f785677b9a75f25984869794be0f7ba9dc4a24c0/column_index.go#L509

If we look at how this is implemented in parquet-mr, we will see that there is a discrepancy: https://github.com/apache/parquet-mr/pull/481/files#diff-f9db9810e29543540ccae76593ec75e7c543396ef73f9cddb7cd9e44238317c7R109

When truncating the max value of a binary column index, parquet-mr increments the truncated value so that it remains an upper bound for the values it summarizes. That way, when a reader applies a filter against the column index, correct results can be obtained.

To give an example, let's say we have a binary column Col1:

When a reader tries to apply the filter Col1 == []byte{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}, the predicate value is lexicographically larger than the truncated max value in the column index, which makes filter pushdown incorrect.