The current implementation truncates the max value of a binary column index to a length specified by config (16 bytes by default): https://github.com/segmentio/parquet-go/blob/f785677b9a75f25984869794be0f7ba9dc4a24c0/column_index.go#L509
If we look at how this is implemented in parquet-mr, we will see a discrepancy: https://github.com/apache/parquet-mr/pull/481/files#diff-f9db9810e29543540ccae76593ec75e7c543396ef73f9cddb7cd9e44238317c7R109
When truncating the max value of a binary column index, parquet-mr increments the truncated value so that the stored value remains an upper bound and readers applying filters still obtain correct results.
To give an example, let's say we have a binary column Col1 whose max value is:

[]byte{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}

Truncated max value in the column index, as in the current implementation:

[]byte{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}

Truncated max value in the column index, as in the parquet-mr implementation:

[]byte{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16}
When a reader tries to apply the filter

Col1 == []byte{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}

the predicate is lexicographically larger than the truncated max value in the column index, so the page is wrongly considered not to contain a matching value, making filter pushdown incorrect.