Closed achille-roussel closed 2 years ago
```
name                                        old time/op    new time/op     delta
Encode/PLAIN/byte_array                       26.9µs ± 0%    65.9µs ± 1%   +144.98%  (p=0.000 n=10+10)
Encode/DELTA_LENGTH_BYTE_ARRAY/byte_array      113µs ± 1%      14µs ± 0%    -87.77%  (p=0.000 n=9+9)
Encode/DELTA_BYTE_ARRAY/byte_array             153µs ± 1%     145µs ± 1%     -5.23%  (p=0.000 n=10+10)
Decode/PLAIN/byte_array                       26.9µs ± 0%    80.5µs ± 1%   +199.59%  (p=0.000 n=9+10)
Decode/DELTA_LENGTH_BYTE_ARRAY/byte_array     21.0µs ± 0%    14.5µs ± 0%    -30.94%  (p=0.000 n=10+9)
Decode/DELTA_BYTE_ARRAY/byte_array            68.5µs ± 0%    37.4µs ± 0%    -45.41%  (p=0.000 n=10+10)

name                                        old speed      new speed       delta
Encode/PLAIN/byte_array                     5.53GB/s ± 0%   2.26GB/s ± 1%   -59.18%  (p=0.000 n=10+10)
Encode/DELTA_LENGTH_BYTE_ARRAY/byte_array   1.32GB/s ± 1%  10.77GB/s ± 0%  +717.76%  (p=0.000 n=9+9)
Encode/DELTA_BYTE_ARRAY/byte_array           971MB/s ± 1%   1025MB/s ± 1%    +5.52%  (p=0.000 n=10+10)
Decode/PLAIN/byte_array                     5.54GB/s ± 0%   1.85GB/s ± 1%   -66.62%  (p=0.000 n=9+10)
Decode/DELTA_LENGTH_BYTE_ARRAY/byte_array   7.08GB/s ± 0%  10.25GB/s ± 0%   +44.82%  (p=0.000 n=10+9)
Decode/DELTA_BYTE_ARRAY/byte_array          2.17GB/s ± 0%   3.98GB/s ± 0%   +83.18%  (p=0.000 n=10+10)

name                                        old value/s    new value/s     delta
Encode/PLAIN/byte_array                         372M ± 0%      152M ± 1%    -59.18%  (p=0.000 n=10+10)
Encode/DELTA_LENGTH_BYTE_ARRAY/byte_array      88.5M ± 1%    723.8M ± 0%   +717.73%  (p=0.000 n=9+9)
Encode/DELTA_BYTE_ARRAY/byte_array             65.3M ± 1%     68.9M ± 1%     +5.52%  (p=0.000 n=10+10)
Decode/PLAIN/byte_array                         372M ± 0%      124M ± 1%    -66.62%  (p=0.000 n=9+10)
Decode/DELTA_LENGTH_BYTE_ARRAY/byte_array       476M ± 0%      689M ± 0%    +44.80%  (p=0.000 n=10+9)
Decode/DELTA_BYTE_ARRAY/byte_array              146M ± 0%      268M ± 0%    +83.17%  (p=0.000 n=10+10)
```
Thanks for the reviews!
This change contributes to #226 by changing the in-memory representation of byte array pages.
Prior to this change, values in byte array pages were stored using the PLAIN encoding: each value had a 4-byte length prefix followed by its content. This memory layout meant that the only possible access was a sequential (and unpredictable) scan, increasing CPU cache misses and limiting the throughput at which byte array values could be encoded or decoded. Even in the case where the underlying file used the PLAIN encoding, validating the inputs resulted in poor performance.
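To illustrate why the old layout forces a sequential scan, here is a minimal sketch (not parquet-go's actual code; `decodePlainByteArrays` is a hypothetical helper) of reading PLAIN-encoded byte arrays, where each value is a 4-byte little-endian length prefix followed by the value bytes. Note that the position of value i+1 is only known after reading the length of value i:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodePlainByteArrays walks a PLAIN-encoded buffer: each value is a
// 4-byte little-endian length prefix followed by that many value bytes.
// The scan is inherently sequential, since the offset of the next value
// depends on the length of the current one.
func decodePlainByteArrays(buf []byte) ([][]byte, error) {
	var values [][]byte
	for len(buf) > 0 {
		if len(buf) < 4 {
			return nil, fmt.Errorf("truncated length prefix")
		}
		n := binary.LittleEndian.Uint32(buf)
		buf = buf[4:]
		if uint32(len(buf)) < n {
			return nil, fmt.Errorf("value length %d exceeds remaining buffer", n)
		}
		values = append(values, buf[:n])
		buf = buf[n:]
	}
	return values, nil
}

func main() {
	// "hi" and "go" encoded as PLAIN byte arrays.
	buf := []byte{2, 0, 0, 0, 'h', 'i', 2, 0, 0, 0, 'g', 'o'}
	values, err := decodePlainByteArrays(buf)
	if err != nil {
		panic(err)
	}
	for _, v := range values {
		fmt.Println(string(v))
	}
}
```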
Two other observations motivated this change:
To bridge with Arrow, the PLAIN encoding layout would have required re-encoding the byte array values to match the variable size binary layout. Using a memory layout mapping more directly to Arrow in parquet-go would greatly simplify the integration and improve efficiency when translating between the Arrow and Parquet layers.
The Parquet format states that the DELTA_LENGTH_BYTE_ARRAY encoding should be used as the default for byte array columns -- which is what we do in parquet-go -- so adopting a memory layout that favors this encoding instead of PLAIN will yield efficiency gains across a wider range of use cases.
As a result, this PR modifies the in-memory representation of byte array pages (containing variable length values) from using the PLAIN encoding to separating the data and layout information into two buffers.
The layout is an array of the byte offsets (32-bit unsigned integers) of the start of each value in the data buffer. The offsets array contains N+1 elements: the first offset marks the beginning of the values section, and the last offset equals the total length in bytes of the values. Offsets are therefore represented in memory in the same way as value offsets in Arrow's variable length binary layout (the one difference is that parquet-go does not require the data to be contiguous to the offset array in memory).
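The split layout described above can be sketched as follows. This is a simplified illustration, not parquet-go's actual page type (`byteArrayPage`, `index`, and `numValues` here are hypothetical names); the point is that any value can be fetched in constant time from the offsets:

```go
package main

import "fmt"

// byteArrayPage sketches the split layout: all value bytes concatenated
// in data, plus N+1 uint32 offsets where offsets[i] is the start of
// value i and offsets[N] equals len(data).
type byteArrayPage struct {
	data    []byte
	offsets []uint32
}

// index returns value i in constant time, something the PLAIN layout
// could only achieve with a sequential scan over all preceding values.
func (p *byteArrayPage) index(i int) []byte {
	return p.data[p.offsets[i]:p.offsets[i+1]]
}

// numValues is N: one fewer than the number of offsets.
func (p *byteArrayPage) numValues() int { return len(p.offsets) - 1 }

func main() {
	p := &byteArrayPage{
		data:    []byte("foobarbaz"),
		offsets: []uint32{0, 3, 6, 9},
	}
	for i := 0; i < p.numValues(); i++ {
		fmt.Printf("%d: %s\n", i, p.index(i))
	}
}
```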
Values in the data buffer are simply concatenated, ensuring maximum utilization of CPU caches when scanning the content, and, more importantly, matching the memory layout of the DELTA_LENGTH_BYTE_ARRAY encoding, which means that we can use a simple memory copy to decode values from parquet pages.
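As a rough sketch of why decoding becomes a memory copy: once the DELTA_LENGTH_BYTE_ARRAY lengths are decoded, a prefix sum turns them directly into the N+1 offsets of the split layout, and the concatenated value bytes can be taken with a single copy. The helper name below is hypothetical, not parquet-go's API:

```go
package main

import "fmt"

// offsetsFromLengths converts a decoded list of value lengths (as
// produced by the DELTA_LENGTH_BYTE_ARRAY encoding) into the N+1
// offsets of the split page layout via a prefix sum. Since the page
// stores the value bytes already concatenated, the data buffer itself
// can then be filled with a single memory copy of the page payload.
func offsetsFromLengths(lengths []int32) []uint32 {
	offsets := make([]uint32, len(lengths)+1)
	for i, n := range lengths {
		offsets[i+1] = offsets[i] + uint32(n)
	}
	return offsets
}

func main() {
	fmt.Println(offsetsFromLengths([]int32{3, 3, 3})) // → [0 3 6 9]
}
```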
In order to land this change, I made a few adjustments to the `encoding` package:

- I introduced a new `encoding.Values` container, which can embed arrays of values of all types supported by the parquet encodings; this helped implement the higher level code changes in the top level package.
- Methods of the `encoding.Encoding` interface now use stronger typing. Instead of working on `[]byte` only, we use strongly typed slices (e.g. `[]int32`), effectively reverting some of the changes introduced in https://github.com/segmentio/parquet-go/pull/175. Combined with the new `encoding.Values` type, this helped increase type safety between the abstraction layers, and makes more sense now that we no longer use the PLAIN encoding for the in-memory representation.

I think the changes to the `encoding` package will likely be useful to support https://github.com/segmentio/parquet-go/issues/283 as well, since stronger type checking will help with developing encoding extensions.

I still have a few improvements to make, and documentation to write, but overall I believe the change is in a good place to be reviewed. While I was able to measure the expected performance improvements on encoding of byte array values, some encoding benchmarks are showing regressions due to the removal of optimized routines that relied on the PLAIN encoding. I had taken a first pass at this change in the encoding-bytearray branch and was able to match or exceed the encoding and decoding performance on almost all benchmarks, so I will be sending follow-ups to back-port those changes here.
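For readers unfamiliar with the idea, here is a hypothetical sketch of what a tagged container like `encoding.Values` enables: one type that can carry strongly typed slices for each physical type, so encoding methods can accept `[]int32`, or a data buffer plus offsets for byte arrays, instead of a raw `[]byte`. All names and fields below are illustrative assumptions, not parquet-go's actual definitions:

```go
package main

import "fmt"

// Kind tags which typed slice a Values currently holds (hypothetical).
type Kind int

const (
	Int32Kind Kind = iota
	ByteArrayKind
)

// Values is a sketch of a tagged container for typed encoding buffers;
// the real encoding.Values in parquet-go may differ in shape and fields.
type Values struct {
	kind    Kind
	int32s  []int32
	data    []byte   // concatenated byte array values
	offsets []uint32 // N+1 offsets into data
}

// Int32ValuesOf wraps a strongly typed []int32 slice.
func Int32ValuesOf(v []int32) Values { return Values{kind: Int32Kind, int32s: v} }

// ByteArrayValuesOf wraps the split data+offsets byte array layout.
func ByteArrayValuesOf(data []byte, offsets []uint32) Values {
	return Values{kind: ByteArrayKind, data: data, offsets: offsets}
}

// Len returns the number of values, regardless of the underlying type.
func (v Values) Len() int {
	switch v.kind {
	case Int32Kind:
		return len(v.int32s)
	case ByteArrayKind:
		return len(v.offsets) - 1
	default:
		return 0
	}
}

func main() {
	a := Int32ValuesOf([]int32{1, 2, 3})
	b := ByteArrayValuesOf([]byte("foobar"), []uint32{0, 3, 6})
	fmt.Println(a.Len(), b.Len()) // → 3 2
}
```

The design choice being illustrated: encodings that only make sense for one physical type can reject mismatched inputs at the type level instead of reinterpreting raw bytes.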
Please take a look and let me know if you have any feedback!