Closed achille-roussel closed 2 years ago
```
name                                        old time/op    new time/op     delta
Encode/PLAIN/byte_array                       26.9µs ± 0%    65.9µs ± 1%   +144.98%  (p=0.000 n=10+10)
Encode/DELTA_LENGTH_BYTE_ARRAY/byte_array      113µs ± 1%      14µs ± 0%    -87.77%  (p=0.000 n=9+9)
Encode/DELTA_BYTE_ARRAY/byte_array             153µs ± 1%     145µs ± 1%     -5.23%  (p=0.000 n=10+10)
Decode/PLAIN/byte_array                       26.9µs ± 0%    80.5µs ± 1%   +199.59%  (p=0.000 n=9+10)
Decode/DELTA_LENGTH_BYTE_ARRAY/byte_array     21.0µs ± 0%    14.5µs ± 0%    -30.94%  (p=0.000 n=10+9)
Decode/DELTA_BYTE_ARRAY/byte_array            68.5µs ± 0%    37.4µs ± 0%    -45.41%  (p=0.000 n=10+10)

name                                        old speed      new speed       delta
Encode/PLAIN/byte_array                     5.53GB/s ± 0%   2.26GB/s ± 1%   -59.18%  (p=0.000 n=10+10)
Encode/DELTA_LENGTH_BYTE_ARRAY/byte_array   1.32GB/s ± 1%  10.77GB/s ± 0%  +717.76%  (p=0.000 n=9+9)
Encode/DELTA_BYTE_ARRAY/byte_array           971MB/s ± 1%   1025MB/s ± 1%    +5.52%  (p=0.000 n=10+10)
Decode/PLAIN/byte_array                     5.54GB/s ± 0%   1.85GB/s ± 1%   -66.62%  (p=0.000 n=9+10)
Decode/DELTA_LENGTH_BYTE_ARRAY/byte_array   7.08GB/s ± 0%  10.25GB/s ± 0%   +44.82%  (p=0.000 n=10+9)
Decode/DELTA_BYTE_ARRAY/byte_array          2.17GB/s ± 0%   3.98GB/s ± 0%   +83.18%  (p=0.000 n=10+10)

name                                        old value/s    new value/s     delta
Encode/PLAIN/byte_array                         372M ± 0%      152M ± 1%    -59.18%  (p=0.000 n=10+10)
Encode/DELTA_LENGTH_BYTE_ARRAY/byte_array      88.5M ± 1%    723.8M ± 0%   +717.73%  (p=0.000 n=9+9)
Encode/DELTA_BYTE_ARRAY/byte_array             65.3M ± 1%     68.9M ± 1%     +5.52%  (p=0.000 n=10+10)
Decode/PLAIN/byte_array                         372M ± 0%      124M ± 1%    -66.62%  (p=0.000 n=9+10)
Decode/DELTA_LENGTH_BYTE_ARRAY/byte_array       476M ± 0%      689M ± 0%    +44.80%  (p=0.000 n=10+9)
Decode/DELTA_BYTE_ARRAY/byte_array              146M ± 0%      268M ± 0%    +83.17%  (p=0.000 n=10+10)
```
Thanks for the reviews!
This change contributes to #226 by changing the in-memory representation of byte array pages.
Prior to this change, values in byte array pages were stored using the PLAIN encoding: each value had a 4-byte length prefix followed by its content. This memory layout meant that the only possible access was a sequential (and unpredictable) scan, increasing CPU cache misses and limiting the throughput at which byte array values could be encoded or decoded. Even in the case where the underlying file used the PLAIN encoding, validating the inputs resulted in poor performance.
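To illustrate why the old layout forces a sequential scan, here is a minimal sketch (not parquet-go's actual code; `decodePlainByteArrays` is a hypothetical helper) of reading PLAIN-encoded byte arrays, where each value is a 4-byte little-endian length prefix followed by the value bytes. Note that the position of value i+1 is only known after reading the length of value i:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodePlainByteArrays walks a PLAIN-encoded buffer: each value is a
// 4-byte little-endian length prefix followed by that many value bytes.
// The scan is inherently sequential, since the offset of the next value
// depends on the length of the current one.
func decodePlainByteArrays(buf []byte) ([][]byte, error) {
	var values [][]byte
	for len(buf) > 0 {
		if len(buf) < 4 {
			return nil, fmt.Errorf("truncated length prefix")
		}
		n := binary.LittleEndian.Uint32(buf)
		buf = buf[4:]
		if uint32(len(buf)) < n {
			return nil, fmt.Errorf("value length %d exceeds remaining buffer", n)
		}
		values = append(values, buf[:n])
		buf = buf[n:]
	}
	return values, nil
}

func main() {
	// "hi" and "go" encoded as PLAIN byte arrays.
	buf := []byte{2, 0, 0, 0, 'h', 'i', 2, 0, 0, 0, 'g', 'o'}
	values, err := decodePlainByteArrays(buf)
	if err != nil {
		panic(err)
	}
	for _, v := range values {
		fmt.Println(string(v))
	}
}
```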
Two other observations motivated this change:
To bridge with Arrow, the PLAIN encoding layout would have required re-encoding the byte array values to match the variable size binary layout. Using a memory layout mapping more directly to Arrow in parquet-go would greatly simplify the integration and improve efficiency when translating between the Arrow and Parquet layers.
The Parquet format states that the DELTA_LENGTH_BYTE_ARRAY encoding should be used as the default for byte array columns -- which is what we do in parquet-go -- so adopting a memory layout that favors this encoding instead of PLAIN will yield efficiency gains across a wider range of use cases.
As a result, this PR modifies the in-memory representation of byte array pages (containing variable length values) from using the PLAIN encoding to separating the data and layout information into two buffers.
The layout is an array of the byte offsets (32-bit unsigned integers) of the start of each value in the data buffer. The offsets array contains N+1 elements: the first offset marks the beginning of the values section, and the last offset equals the total length in bytes of the values. Offsets are therefore represented in memory in the same way as value offsets in Arrow's variable length binary layout (the one difference is that parquet-go does not require the data to be contiguous to the offset array in memory).
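The split layout described above can be sketched as follows. This is a simplified illustration, not parquet-go's actual page type (`byteArrayPage`, `index`, and `numValues` here are hypothetical names); the point is that any value can be fetched in constant time from the offsets:

```go
package main

import "fmt"

// byteArrayPage sketches the split layout: all value bytes concatenated
// in data, plus N+1 uint32 offsets where offsets[i] is the start of
// value i and offsets[N] equals len(data).
type byteArrayPage struct {
	data    []byte
	offsets []uint32
}

// index returns value i in constant time, something the PLAIN layout
// could only achieve with a sequential scan over all preceding values.
func (p *byteArrayPage) index(i int) []byte {
	return p.data[p.offsets[i]:p.offsets[i+1]]
}

// numValues is N: one fewer than the number of offsets.
func (p *byteArrayPage) numValues() int { return len(p.offsets) - 1 }

func main() {
	p := &byteArrayPage{
		data:    []byte("foobarbaz"),
		offsets: []uint32{0, 3, 6, 9},
	}
	for i := 0; i < p.numValues(); i++ {
		fmt.Printf("%d: %s\n", i, p.index(i))
	}
}
```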
Values in the data buffer are simply concatenated, ensuring maximum utilization of CPU caches when scanning the content, and, more importantly, matching the memory layout of the DELTA_LENGTH_BYTE_ARRAY encoding, which means that we can use a simple memory copy to decode values from parquet pages.
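As a rough sketch of why decoding becomes a memory copy: once the DELTA_LENGTH_BYTE_ARRAY lengths are decoded, a prefix sum turns them directly into the N+1 offsets of the split layout, and the concatenated value bytes can be taken with a single copy. The helper name below is hypothetical, not parquet-go's API:

```go
package main

import "fmt"

// offsetsFromLengths converts a decoded list of value lengths (as
// produced by the DELTA_LENGTH_BYTE_ARRAY encoding) into the N+1
// offsets of the split page layout via a prefix sum. Since the page
// stores the value bytes already concatenated, the data buffer itself
// can then be filled with a single memory copy of the page payload.
func offsetsFromLengths(lengths []int32) []uint32 {
	offsets := make([]uint32, len(lengths)+1)
	for i, n := range lengths {
		offsets[i+1] = offsets[i] + uint32(n)
	}
	return offsets
}

func main() {
	fmt.Println(offsetsFromLengths([]int32{3, 3, 3})) // → [0 3 6 9]
}
```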
In order to land this change, I made a few adjustments to the `encoding` package:

- I introduced a new `encoding.Values` container, which can embed arrays of values of all types supported by the parquet encodings; this helped implement the higher level code changes in the top level package.
- Methods of the `encoding.Encoding` interface now use stronger typing. Instead of working on `[]byte` only, we use strongly typed slices (e.g. `[]int32`), effectively reverting some of the changes introduced in https://github.com/segmentio/parquet-go/pull/175. Combined with the new `encoding.Values` type, this helped increase type safety between the abstraction layers, and makes more sense now that we no longer use the PLAIN encoding for the in-memory representation.

I think the changes to the `encoding` package will likely be useful to support https://github.com/segmentio/parquet-go/issues/283 as well, since stronger type checking will help with developing encoding extensions.

I still have a few improvements to make, and documentation to write, but overall I believe the change is in a good place to be reviewed. While I was able to measure the expected performance improvements on encoding of byte array values, some encoding benchmarks are showing regressions due to the removal of optimized routines that relied on the PLAIN encoding. I had taken a first pass at this change in the encoding-bytearray branch and was able to match or exceed the encoding and decoding performance on almost all benchmarks, so I will be sending follow-ups to back-port those changes here.
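For readers unfamiliar with the idea, here is a hypothetical sketch of what a tagged container like `encoding.Values` enables: one type that can carry strongly typed slices for each physical type, so encoding methods can accept `[]int32`, or a data buffer plus offsets for byte arrays, instead of a raw `[]byte`. All names and fields below are illustrative assumptions, not parquet-go's actual definitions:

```go
package main

import "fmt"

// Kind tags which typed slice a Values currently holds (hypothetical).
type Kind int

const (
	Int32Kind Kind = iota
	ByteArrayKind
)

// Values is a sketch of a tagged container for typed encoding buffers;
// the real encoding.Values in parquet-go may differ in shape and fields.
type Values struct {
	kind    Kind
	int32s  []int32
	data    []byte   // concatenated byte array values
	offsets []uint32 // N+1 offsets into data
}

// Int32ValuesOf wraps a strongly typed []int32 slice.
func Int32ValuesOf(v []int32) Values { return Values{kind: Int32Kind, int32s: v} }

// ByteArrayValuesOf wraps the split data+offsets byte array layout.
func ByteArrayValuesOf(data []byte, offsets []uint32) Values {
	return Values{kind: ByteArrayKind, data: data, offsets: offsets}
}

// Len returns the number of values, regardless of the underlying type.
func (v Values) Len() int {
	switch v.kind {
	case Int32Kind:
		return len(v.int32s)
	case ByteArrayKind:
		return len(v.offsets) - 1
	default:
		return 0
	}
}

func main() {
	a := Int32ValuesOf([]int32{1, 2, 3})
	b := ByteArrayValuesOf([]byte("foobar"), []uint32{0, 3, 6})
	fmt.Println(a.Len(), b.Len()) // → 3 2
}
```

The design choice being illustrated: encodings that only make sense for one physical type can reject mismatched inputs at the type level instead of reinterpreting raw bytes.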
Please take a look and let me know if you have any feedback!