This PR modifies parquet.RowBuffer to hold row values in a single array instead of letting Go allocate arrays dynamically for each row held in the buffer.
This approach has multiple benefits:
With per-row arrays, avoiding reallocation of these value arrays means retaining the memory held by each row when the buffer is reset. This causes the value buffer sizes to trend towards the largest one over time. When the schema contains repeated columns, the size of each value array could reach ~10KiB, resulting in a lot of wasted memory.
Since most columns are not repeated, most rows would contain a small array with one element, which increases GC pressure because each array is managed as an individual object.
By keeping value buffers contiguous in memory we increase locality and help the code leverage CPU caches, which improves overall efficiency of the program.
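
To illustrate the idea, here is a minimal sketch of a buffer that keeps all row values in one shared slice and tracks row boundaries with offsets. It is not the code in this PR: the `value` and `rowBuffer` types and their methods are hypothetical names chosen for the example, and the real parquet.RowBuffer may be organized differently.

```go
package main

// value is a hypothetical stand-in for a parquet value.
type value struct {
	column int
	data   int64
}

// rowBuffer (hypothetical) stores the values of every row in one shared
// slice. Row i occupies values[offsets[i]:offsets[i+1]].
type rowBuffer struct {
	values  []value
	offsets []int
}

// WriteRow appends the row's values to the shared slice and records
// where the row ends.
func (b *rowBuffer) WriteRow(row []value) {
	if len(b.offsets) == 0 {
		b.offsets = append(b.offsets, 0)
	}
	b.values = append(b.values, row...)
	b.offsets = append(b.offsets, len(b.values))
}

// NumRows returns the number of rows currently held in the buffer.
func (b *rowBuffer) NumRows() int {
	if len(b.offsets) == 0 {
		return 0
	}
	return len(b.offsets) - 1
}

// Row returns a view of row i. The capacity is capped so that appending
// to the returned slice cannot overwrite the next row's values.
func (b *rowBuffer) Row(i int) []value {
	start, end := b.offsets[i], b.offsets[i+1]
	return b.values[start:end:end]
}

// Reset drops all rows but keeps the two backing arrays for reuse, so
// the retained memory is one allocation sized to the total number of
// values rather than one allocation per row.
func (b *rowBuffer) Reset() {
	b.values = b.values[:0]
	b.offsets = b.offsets[:0]
}
```

In this layout the garbage collector tracks two slices regardless of the number of rows, and iterating over rows walks memory sequentially.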