segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Optimize parquet.MergeRowGroups #431

Closed achille-roussel closed 1 year ago

achille-roussel commented 1 year ago

This PR improves the compute efficiency of parquet.MergeRowGroups. Most of the cost came from two places: the use of parquet.SortFunc, and bufferedRowGroupCursor, whose implementation was overly complicated and did not take advantage of the batching capabilities of parquet.RowReader.

The benchmark results look promising:

name             old time/op  new time/op  delta
MergeRowBuffers   271ms ± 1%   162ms ± 0%  -40.29%  (p=0.000 n=10+10)

name                                                 old time/op  new time/op  delta
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=3,rows=30000  86.4µs ± 1%  52.8µs ± 0%  -38.84%  (p=0.000 n=9+10)

name                                                 old row/s    new row/s    delta
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=3,rows=30000   11.4M ± 1%   18.9M ± 0%  +66.48%  (p=0.000 n=10+10)