Optimize parquet.MergeRowGroups

This PR improves the compute efficiency of parquet.MergeRowGroups. The main costs were the use of parquet.SortFunc and the implementation of bufferedRowGroupCursor, which was more complicated than necessary and did not take advantage of the batching capabilities of parquet.RowReader.
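To illustrate the general idea (this is not the code from this PR), here is a minimal sketch of a k-way merge whose cursors refill from their source one batch at a time and compare head values directly. Everything in it is hypothetical: intReader stands in for a batch-oriented reader in the spirit of parquet.RowReader, plain ints stand in for rows, and the names cursor, cursorHeap, and mergeSorted are made up for the example.

```go
package main

import (
	"container/heap"
	"io"
)

// intReader is a hypothetical stand-in for a batch-oriented reader:
// each call fills buf with up to len(buf) values and returns io.EOF
// once the source is exhausted.
type intReader interface {
	ReadInts(buf []int) (int, error)
}

// sliceReader serves values from an in-memory, already sorted slice.
type sliceReader struct{ vals []int }

func (r *sliceReader) ReadInts(buf []int) (int, error) {
	n := copy(buf, r.vals)
	r.vals = r.vals[n:]
	if len(r.vals) == 0 {
		return n, io.EOF
	}
	return n, nil
}

// cursor buffers one batch from its source so that the merge loop
// calls the reader once per batch rather than once per value.
type cursor struct {
	src  intReader
	buf  []int // reusable batch buffer
	rows []int // unread part of the current batch
	done bool  // src has returned io.EOF
}

// fill loads the next batch when the current one is used up and
// reports whether the cursor still has a value to offer.
func (c *cursor) fill() (bool, error) {
	for len(c.rows) == 0 && !c.done {
		n, err := c.src.ReadInts(c.buf)
		c.rows = c.buf[:n]
		if err == io.EOF {
			c.done = true
		} else if err != nil {
			return false, err
		}
	}
	return len(c.rows) > 0, nil
}

// cursorHeap is a min-heap of cursors ordered by their head value.
type cursorHeap []*cursor

func (h cursorHeap) Len() int           { return len(h) }
func (h cursorHeap) Less(i, j int) bool { return h[i].rows[0] < h[j].rows[0] }
func (h cursorHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeSorted merges sorted sources into one sorted slice, reading
// batchSize values at a time from each source (batchSize must be > 0).
func mergeSorted(srcs []intReader, batchSize int) ([]int, error) {
	h := make(cursorHeap, 0, len(srcs))
	for _, src := range srcs {
		c := &cursor{src: src, buf: make([]int, batchSize)}
		ok, err := c.fill()
		if err != nil {
			return nil, err
		}
		if ok {
			h = append(h, c)
		}
	}
	heap.Init(&h)

	var out []int
	for len(h) > 0 {
		c := h[0] // cursor holding the smallest head value
		out = append(out, c.rows[0])
		c.rows = c.rows[1:]
		ok, err := c.fill()
		if err != nil {
			return nil, err
		}
		if ok {
			heap.Fix(&h, 0) // head value changed, restore heap order
		} else {
			heap.Pop(&h) // source exhausted, drop its cursor
		}
	}
	return out, nil
}
```

Calling mergeSorted with three sliceReader values and a batch size of 2 returns their values in ascending order. The design point is that reader calls are amortized over a whole batch, and the heap compares typed head values rather than going through a generic comparison callback.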

The benchmark results show time per operation down by about 39-40% and MergeFiles row throughput up by about 66%:

name             old time/op  new time/op  delta
MergeRowBuffers   271ms ± 1%   162ms ± 0%  -40.29%  (p=0.000 n=10+10)

name                                                 old time/op  new time/op  delta
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=3,rows=30000  86.4µs ± 1%  52.8µs ± 0%  -38.84%  (p=0.000 n=9+10)

name                                                 old row/s    new row/s    delta
MergeFiles/FIXED_LEN_BYTE_ARRAY/groups=3,rows=30000   11.4M ± 1%   18.9M ± 0%  +66.48%  (p=0.000 n=10+10)

segmentio/parquet-go#431