segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

pre-allocate slices to avoid allocations #511

Closed thorfour closed 1 year ago

thorfour commented 1 year ago

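DecodeByteArray is given nil buffers, so append has to grow the backing arrays repeatedly as values are decoded, which causes a lot of extra allocation. This PR adds a step that pre-computes the required buffer sizes and allocates them all up front to avoid that.

In our benchmark, allocated bytes drop by about 10% and run time improves slightly; the number of allocations is unchanged: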

thor@thors-MacBook-Pro compactor % benchstat before.txt pre-compute.txt
name              old time/op    new time/op    delta
_PreAggregate-10     4.38s ± 1%     4.35s ± 1%  -0.80%  (p=0.019 n=10+10)

name              old alloc/op   new alloc/op   delta
_PreAggregate-10    10.5GB ± 0%     9.4GB ± 1%  -9.89%  (p=0.000 n=10+10)

name              old allocs/op  new allocs/op  delta
_PreAggregate-10      128M ± 0%      128M ± 0%    ~     (p=0.971 n=10+10)
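
To illustrate the pre-sizing idea, here is a minimal sketch, not the code from this PR. It assumes a length-prefixed layout (a 4-byte little-endian length before each value, as in Parquet's PLAIN encoding for BYTE_ARRAY), and the names `scanByteArrays` and `decodeByteArrays` are hypothetical. A first pass only reads the length prefixes to count values and payload bytes, so each destination buffer is allocated at most once before the copy pass.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// errShortBuffer reports input that ends in the middle of a value.
var errShortBuffer = errors.New("short buffer")

// scanByteArrays (hypothetical helper) walks src once without copying and
// returns the number of values and the total payload size in bytes.
func scanByteArrays(src []byte) (count, total int, err error) {
	for len(src) > 0 {
		if len(src) < 4 {
			return 0, 0, errShortBuffer
		}
		n := int(binary.LittleEndian.Uint32(src))
		if n < 0 || n > len(src)-4 {
			return 0, 0, errShortBuffer
		}
		count++
		total += n
		src = src[4+n:]
	}
	return count, total, nil
}

// decodeByteArrays (hypothetical) appends the payloads to values and the
// value boundaries to offsets. If a destination lacks capacity, it is
// grown once to the exact size needed, so the appends below never resize.
func decodeByteArrays(values []byte, offsets []uint32, src []byte) ([]byte, []uint32, error) {
	count, total, err := scanByteArrays(src)
	if err != nil {
		return values, offsets, err
	}
	if cap(values)-len(values) < total {
		grown := make([]byte, len(values), len(values)+total)
		copy(grown, values)
		values = grown
	}
	if cap(offsets)-len(offsets) < count+1 {
		grown := make([]uint32, len(offsets), len(offsets)+count+1)
		copy(grown, offsets)
		offsets = grown
	}
	// The input was validated by the scan, so no bounds checks are repeated.
	offsets = append(offsets, uint32(len(values)))
	for len(src) > 0 {
		n := int(binary.LittleEndian.Uint32(src))
		values = append(values, src[4:4+n]...)
		offsets = append(offsets, uint32(len(values)))
		src = src[4+n:]
	}
	return values, offsets, nil
}

func main() {
	src := []byte{2, 0, 0, 0, 'h', 'i', 3, 0, 0, 0, 'f', 'o', 'o'}
	values, offsets, err := decodeByteArrays(nil, nil, src)
	fmt.Println(string(values), offsets, err) // hifoo [0 2 5] <nil>
}
```

The extra scan costs one more pass over the length prefixes, but it replaces repeated grow-and-copy cycles with a single exact-size allocation per buffer.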

I was also curious whether we could add a file open option to use an allocator for these buffers, and for things like the page buffers. That would give users more fine-grained control over memory usage. Something like the Arrow allocator, wdyt?

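type Allocator interface {
	// Allocate returns a new buffer of size bytes.
	Allocate(size int) []byte
	// Reallocate resizes b to size bytes and returns the resulting buffer.
	Reallocate(size int, b []byte) []byte
	// Free releases b back to the allocator.
	Free(b []byte)
}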
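
As a rough sketch of how such an interface might be used, the example below pairs a heap-backed default with a wrapper that tracks live bytes. Both `heapAllocator` and `trackingAllocator` are hypothetical names for illustration; neither is part of parquet-go, and how an allocator would be passed to a file open option is left open here. The wrapper is not safe for concurrent use.

```go
package main

import "fmt"

// Allocator repeats the interface proposed above so the sketch compiles alone.
type Allocator interface {
	Allocate(size int) []byte
	Reallocate(size int, b []byte) []byte
	Free(b []byte)
}

// heapAllocator (hypothetical) serves every request from the Go heap.
type heapAllocator struct{}

func (heapAllocator) Allocate(size int) []byte { return make([]byte, size) }

func (heapAllocator) Reallocate(size int, b []byte) []byte {
	if size <= cap(b) {
		return b[:size] // reuse the existing backing array
	}
	grown := make([]byte, size)
	copy(grown, b)
	return grown
}

// Free is a no-op: the garbage collector reclaims the buffer.
func (heapAllocator) Free([]byte) {}

// trackingAllocator (hypothetical) wraps another Allocator and records
// how many bytes are currently handed out.
type trackingAllocator struct {
	next Allocator
	live int
}

func (t *trackingAllocator) Allocate(size int) []byte {
	t.live += size
	return t.next.Allocate(size)
}

func (t *trackingAllocator) Reallocate(size int, b []byte) []byte {
	t.live += size - len(b)
	return t.next.Reallocate(size, b)
}

func (t *trackingAllocator) Free(b []byte) {
	t.live -= len(b)
	t.next.Free(b)
}

func main() {
	alloc := &trackingAllocator{next: heapAllocator{}}
	buf := alloc.Allocate(64)
	buf = alloc.Reallocate(256, buf)
	fmt.Println(len(buf), alloc.live) // 256 256
	alloc.Free(buf)
	fmt.Println(alloc.live) // 0
}
```

A wrapper like this is one way callers could observe or cap memory use; a pooling implementation could satisfy the same interface.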