segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 58 forks source link

Feature: Leveled buffers #354

Closed joe-elliott closed 2 years ago

joe-elliott commented 2 years ago

This PR introduces leveled buffering to the pools in buffer.go. Each pool internally is divided into 16 different pools. The smallest pool is guaranteed to have buffers in size from 1024->2047. The next is 2048->4095, etc. When a buffer is requested the appropriate subpool is used to guarantee that the buffer will be large enough. This prevents all buffers in the pools to trend towards max requested size. The largest pool contains buffers of 32MB and larger. This is the one sub pool that does not guarantee that it will return a buffer of the appropriate size and a resize may be necessary.

We have seen a total heap reduction of ~25% in the components we tested.

Also, fixed a bug where resize() would drop a buffer.

abraithwaite commented 2 years ago

For my own understanding, this is essentially creating room by allowing smaller buffers to still be created without claiming ownership of a big buffer even when there's very large buffers being allocated for I/O. Is that understanding correct?

Also, without anything to prevent resize of buffers by the owner when it's claimed from a pool (not sure we'd want to do this), it's possible that a buffer would be taken from a smaller pool and returned to a larger pool, yeah? I presume this wouldn't be a big issue in practice with the size hints, but I'm curious how we're thinking about that from a bufferPool consumer perspective.

Very cool though, thanks! I think we should let Achille take a look before merging, but overall I think it looks great.

joe-elliott commented 2 years ago

For my own understanding, this is essentially creating room by allowing smaller buffers to still be created without claiming ownership of a big buffer even when there's very large buffers being allocated for I/O. Is that understanding correct?

Yup, this will choose more appropriately sized buffers when they are requested. This prevents all buffers trending toward the largest size. It also significantly reduces allocs b/c previously it would be very common to receive a smaller buffer then requested.

Also, without anything to prevent resize of buffers by the owner when it's claimed from a pool (not sure we'd want to do this), it's possible that a buffer would be taken from a smaller pool and returned to a larger pool, yeah?

Yup. I wanted to build something with as small of an impact to the codebase as possible. So all the existing code that pulls from the pool and then attempts to resize still continues working. This implementation still does have one case in which this is necessary: when you are dealing in buffers > 32MB.

Going to mark this draft real fast. There is one more change I want to make and test. As written if you request a 2047 byte buffer it will return one of exactly this size and then on put() it will place that buffer in the 1024 byte bucket. This is a bit wasteful. I'm going to change these lines:

if sz < basePoolIncrement {
    sz = basePoolIncrement
}

To align to the next pool up.

joe-elliott commented 2 years ago

I know it's not standard practice in this repo to write internal tests, but I'd love to test these changes directly :)

achille-roussel commented 2 years ago

We have some internal tests in places where it makes sense, feel free to add more if you think they are useful 👍

joe-elliott commented 2 years ago

I think I've addressed all review comments:

Thanks for reviewing!