valyala / fasthttp

Fast HTTP package for Go. Tuned for high performance. Zero memory allocations in hot paths. Up to 10x faster than net/http
MIT License

Deprecate usage of valyala/bytebufferpool #1856

Open gaby opened 1 week ago

gaby commented 1 week ago

@erikdubbelboer I was curious about the performance of ByteBufferPool nowadays, so I re-ran the benchmarks mentioned in its README, and to my surprise sync.Pool is more performant these days. This probably also explains why VictoriaMetrics no longer uses ByteBufferPool, as seen here: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/lib/bytesutil/bytebuffer.go#L128

I created a workflow with benchmarks using go1.20 through 1.23 for both Ubuntu Latest and MacOS with M1 CPU. You can see the results here: https://github.com/gaby/go-benchmark/actions/runs/10721391990
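For reference, the sync.Pool variant these benchmarks exercise boils down to the following pattern (a minimal sketch; `process` and the 512-byte initial capacity are illustrative, not taken from the benchmark repo):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable byte slices. Storing a *[]byte (rather
// than a bare []byte) avoids an extra allocation when the slice is
// boxed into the pool's interface value.
var bufPool = sync.Pool{
	New: func() any { b := make([]byte, 0, 512); return &b },
}

// process borrows a buffer, uses it, and returns it to the pool.
func process(data string) int {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0] // reset length, keep capacity
	buf = append(buf, data...)
	n := len(buf)
	*bp = buf
	bufPool.Put(bp)
	return n
}

func main() {
	fmt.Println(process("hello")) // 5
}
```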

TLDR:

Ubuntu Latest - Golang v1.20.x

(benchmark results screenshot)

Ubuntu Latest - Golang v1.23.x

(benchmark results screenshot)

MacOS Latest (Apple M1) - Golang v1.23.x

(benchmark results screenshot)

erikdubbelboer commented 1 week ago

Thanks for making the suggestion and running the benchmarks! It does indeed look better to just switch to sync.Pool. It makes sense that it's faster now, since sync.Pool benefits from runtime optimizations that ByteBufferPool will never be able to use.

I'm open to a pull request to completely swap ByteBufferPool with sync.Pool.

gaby commented 1 week ago

@erikdubbelboer Awesome, I'm currently testing the changes on Fiber. Will report back and see how much work it requires to do the swap in Fasthttp.

newacorn commented 1 week ago

In any version of Go, bytebufferpool is slower than using sync.Pool directly, and the performance gap is nearly constant, because bytebufferpool is designed to optimize memory usage as much as possible. Its goal is to prevent the pool from filling up with large slices retained by a few high-memory requests when the majority of requests don't need such large allocations. The gap between Go 1.18 and Go 1.23 is almost the same because both implementations are built on sync.Pool; bytebufferpool additionally performs bookkeeping when byte slices are recycled and recalibrates its size threshold once a call-count threshold is reached.

❯ ~/go/go1.18/bin/go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/gaby/go-benchmark/buffer/main
cpu: 13th Gen Intel(R) Core(TM) i9-13900KS
BenchmarkGenericBuf-32                           4131448           270.7 ns/op
BenchmarkGenericStackBuf-32                      2675854           451.5 ns/op
BenchmarkAllocBuf-32                             4409329           295.9 ns/op
BenchmarkSyncPoolBuf-32                         305353983            3.810 ns/op
BenchmarkBpoolPoolBuf-32                         4470063           266.0 ns/op
BenchmarkByteBufferPoolBuf-32                   53832303            21.90 ns/op
BenchmarkEasyJsonBuffer-32                       7854277           157.2 ns/op
BenchmarkEasyJsonBuffer_OptimizedConfig-32      124866505           10.04 ns/op
PASS
ok      github.com/gaby/go-benchmark/buffer/main    13.609s
❯ ~/go/go1.23.0/bin/go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/gaby/go-benchmark/buffer/main
cpu: 13th Gen Intel(R) Core(TM) i9-13900KS
BenchmarkGenericBuf-32                           3460868           375.0 ns/op
BenchmarkGenericStackBuf-32                      3203481           371.2 ns/op
BenchmarkAllocBuf-32                             3103330           363.0 ns/op
BenchmarkSyncPoolBuf-32                         436120023            2.616 ns/op
BenchmarkBpoolPoolBuf-32                         4311795           278.1 ns/op
BenchmarkByteBufferPoolBuf-32                   55834230            20.91 ns/op
BenchmarkEasyJsonBuffer-32                       7717189           162.2 ns/op
BenchmarkEasyJsonBuffer_OptimizedConfig-32      88038954            13.53 ns/op
PASS
ok      github.com/gaby/go-benchmark/buffer/main    11.460s

After removing the bookkeeping and validation code from bytebufferpool, its speed is consistent with that of sync.Pool.

// Put releases byte buffer obtained via Get to the pool.
//
// The buffer mustn't be accessed after returning to the pool.
func (p *Pool) Put(b *ByteBuffer) {
    /*
        idx := index(len(b.B))

        if atomic.AddUint64(&p.calls[idx], 1) > calibrateCallsThreshold {
            p.calibrate()
        }

        maxSize := int(atomic.LoadUint64(&p.maxSize))
        if maxSize == 0 || cap(b.B) <= maxSize {
            b.Reset()
            p.pool.Put(b)
        }
    */
    b.Reset()
    p.pool.Put(b)
}
❯ ~/go/go1.18/bin/go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/gaby/go-benchmark/buffer/main
cpu: 13th Gen Intel(R) Core(TM) i9-13900KS
BenchmarkGenericBuf-32                           4051310           266.1 ns/op
BenchmarkGenericStackBuf-32                      4599678           252.7 ns/op
BenchmarkAllocBuf-32                             4741383           255.2 ns/op
BenchmarkSyncPoolBuf-32                         531053491            2.262 ns/op
BenchmarkBpoolPoolBuf-32                         4293218           276.6 ns/op
BenchmarkByteBufferPoolBuf-32                   314608893            3.550 ns/op
BenchmarkEasyJsonBuffer-32                       7363118           166.0 ns/op
BenchmarkEasyJsonBuffer_OptimizedConfig-32      100000000           10.33 ns/op
PASS
ok      github.com/gaby/go-benchmark/buffer/main    11.147s
❯ ~/go/go1.23.0/bin/go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/gaby/go-benchmark/buffer/main
cpu: 13th Gen Intel(R) Core(TM) i9-13900KS
BenchmarkGenericBuf-32                           3512586           366.1 ns/op
BenchmarkGenericStackBuf-32                      3442252           357.3 ns/op
BenchmarkAllocBuf-32                             3388725           369.6 ns/op
BenchmarkSyncPoolBuf-32                         333597666            3.660 ns/op
BenchmarkBpoolPoolBuf-32                         4197034           285.4 ns/op
BenchmarkByteBufferPoolBuf-32                   689101736            1.845 ns/op
BenchmarkEasyJsonBuffer-32                       7685049           156.6 ns/op
BenchmarkEasyJsonBuffer_OptimizedConfig-32      88073781            13.11 ns/op
PASS
ok      github.com/gaby/go-benchmark/buffer/main    11.897s

This benchmark does not reflect real-world usage. We could consider optimizing the algorithm, or give users a configuration switch so they can choose the pool implementation themselves, similar to how it's done in the client implementation. We shouldn't hastily replace it with a plain sync.Pool: in HTTP request and response handling, body sizes vary greatly across scenarios, and a rushed replacement might go against the original intent of the fasthttp author.
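One shape such a configuration switch could take (purely illustrative; `BufferPool`, `SimplePool`, and `CappedPool` are hypothetical names, and fasthttp exposes no such option today):

```go
package main

import (
	"fmt"
	"sync"
)

// BufferPool is a hypothetical interface fasthttp could accept so that
// users pick the pooling strategy themselves.
type BufferPool interface {
	Get() *[]byte
	Put(*[]byte)
}

// SimplePool is a thin sync.Pool wrapper: fastest, but it may retain
// oversized buffers indefinitely.
type SimplePool struct{ p sync.Pool }

func NewSimplePool() *SimplePool {
	sp := &SimplePool{}
	sp.p.New = func() any { b := make([]byte, 0, 256); return &b }
	return sp
}

func (s *SimplePool) Get() *[]byte { return s.p.Get().(*[]byte) }

func (s *SimplePool) Put(b *[]byte) {
	*b = (*b)[:0] // reset length, keep capacity
	s.p.Put(b)
}

// CappedPool discards buffers whose capacity exceeds maxCap, bounding
// pooled memory -- a crude stand-in for bytebufferpool's calibration.
type CappedPool struct {
	inner  *SimplePool
	maxCap int
}

func NewCappedPool(maxCap int) *CappedPool {
	return &CappedPool{inner: NewSimplePool(), maxCap: maxCap}
}

func (c *CappedPool) Get() *[]byte { return c.inner.Get() }

func (c *CappedPool) Put(b *[]byte) {
	if cap(*b) > c.maxCap {
		return // let the GC reclaim oversized buffers
	}
	c.inner.Put(b)
}

func main() {
	var pool BufferPool = NewCappedPool(4096)
	b := pool.Get()
	*b = append(*b, "response body"...)
	fmt.Println(len(*b)) // 13
	pool.Put(b)
}
```

bytebufferpool's real calibration is more sophisticated (it tracks size-class call counts and recomputes the cap periodically), but a fixed cap captures the same memory-vs-speed trade-off being debated here.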

sixcolors commented 1 week ago

I think this level of optimization isn't ideal for user configuration—it’s something framework maintainers should handle. Perhaps @newacorn could help create a more comprehensive benchmark suite to better reflect real-world use cases and guide the decision?

newacorn commented 1 week ago

> I think this level of optimization isn't ideal for user configuration—it’s something framework maintainers should handle. Perhaps @newacorn could help create a more comprehensive benchmark suite to better reflect real-world use cases and guide the decision?

Whether or not to provide user configuration is optional, and I agree that such optimizations should be handled by the maintainers; these are just suggestions that no one has put into practice yet. I simply feel that switching to a plain sync.Pool does not match the semantics of the original implementation. Since sync.Pool has been consistently faster than bytebufferpool in every version of Go, with a consistent performance gap, why didn't the author use sync.Pool when designing fasthttp in the first place? @erikdubbelboer supports this proposal on the belief that Go has optimized sync.Pool in recent versions, but this change is unrelated to Go's runtime, since both implementations are built on sync.Pool. The original implementation was aimed at conserving memory. When we optimize, shouldn't we consider that original intent?

Sorry, I don't intend to provide such a benchmark, because this proposal and the original implementation have fundamentally different goals. Given that, what would a benchmark prove?

gaby commented 1 week ago

@newacorn The point is to discuss whether it makes sense to swap ByteBufferPool with sync.Pool. There's no pull request or any changes done.

I'm currently testing changing some of Fiber to use sync.Pool and I am getting 0 improvements.

newacorn commented 1 week ago

> @newacorn The point is to discuss whether it makes sense to swap ByteBufferPool with sync.Pool. There's no pull request or any changes done.
>
> I'm currently testing changing some of Fiber to use sync.Pool and I am getting 0 improvements.

Sorry, I looked at the wrong column and confused the issue with the pull request.

newacorn commented 1 week ago

@gaby Actually, your proposal can indeed improve performance, but it might increase memory usage in certain special scenarios. My intention was to suggest that if you implement this pull request, it would be even better to provide a configuration option to maintain compatibility with the original semantics, as different users might have different use cases. I’m not negating your proposal.

gaby commented 1 week ago

@newacorn I tried swapping ByteBufferPool for sync.Pool in three places in GoFiber and I get the same performance or worse.

This is probably related to what you mentioned before about buffers having different sizes.
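A tiny sketch of that effect: with a bare sync.Pool and no size cap, a single large body permanently grows the pooled buffer (`handle` and the sizes below are made up for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// A bare sync.Pool with no size cap on returned buffers.
var pool = sync.Pool{
	New: func() any { b := make([]byte, 0, 64); return &b },
}

// handle simulates serving a request body of bodySize bytes and
// reports the capacity of the buffer it returned to the pool.
func handle(bodySize int) int {
	bp := pool.Get().(*[]byte)
	buf := append((*bp)[:0], make([]byte, bodySize)...)
	*bp = buf
	pool.Put(bp) // no cap check: a huge buffer goes straight back in
	return cap(buf)
}

func main() {
	// One 10 MiB body grows the pooled buffer to >= 10 MiB; later
	// small requests may then keep that slice alive instead of a
	// 64-byte one. bytebufferpool's cap check is what prevents this.
	fmt.Println(handle(10<<20) >= 10<<20) // true
}
```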

gaby commented 1 week ago

The other option I can see is merging ByteBufferPool into fasthttp and doing fixes/updates here, given that the upstream ByteBufferPool doesn't get updates anymore. Thoughts? @erikdubbelboer