segmentio / stats

Go package for abstracting stats collection
https://godoc.org/github.com/segmentio/stats
MIT License

wait-free buffer #75

Closed achille-roussel closed 6 years ago

achille-roussel commented 6 years ago

This PR changes the approach taken to handle concurrent use of a stats buffer, getting rid of all locks to offer better throughput in highly concurrent environments. What motivated this change is the observation that on a high-traffic service, 50% of the time spent in locking/unlocking was on the mutexes in the stats.(*Buffer).HandleMeasures method (see profile attached).

The proposed solution is to use an array of buffers instead of a single buffer, with a lock on each buffer that goroutines have to acquire before writing metrics. They never block on the lock though: if a buffer is busy they simply move on and try the next one until one becomes available.

The pool of chunks that metrics were initially serialized into is gone as well, because sync.Pool has an internal mutex that gets hit frequently when the GC is configured to be more aggressive.
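
Roughly, the shape of the change looks like this. This is only a simplified sketch of the idea, not the code in this PR; the names (bufferPool, handleMeasure, etc.) are made up for illustration:

```go
// Simplified sketch: an array of buffers, each guarded by an atomic flag that
// writers acquire without ever blocking on it, and no sync.Pool of chunks.
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

type buffer struct {
	busy uint32 // 0 = free, 1 = acquired
	data []byte
}

type bufferPool struct {
	buffers []buffer
}

func newBufferPool() *bufferPool {
	// Pool size assumed here to be 2 x GOMAXPROCS (see discussion below).
	return &bufferPool{buffers: make([]buffer, 2*runtime.GOMAXPROCS(0))}
}

// acquire loops over the buffers and grabs the first one whose flag it can
// flip; it never waits on a busy buffer, it just tries the next one.
func (p *bufferPool) acquire() *buffer {
	for i := 0; ; i++ {
		b := &p.buffers[i%len(p.buffers)]
		if atomic.CompareAndSwapUint32(&b.busy, 0, 1) {
			return b
		}
	}
}

func (p *bufferPool) release(b *buffer) {
	atomic.StoreUint32(&b.busy, 0)
}

// handleMeasure serializes a measure directly into an acquired buffer,
// with no intermediate pooled chunk.
func (p *bufferPool) handleMeasure(m []byte) {
	b := p.acquire()
	b.data = append(b.data, m...)
	p.release(b)
}

func main() {
	p := newBufferPool()
	p.handleMeasure([]byte("request.count:1|c\n"))
	fmt.Println(len(p.buffers), "buffers in the pool")
}
```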

I'll be testing this in production before merging it but please take a look and let me know if anything should be changed.

[screenshot: profile attached, 2017-10-30 6:28 PM]
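
For reference, mutex contention can be profiled with the standard library; a minimal sketch of how such a profile might be gathered (the sampling rate and output path are arbitrary choices here, not necessarily what produced the profile above):

```go
// Collect a mutex contention profile and dump it for `go tool pprof mutex.out`.
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Record every mutex contention event (1 = sample everything).
	runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(0)

	// ... exercise the code under load here, e.g. stats.(*Buffer).HandleMeasures ...

	f, err := os.Create("mutex.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}
```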

achille-roussel commented 6 years ago

You’re right, not waiting can potentially result in longer busy-waits, with goroutines just looping trying to acquire buffers. In practice that would call for increasing the pool size. I’m not too worried about it: the default pool size of 2 x GOMAXPROCS should prevent this kind of scenario from occurring, since roughly half of the buffers should always be available to be grabbed. Profiles will tell what the right approach is; it’s really hard to prove which approach is best without exercising the code in real-world applications (which is why I’ll wait to have results before merging this; the benchmarks in this package show no difference between the versions, for example).

achille-roussel commented 6 years ago

Actually I wasn't running the right benchmark; we do see some differences:

benchmark                                                       old ns/op     new ns/op     delta
BenchmarkClient/write_a_batch_of_1_measures_to_a_client-4       171           81.5          -52.34%
BenchmarkClient/write_a_batch_of_10_measures_to_a_client-4      445           385           -13.48%
BenchmarkClient/write_a_batch_of_100_measures_to_a_client-4     3309          3367          +1.75%

In most cases the number of metrics in a batch is somewhere between 1 and 20, so we'll likely get better throughput with this change. The improvement is probably due to reduced contention, but the profiles don't really confirm it here: the block profiler picks up so much noise from synchronization inside the testing package that it shows nothing about the mutex in the buffer.

achille-roussel commented 6 years ago

Thanks for the reviews! I'm still testing this in production and it seems to be helping so far; I'll merge after running it for a little longer.

achille-roussel commented 6 years ago

Alright this has been running without issues for a few days, definitely no contention anymore on this code path.

I'll go ahead and merge!