uber-go / ratelimit

A Go blocking leaky-bucket rate limit implementation
MIT License

New atomic-based implementation squeezed into int64 #85

Closed: storozhukBM closed this pull request 2 years ago

storozhukBM commented 2 years ago

Hi everyone, it's been a few years since we introduced the atomic-based rate limiter. Since then, after numerous changes to the runtime scheduler, the mutex-based implementation has become much more stable. So I found a way to further improve the atomic-based implementation and squeeze its state into a single int64. The new implementation is much faster, stable under contention, and has zero allocations.
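To make the idea concrete, here is a rough, hypothetical sketch of the approach (not the PR's actual code, and it omits the slack handling the real limiter has): the limiter's entire state is one int64 holding the unix-nanosecond timestamp at which the next permit may be issued, advanced lock-free and allocation-free with CompareAndSwapInt64.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// int64Limiter is a simplified illustration: all state fits in one int64.
type int64Limiter struct {
	state      int64         // unix-nanos of the next permit; 0 means "never taken"
	perRequest time.Duration // minimum spacing between permits
}

func newInt64Limiter(ratePerSec int) *int64Limiter {
	return &int64Limiter{perRequest: time.Second / time.Duration(ratePerSec)}
}

// Take blocks until the next permit is due and returns its timestamp.
func (l *int64Limiter) Take() time.Time {
	for {
		now := time.Now().UnixNano()
		old := atomic.LoadInt64(&l.state)

		next := old + int64(l.perRequest)
		if old == 0 || next < now {
			next = now // first call, or we are behind schedule: issue immediately
		}

		if atomic.CompareAndSwapInt64(&l.state, old, next) {
			if wait := time.Duration(next - now); wait > 0 {
				time.Sleep(wait)
			}
			return time.Unix(0, next)
		}
		// CAS lost to another goroutine: retry with a fresh view of the state.
	}
}

func main() {
	rl := newInt64Limiter(100) // ~one permit every 10ms
	prev := rl.Take()
	for i := 0; i < 3; i++ {
		now := rl.Take()
		fmt.Println(i, now.Sub(prev)) // roughly 10ms apart
		prev = now
	}
}
```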

Benchmarks on an 8-core Intel i7 machine:

[chart 1: benchmark results]

Benchmarks on an 8-core M1 machine:

[chart 2: benchmark results]

codecov[bot] commented 2 years ago

Codecov Report

Merging #85 (bb12424) into main (bca0419) will decrease coverage by 1.01%. The diff coverage is 96.66%.

:exclamation: Current head bb12424 differs from pull request most recent head 5d22744. Consider uploading reports for the commit 5d22744 to get more accurate results

@@             Coverage Diff             @@
##              main      #85      +/-   ##
===========================================
- Coverage   100.00%   98.98%   -1.02%     
===========================================
  Files            3        4       +1     
  Lines           70       99      +29     
===========================================
+ Hits            70       98      +28     
- Misses           0        1       +1     
Impacted Files             Coverage Δ
limiter_atomic_int64.go    96.55% <96.55%> (ø)
ratelimit.go               100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

storozhukBM commented 2 years ago

@abhinav @rabbbit

Regarding the lowered coverage: the only lines in the new implementation that are not covered by tests require very high contention to get triggered; they only help the scheduler a bit and introduce no behavioral change. So I think it is OK to leave them uncovered. What do you think?

rabbbit commented 2 years ago

Thanks, this looks really cool!

I took a brief look and left some minor comments; I will need to take another pass to re-learn the old code.

I don't actually know/understand - could you explain why/how the new implementation is so much better than the previous atomic-based one? I imagined that storing/loading a pointer atomically would be approximately as fast as storing an int64. Is it just the overhead of loading an int vs a struct?

Also, do you know why the overhead for {1,2,3} goroutines is larger than 4+? This seems counterintuitive.

storozhukBM commented 2 years ago

@rabbbit

> I don't actually know/understand - could you explain why/how the new implementation is so much better than the previous atomic-based one? I imagined that storing/loading a pointer atomically would be approximately as fast as storing an int64. Is it just the overhead of loading an int vs a struct?

Yes, the main overhead is due to allocation. We basically need to allocate a state struct on the heap and then set a pointer to it using CAS. With only an int64 there are no allocations, and the GC is not bothered with all those small objects.
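For contrast, a simplified, hypothetical sketch of that older pointer-based approach (illustrative names, not the repository's exact code, and without the slack bookkeeping): every Take builds a fresh state struct that escapes to the heap and swaps a pointer to it with CompareAndSwapPointer, so each call allocates and creates GC work.

```go
package ratelimitsketch

import (
	"sync/atomic"
	"time"
	"unsafe"
)

// state is the per-Take snapshot that the pointer-based limiter allocates.
type state struct {
	last     time.Time
	sleepFor time.Duration
}

type atomicLimiter struct {
	state      unsafe.Pointer // *state, replaced wholesale on every Take
	perRequest time.Duration
}

func newAtomicLimiter(ratePerSec int) *atomicLimiter {
	return &atomicLimiter{perRequest: time.Second / time.Duration(ratePerSec)}
}

func (l *atomicLimiter) Take() time.Time {
	for {
		previous := (*state)(atomic.LoadPointer(&l.state))

		// A new state escapes to the heap on every attempt; with the int64
		// packing this allocation disappears entirely.
		next := &state{last: time.Now()}
		if previous != nil {
			next.sleepFor = l.perRequest - next.last.Sub(previous.last)
			if next.sleepFor > 0 {
				next.last = next.last.Add(next.sleepFor)
			}
		}

		if atomic.CompareAndSwapPointer(&l.state, unsafe.Pointer(previous), unsafe.Pointer(next)) {
			if next.sleepFor > 0 {
				time.Sleep(next.sleepFor)
			}
			return next.last
		}
	}
}
```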

> Also, do you know why the overhead for {1,2,3} goroutines is larger than 4+? This seems counterintuitive.

This is just an artifact of how we measure time per operation. We take testing.B.N as the number of permissions we should get from the rate limiter and distribute that amount across ng goroutines (ng = number of goroutines). In the end, we measure the time.Duration of how long it took to Take all testing.B.N permissions, then divide that duration by testing.B.N to get ns/op.

So with atomic_int64 the overhead of synchronization is so small that 1, 2, or 4 goroutines basically can't progress through testing.B.N quickly enough, so their combined throughput is smaller than with 8 goroutines.

To demonstrate this, I implemented a slightly different way to measure time: instead of measuring how quickly the goroutines together finish testing.B.N, I measured the average time of a single Take(). In that graph you don't see such artifacts (a rough sketch of both measurement styles follows the screenshot below).

[screenshot: benchmark chart using the average Take() time measurement]
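For reference, a rough sketch of the two measurement styles described above (simplified, not the repository's actual benchmark harness; the limiter rate of 1,000,000 permits/sec is an arbitrary choice so the benchmark is not dominated by pacing): the first derives ns/op from the wall time all goroutines need to consume b.N permits, the second averages the latency of individual Take calls.

```go
package ratelimit_test

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"

	"go.uber.org/ratelimit"
)

// Throughput style: split b.N permits across ng goroutines and divide the
// total wall time by b.N. When the per-Take synchronization cost is tiny,
// a few goroutines simply cannot pull permits fast enough, which is why
// 1-3 goroutines can look "slower" than 4+ under this metric.
func BenchmarkThroughputStyle(b *testing.B) {
	rl := ratelimit.New(1000000)
	const ng = 8

	b.ResetTimer()
	start := time.Now()
	var wg sync.WaitGroup
	for g := 0; g < ng; g++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			for i := 0; i < n; i++ {
				rl.Take()
			}
		}(b.N / ng)
	}
	wg.Wait()
	b.ReportMetric(float64(time.Since(start).Nanoseconds())/float64(b.N), "ns/permit")
}

// Per-call style: time each Take individually and report the average, which
// avoids the artifact above because slow-to-start goroutines no longer
// dilute the metric.
func BenchmarkPerCallStyle(b *testing.B) {
	rl := ratelimit.New(1000000)
	var totalNanos, calls int64

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			start := time.Now()
			rl.Take()
			atomic.AddInt64(&totalNanos, time.Since(start).Nanoseconds())
			atomic.AddInt64(&calls, 1)
		}
	})
	b.ReportMetric(float64(totalNanos)/float64(calls), "avg-ns/take")
}
```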
storozhukBM commented 2 years ago

@abhinav @rabbbit For some reason, tests fail on CI for the atomic rate limiter implementation that is currently in master. This also happened on another PR, https://github.com/uber-go/ratelimit/pull/86, but I don't get such failures locally. Can you check on your machines?

rabbbit commented 2 years ago

> @abhinav @rabbbit For some reason, tests fail on CI for the atomic rate limiter implementation that is currently in master. This also happened on another PR, #86, but I don't get such failures locally. Can you check on your machines?

Yeah, we've seen that in the past but didn't debug it (we don't make many changes to this repo). It seems to happen mostly, but not exclusively, during codecov generation.

> So with atomic_int64 the overhead of synchronization is so small that 1, 2, or 4 goroutines basically can't progress through testing.B.N quickly enough, so their combined throughput is smaller than with 8 goroutines.

Another surprising part was that on your i7 the 1, 2, and 4 goroutine cases were slower, but it was reversed on the M1 machine. Magic.

> To demonstrate this, I implemented a slightly different way to measure time: instead of measuring how quickly the goroutines together finish testing.B.N, I measured the average time of a single Take(). In that graph you don't see such artifacts.

Nice. The limiter seems to be significantly slower with more goroutines here, presumably because the time calculation is more expensive here?

Thanks for doing this, and sorry for all the nits/questions :)

Separately, this PR probably resolves #22 - we see that the mutex was performing better than the old atomic version.

storozhukBM commented 2 years ago

> Separately, this PR probably resolves https://github.com/uber-go/ratelimit/issues/22 - we see that the mutex was performing better than the old atomic version.

That was not the case on Go 1.15 or 1.16, but I think it is now.

storozhukBM commented 2 years ago

> Another surprising part was that on your i7 the 1, 2, and 4 goroutine cases were slower, but it was reversed on the M1 machine. Magic.

The M1 has 4 slow cores and 4 fast cores. I think the 4 fast cores start working first, and when the 4 slow ones join the party they only get in the way. If you want, I can run the techniques_comp benches on the M1 to see the whole picture.

storozhukBM commented 2 years ago

@rabbbit It looks like our clock mocking library is super old and archived by its author. What do you say to replacing it with a drop-in replacement that is actively maintained: https://github.com/benbjohnson/clock?
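For illustration, a minimal sketch of the proposed replacement in a test (only clock.NewMock, Add, and Now from github.com/benbjohnson/clock's documented API are used here; wiring the mock into the limiter itself would go through whatever clock option the library already exposes):

```go
package ratelimit_test

import (
	"testing"
	"time"

	"github.com/benbjohnson/clock"
)

// The mock clock starts at a fixed instant and only moves when the test
// calls Add, which keeps time-dependent limiter tests deterministic.
func TestMockClockAdvances(t *testing.T) {
	mock := clock.NewMock()
	start := mock.Now()

	mock.Add(250 * time.Millisecond) // advance the fake clock by hand

	if got := mock.Now().Sub(start); got != 250*time.Millisecond {
		t.Fatalf("expected the mock clock to advance by 250ms, got %v", got)
	}
}
```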