mailgun / gubernator

High Performance Rate Limiting MicroService and Library
Apache License 2.0

MegaFix global behavior bugs. #225

Closed by Baliedge 7 months ago

Baliedge commented 8 months ago

Every call to GetRateLimits would reset ResetTime but not the Remaining counter. As a result, counters would eventually deplete and never fully reset. The solution involved fixing two issues:

Fix race condition in QueueUpdate() used by peers to propagate updates to rate limits that it owns.

Fix inconsistency with over limit status when calling GetRateLimits on a non-owner peer with global behavior.

Optimize calls to GetRateLimits with zero hits to not trigger any global updates because nothing changed.

Add rigorous functional tests around global behavior to verify full peer-to-peer propagation after a call to GetRateLimits.

Fix double counting of metric gubernator_over_limit_counter, which was incremented on both non-owner and owner peers. Only count on the owner peer.

Fix double counting of metric gubernator_getratelimit_counter. When a non-owner uses Global behavior to process a request, it no longer increments the counter. Once the non-owner sends its global update to the owner, the owner increments the counter. This makes the counter an accurate count of rate limits checked.

Remove redundant metric gubernator_broadcast_counter. Use gubernator_broadcast_duration_count instead.

Fix an intermittent test error related to TestHealthCheck that caused the next test to fail: the Gubernator services were restarted and weren't always ready in time to accept requests.
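
To illustrate the first and third items above (guarding a shared update queue against concurrent writers, and skipping zero-hit requests), here is a minimal Go sketch. It is not Gubernator's actual QueueUpdate() implementation: the `update` type, the `updateQueue` type, and its `Queue` and `Flush` methods are hypothetical names invented for this example, and summing hits per key is only an illustrative choice.

```go
package main

import (
	"fmt"
	"sync"
)

// update is a hypothetical stand-in for a rate limit change that must be
// propagated to peers. It is not Gubernator's real request type.
type update struct {
	Key  string
	Hits int64
}

// updateQueue collects pending updates keyed by rate limit name.
// Every access to the map goes through mu, so concurrent callers cannot race.
type updateQueue struct {
	mu      sync.Mutex
	pending map[string]update
}

func newUpdateQueue() *updateQueue {
	return &updateQueue{pending: make(map[string]update)}
}

// Queue records an update for a later broadcast and reports whether anything
// was queued. A zero-hit request changes nothing, so it is skipped.
func (q *updateQueue) Queue(u update) bool {
	if u.Hits == 0 {
		return false
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	// Coalesce with any update already pending for the same key.
	u.Hits += q.pending[u.Key].Hits
	q.pending[u.Key] = u
	return true
}

// Flush swaps out the pending set under the lock and returns it, so updates
// queued during a broadcast land in the next batch instead of being lost.
func (q *updateQueue) Flush() map[string]update {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := q.pending
	q.pending = make(map[string]update)
	return out
}

func main() {
	q := newUpdateQueue()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			q.Queue(update{Key: "account:1234", Hits: 1})
			q.Queue(update{Key: "account:1234", Hits: 0}) // ignored
		}()
	}
	wg.Wait()
	batch := q.Flush()
	fmt.Println(len(batch), batch["account:1234"].Hits) // prints "1 100"
}
```

Running the sketch with `go run -race` should report no data races, since the map is only touched while the mutex is held.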

Baliedge commented 8 months ago

LGTM. We should do something about the global code in v3. With all the counters used for functional tests, I feel like it would be simpler to just add a new API call that returns the current status of broadcasts and updates. Thoughts?

Agreed on both.