Closed Baliedge closed 7 months ago
LGTM, We should do something about the global code in v3. With all the counters used for functional tests I feel like it would be simpler to just add a new API call which gets the current status of broadcasts and updates. Thoughts?
Agreed on both.
Every call to
GetRateLimits
would reset theResetTime
and not theRemaining
counter. This would cause counters to eventually deplete and never fully reset. The solution involved fixing two issues:Duration
value was never properly propagated in global behavior. This was added to the global broadcast logic.UpdatePeerGlobals
during a global broadcast but neglected to propagateDuration
.Duration
to zero and trigger a reset of theResetTime
. This code path does not reset theRemaining
counter because it's meant for cases where an existing rate limit had been extended or abbreviated in duration.CacheItem
struct and logic in algorithms.go would ignore it, causing it to short circuit around the logic that checksDuration
. Once the data type was corrected, theDuration
bug was revealed.ResetTime
generated by the owning and non-owning peers did not always match exactly.ResetTime
in multiple places based onclock.Now()
.ResetTime
doesn't change due to the above bug.GetRateLimits()
will set arequestTime
and pass it around so that any date/time computation to setResetTime
will always use the same base value instead ofclock.Now()
.Fix race condition in
QueueUpdate()
used by peers to propagate updates to rate limits that it owns.Remaining
counter. So, if the same key were updated multiple times it may get added in non-chronological order. The last update wins, potentially passing a staleRemaining
count, thereby dropping hits already applied.QueueUpdates()
. Then, when the timer calls to propagate the update, get the current ratelimit state of each queued update just before sending to the peers.Fix inconsistency with over limit status when calling
GetRateLimits
on a non-owner peer with global behavior.UNDER_LIMIT
no matter how many hits were applied.Optimize calls to
GetRateLimits
with zero hits to not trigger any global updates because nothing changed.Add rigorous functional tests around global behavior to verify full peer-to-peer propagation after a call to
GetRateLimits
.Fix doublecounting of metric
gubernator_over_limit_counter
on both non-owner and owner peers. Only count on owner peer.Fix metric doublecounting of
gubernator_getratelimit_counter
. When a non-owner uses Global behavior to process a request, do not increment the counter. After it global sends to the owner, the owner will increment the counter. This counter shall be the accurate count of rate limits checked.Remove redundant metric
gubernator_broadcast_counter
. Usegubernator_broadcast_duration_count
instead.Fix intermittent test error related to
TestHealthCheck
that causes the next test to fail because the Gubernator services were restarted and aren't always ready in time to accept requests.