mailgun / gubernator

High Performance Rate Limiting MicroService and Library
Apache License 2.0

Fix global behavior #216

Closed thrawn01 closed 9 months ago

thrawn01 commented 9 months ago

Over Limit Behavior

The way it's supposed to work is that the owner of the rate limit will broadcast the computed result of the rate limit to other peers, and that is what will be returned to clients who submit hits to a peer. The owner of the rate limit should be setting the OVER_LIMIT status before broadcasting.

However, we don't store OVER_LIMIT in the cache (see this comment) because it is expected that every time the functions in algorithms.go are called, OVER_LIMIT will be recalculated and set properly. Again, because of this, the owning peer must perform those calculations and set the OVER_LIMIT status before broadcasting to other peers.
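To make the intended flow concrete, here is a minimal sketch. The types and function names are hypothetical stand-ins (the real RateLimitResp comes from gubernator's generated protobuf code); the point is only that the owner runs the algorithm, sets OVER_LIMIT, and broadcasts the *computed* response, which peers then return verbatim:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for gubernator's protobuf types.
type Status int

const (
	UnderLimit Status = iota
	OverLimit
)

type RateLimitResp struct {
	Status    Status
	Limit     int64
	Remaining int64
}

// ownerBroadcast sketches the intended flow: the owning peer computes the
// result (including OVER_LIMIT) and broadcasts it; peers never re-run the
// algorithm, they just cache and return what the owner sent.
func ownerBroadcast(limit, hits int64, peers []chan RateLimitResp) RateLimitResp {
	resp := RateLimitResp{Status: UnderLimit, Limit: limit, Remaining: limit - hits}
	if resp.Remaining < 0 {
		resp.Remaining = 0
		resp.Status = OverLimit // set BEFORE broadcasting
	}
	for _, p := range peers {
		p <- resp // peers store this response verbatim
	}
	return resp
}

func main() {
	peer := make(chan RateLimitResp, 1)
	resp := ownerBroadcast(10, 12, []chan RateLimitResp{peer})
	fmt.Println(resp.Status == OverLimit, (<-peer).Status == OverLimit)
}
```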

As these wonderful contributors (@miparnisari and @philipgough) have noticed, it is currently not doing this! As @miparnisari pointed out, I do not know why this wasn't caught before now. (Probably because of the lack of testing around the global functionality 😥)

At one time, I had envisioned that the broadcaster would only send hits to the peers, and each peer would then calculate the correct OVER_LIMIT status based on the number of hits received from the owning peer. However, that plan would have required more logic in the peers to handle things like RESET_REMAINING, negative hits, etc. Instead, I decided the owning peer would broadcast a RateLimitResp to each peer, which in turn would be returned to the client as the most up-to-date status of the rate limit.

It looks like I goofed when I made that transition and decided to call getLocalRateLimit() for each queued broadcast. As a consequence, I had to set rl.Hits = 0, which short-circuits the OVER_LIMIT calculation, so the broadcasted responses never have the proper OVER_LIMIT status set. I think the correct way to handle this is to have the broadcaster queue the already-computed responses and send them without recalculating the current status before broadcasting to peers. (Included in this PR)
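A rough sketch of that fix, with hypothetical names (the real broadcaster and getLocalRateLimit() live in gubernator's global.go and gubernator.go): the broadcaster stores the responses exactly as the algorithm computed them, and the flush path sends them as-is rather than making a second call into algorithms.go with rl.Hits = 0.

```go
package main

import "fmt"

// Hypothetical stand-in for the broadcast payload.
type RateLimitResp struct {
	Key    string
	Status string
}

// Broadcaster sketches the proposed fix: queue computed responses and
// send them verbatim, instead of recalculating per queued broadcast.
type Broadcaster struct {
	queue []RateLimitResp
}

// QueueResponse stores the response exactly as the algorithm computed it.
func (b *Broadcaster) QueueResponse(r RateLimitResp) {
	b.queue = append(b.queue, r)
}

// Flush sends queued responses to peers without recalculating; there is
// deliberately no second call into the rate limit algorithm here.
func (b *Broadcaster) Flush(send func(RateLimitResp)) {
	for _, r := range b.queue {
		send(r)
	}
	b.queue = nil
}

func main() {
	var b Broadcaster
	b.QueueResponse(RateLimitResp{Key: "login_attempts", Status: "OVER_LIMIT"})
	b.Flush(func(r RateLimitResp) { fmt.Println(r.Key, r.Status) })
}
```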

Also, I noted during testing that when using TOKEN_BUCKET, the non-owning peer would respond with UNDER_LIMIT on the third hit. (Which is kind of correct, because it can only respond with what the owner has broadcast to it, but it isn't what most users would expect from TOKEN_BUCKET.) So I added some code for the peer to return OVER_LIMIT if the broadcast response has a remaining count of 0.
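In sketch form (names are illustrative, not the actual code in this PR): when serving a hit against a TOKEN_BUCKET limit it doesn't own, the peer answers with the owner's last broadcast, except that a broadcast showing 0 remaining is reported as OVER_LIMIT rather than the stale UNDER_LIMIT status.

```go
package main

import "fmt"

// Hypothetical stand-in for the cached broadcast response.
type RateLimitResp struct {
	Status    string
	Remaining int64
}

// peerAnswer sketches the TOKEN_BUCKET tweak: if the owner's broadcast
// shows an empty bucket, report the next hit as OVER_LIMIT even though
// the broadcast itself still says UNDER_LIMIT.
func peerAnswer(broadcast RateLimitResp) RateLimitResp {
	if broadcast.Remaining == 0 {
		broadcast.Status = "OVER_LIMIT"
	}
	return broadcast
}

func main() {
	// Owner broadcast after the 2nd of 2 allowed hits: still UNDER_LIMIT,
	// but the bucket is empty, so the peer reports the 3rd hit as over.
	fmt.Println(peerAnswer(RateLimitResp{Status: "UNDER_LIMIT", Remaining: 0}).Status)
}
```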

I think this works for TOKEN_BUCKET because the expiration time on the owner and the peer is the same, such that when the bucket expires, both the peer and the owner will correctly respond with 0 hits at the same time. However, for LEAKY_BUCKET the peer doesn't really know which tokens have leaked since we last checked.... I guess we could have the peer calculate the leak... but that sounds icky... Maybe someone can convince me otherwise?
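For the record, here is roughly what the peer *would* have to do for LEAKY_BUCKET — re-derive the drain from the elapsed time since the owner's broadcast. This is a hypothetical sketch (field names and the integer leak-rate math are assumptions, not gubernator's actual API), and it illustrates the extra algorithm logic that would leak into the peer:

```go
package main

import "fmt"

// remainingAfterLeak sketches the peer-side calculation being debated:
// given the owner's broadcast, estimate how many tokens have drained
// since, assuming `limit` tokens leak evenly over `durationMs`.
// All names and units here are illustrative.
func remainingAfterLeak(broadcastRemaining, limit, durationMs, broadcastAtMs, nowMs int64) int64 {
	elapsed := nowMs - broadcastAtMs
	leaked := (elapsed * limit) / durationMs // tokens drained since broadcast
	r := broadcastRemaining + leaked
	if r > limit {
		r = limit // bucket can't hold more than its capacity
	}
	return r
}

func main() {
	// Limit of 10 per 10s; the broadcast said 2 remaining, 3s ago.
	fmt.Println(remainingAfterLeak(2, 10, 10_000, 0, 3_000))
}
```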

philipgough commented 9 months ago

I guess we could have the peer calculate the leak... but that sounds icky... Maybe someone can convince me otherwise?

I agree that sounds like a bit too much logic leaking into the peer.

@thrawn01 thanks for taking this up. The changes make sense and LGTM, as best as I can understand the flow.

thrawn01 commented 9 months ago

This PR is replaced by #219