viafintech / gcra-ruby

Generic cell rate algorithm (leaky bucket) implementation for rate limiting
MIT License

Add method to RateLimiter to mark a key as overflowed #4

Closed brasic closed 5 years ago

brasic commented 5 years ago

At ShippingEasy we sometimes use gcra-ruby in a way that differs slightly from its originally intended use case of rate limiting an API server. This use case benefits from an additional method on GCRA::RateLimiter that immediately exhausts all of the limiter's remaining quota.

We integrate with many providers that use the leaky-bucket API rate limiting scheme that this gem implements. Sometimes we make API requests from more than one of our servers at the same time. If the provider has particularly low rate limits (for example, Amazon MWS), a simple retry-on-quota-exceeded strategy can quickly become unworkable.

This gem happens to work nearly perfectly for this alternate use case: rather than implementing a canonical limiter we are incidentally mirroring the state of another party's limiter. An example looks something like:

class ClientThrottle
  def initialize(scope, store, rate, quota)
    @scope = scope
    @limiter = GCRA::RateLimiter.new(store, rate, quota)
  end

  # Block until the mirrored limiter has quota available, then yield.
  def run(request_count: 1)
    loop do
      exceeded, info = @limiter.limit(@scope, request_count)
      if exceeded
        # Some consumer (possibly on another server) has used up the quota;
        # wait until the bucket has drained enough for this request.
        sleep(info.retry_after)
      else
        break
      end
    end
    yield
  end
end

api_client = [...]
throttle = ClientThrottle.new(api_client.derive_scope, [...])
# can run simultaneously on many servers without violating the rate limit
api_result = throttle.run { api_client.call }

The main downside to this approach is that since we are mirroring another limiter's state, we cannot know with 100% certainty that we are synchronized with it. In practice, things mainly break down when multiple clients use the same resources from machines that do not share a Redis instance (e.g. a staging server and a production server hitting the same API). In this scenario, a client may receive a response from the provider indicating that it is rate limited even though our limiter allowed the request. When this happens, the mirrored state was incorrect, and it is desirable to immediately mark the resource as out of quota so that the mirror more accurately reflects the canonical state and the next call to limit returns a correct retry_after value.
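To make the intent concrete, here is a sketch of how we would use the proposed method from the ClientThrottle example above. The method name mark_overflowed comes from this PR; its exact signature, and the ApiClient::RateLimited error class, are illustrative assumptions:

class ClientThrottle
  # (initialize and #run as above)

  # The provider said "no" even though our mirrored state said "go": exhaust
  # the local quota so the next call to #limit returns an accurate retry_after.
  def mark_overflowed
    @limiter.mark_overflowed(@scope)
  end
end

begin
  api_result = throttle.run { api_client.call }
rescue ApiClient::RateLimited  # hypothetical error class raised on HTTP 429
  throttle.mark_overflowed
  retry                        # a real caller would cap these retries
end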

brasic commented 5 years ago

@tobischo we currently maintain a fork for our own use. This PR and https://github.com/Barzahlen/gcra-ruby/pull/3 are the only differences from upstream. I realize the changes here aren't necessarily useful if you don't have our particular use case, but it would be nice if we could switch back to the official version of this gem 😄.

tobischo commented 5 years ago

What is the exact benefit of mirroring the rate limit state? Is it that you avoid the round trip to the target API before you can abort a second request, which might for some reason be faster? Wouldn't you still have instances competing for the same resource, just on your side instead of the API side?

brasic commented 5 years ago

@tobischo the benefit is that putting competing API consumers behind a shared limiter results in predictable fair access to the underlying limited API endpoint.

Some alternatives for consuming a heavily rate limited API are:

  1. Retry after sleeping when a request fails due to rate limit bucket overflow.
  2. Maintain a process-local leaky-bucket model inside the client to proactively respect the API limit and slow the request rate once the local bucket nears its capacity.
  3. Ignore the burst quota; each client always spaces its requests by rate_period.

A combination of 1 and 2 was our status quo before we started using this gem. None of these approaches works well in the face of multiple API clients competing for a single resource. The issue with 1 is that for very stingy rate limits like Amazon MWS's, some consumers end up failing and retrying indefinitely. The issue with 2 is that the model is inaccurate without knowledge of other consumers, so in practice 2 also needs a facility to retry. Since it's foolish for any operation to retry an indefinite number of times, we need to impose a cap on the maximum number of retries. For heavily limited APIs we had many clients using both approaches 1 and 2 that would frequently exceed our maximum retry count.
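A minimal sketch of approach 1, just to make the contrast concrete (MAX_RETRIES, ApiClient::RateLimited, and the retry_after reader on the error are all illustrative, not part of this gem):

MAX_RETRIES = 5

# Approach 1: sleep and retry blindly, with a cap. With many clients and a
# stingy limit, clients routinely blow through MAX_RETRIES because none of
# them knows the shared bucket state.
def call_with_retries(api_client)
  attempts = 0
  begin
    api_client.call
  rescue ApiClient::RateLimited => e
    attempts += 1
    raise if attempts > MAX_RETRIES
    sleep(e.retry_after || 2**attempts) # back off blindly if no hint is given
    retry
  end
end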

You're correct that the approach above simply changes the location of contention from the API itself to the limiter. But in practice, contending against the limiter results in a fairer and more predictable distribution of wait times per client. When a client wants to make a request against an exhausted resource and the limiter tells it to sleep for a given duration and check again, it has a much better chance of successfully making its request than it would if it waited an indeterminate amount of time with no or incomplete knowledge of the leaky-bucket state. I've extensively simulated this (and observed it in production): using the limiter among a large number of clients results in very close to the ideal request spacing against the underlying API, whereas a retry approach ends up with a bunch of clients awkwardly bumping into each other and many clients waiting far longer than their "fair" wait time.
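For what it's worth, the effect is easy to observe with a toy setup like the one below. It is purely illustrative: it reuses the ClientThrottle class from the example above, assumes a local Redis, and picks arbitrary limit values.

require 'redis'
require 'gcra/rate_limiter'
require 'gcra/redis_store'

store = GCRA::RedisStore.new(Redis.new, 'throttle-demo')

threads = Array.new(5) do |i|
  Thread.new do
    # 5 "clients" sharing a mirrored limit of one request per 6 seconds.
    throttle = ClientThrottle.new('provider-x', store, 6.0, 0)
    3.times { throttle.run { puts format('%.1f client %d', Time.now.to_f, i) } }
  end
end
threads.each(&:join)
# Expectation (per the claim above): the printed timestamps come out spaced
# roughly rate_period apart, instead of clients clustering, failing, and
# backing off against the real API.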

As I said, I realize that this functionality is somewhat niche compared with the intended use case of the gem, but it is very useful to us. If you have any ideas for a less invasive way to achieve this functionality I'd be happy to discuss them.

joshuaflanagan commented 5 years ago

When a system uses an approach like gcra-ruby to protect a resource (API), they may or may not publish the rate limiting specifications.

If the rate limits are not documented or shared, then consumers have no choice but to blindly "sleep and retry". This can be undesirable for the resource owner (the consumer retries too often) or for the consumer (the consumer sleeps too long before retrying again).

If the rate limits are documented, the expectation is that consumers will use those specifications to control how often they attempt to use the resource. In that case, the consumer needs some way of keeping track of how many requests have already been used within a time period. When there are multiple consumer processes in play, they need to keep that state in a shared location. That led us to storing the shared state in Redis, using this gem.

For example, if an API is documented to allow only 10 requests per minute, then after we (collectively, across all of our processes) make 10 requests, we know it is time to sleep until a minute has passed since the first request. Once that minute has passed, one of our consumers can safely make another request. If the limits were not documented, each of our processes would immediately make that 11th request, all would get failures (like an HTTP 429 status), and all would sleep and retry after some amount of time.
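To make that concrete, here is a minimal sketch of how such a documented limit could be mirrored with this gem. The key prefix, key, and the choice of max_burst are illustrative; in particular, max_burst = 9 assumes the provider lets the full 10-request window be consumed as a burst.

require 'redis'
require 'gcra/rate_limiter'
require 'gcra/redis_store'

store = GCRA::RedisStore.new(Redis.new, 'provider-mirror')

# Documented limit: 10 requests per minute.
rate_period = 60.0 / 10 # one request "leaks" out every 6 seconds
max_burst   = 9         # plus the base request, up to 10 back to back

limiter = GCRA::RateLimiter.new(store, rate_period, max_burst)

exceeded, info = limiter.limit('provider-x', 1)
sleep(info.retry_after) if exceeded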

Do you think that consumers should always blindly sleep and retry, even with documented rate limit specifications? Or should they use the documented rate limits to control how often they make calls?

We think consumers should control how often they make calls, based on the documented rate limits. But we recognize that it is possible for the mirrored state (gcra) to be out of sync with the protected resource's state, so we need a mechanism to tell it that the requests have been exhausted (when we get an HTTP 429 from the resource), even if gcra doesn't think it should be. Hence our need for mark_overflowed.

tobischo commented 5 years ago

I think the single proper source of truth for how many requests can still be made, or when the next one may be made after the resource was exhausted, is the resource provider. The resource provider should properly advertise those limits both in the documentation and in the API responses. Consumers should then use the limits from the API responses to plan the next possible execution.

If you are allowed 10 requests per minute and you have 10 clients, and the entire allowance resets at the same time, then each of those 10 clients could retry immediately if it knew that the bucket resets after time x. If the server instead uses a leaky-bucket variant where only one request is allowed at that moment and the others are spaced out 6 seconds apart, then you would have 10 clients competing for that one request: it would go through, but the other 9 would fail. Depending on how long a request actually takes, the successful client would be competing again for the next slot 6 seconds later. So with 10 clients running like that, the majority of requests would always be failing.

In that case it costs resources to run 10 clients, and the main reason to do so would be if it were essential to operate at the limit of the API even if one or more of the clients fails. Using something like gcra to mirror the state would then mainly let you avoid actually making the doomed request and at least save those resources, but the clients would spend most of their time sitting there and waiting. That does assume, however, that the API rate limits properly and does not penalize API users for running into the rate limit (i.e. count the request even if it was rejected).
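A quick back-of-the-envelope for the scenario described above (numbers purely illustrative):

requests_per_minute = 10
clients             = 10

slot_interval = 60.0 / requests_per_minute    # => 6.0 seconds between free slots

# Without shared state, every free slot is contested by all waiting clients,
# so roughly (clients - 1) out of every `clients` attempts per slot fail:
wasted_fraction = (clients - 1).fdiv(clients) # => 0.9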

So the use cases I can see are:

For most of these cases, a solution would probably be not to have a pool of clients competing for a resource whose limit is smaller than the number of clients.

Use cases for marking a resource as exhausted:

As maintainers the questions are always:

I will take another look at the code tomorrow, but in general I do not object to having such a feature merged.

brasic commented 5 years ago

Thanks for the thoughtful review!