Closed jamesmunns closed 3 weeks ago
You may also be interested in looking into Introduction to Traffic Shaping Using HAProxy, which has similar features to nginx.
Traffic shaping is available since https://www.haproxy.com/blog/announcing-haproxy-2-7
@git001 thanks for the pointer, I'll check it out!
An additional detail that surfaced during discussion is that there are two main "orientations" for rate limiting: limiting based on the downstream source of a request, and limiting based on the upstream destination it is routed to.
This could maybe be implemented in a way that is transparent: if we use the proposed formatting key for the matching, each rule could be matched against independently. For example, if we have three "rules":

- `src` -> the source IP of the downstream requestor
- `uri` -> the request URI path
- `dst` -> the selected upstream path

In this case, I think that all connections would need to obtain a token from ALL THREE to proceed.
So let's say that downstream `10.0.0.1` requests `/api/whatever`, and `192.0.0.1` is selected as the upstream.
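As a sketch of what key generation for those three rules might look like (all type and field names here are hypothetical, not settled API):

```rust
use std::net::Ipv4Addr;

// Hypothetical key type mirroring the enum sketched later in this thread.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Key {
    Ipv4(Ipv4Addr),
    Format(String),
}

// Hypothetical per-request context; field names are assumptions.
struct RequestContext {
    src_ip: Ipv4Addr,
    uri_path: String,
    upstream_ip: Ipv4Addr,
}

// One key per rule: src, uri, and dst. A token must be obtained from the
// limiter behind each of these keys for the request to proceed.
fn generate_keys(ctx: &RequestContext) -> Vec<Key> {
    vec![
        Key::Ipv4(ctx.src_ip),             // src rule
        Key::Format(ctx.uri_path.clone()), // uri rule
        Key::Ipv4(ctx.upstream_ip),        // dst rule
    ]
}
```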
For each of these keys, we'd need to look up (or create) the corresponding limiter, then attempt to obtain a token from it.
My initial thought for this is to use something like the `leaky_bucket` crate, and place the rate limiting table in something like:
```rust
// Things that can be rate limiting keys
enum Key {
    Ipv4(Ipv4Addr),
    Format(String),
    // ...
}

struct LimiterPayload {
    // Some kind of record of last use? For culling?
    last_used: AtomicU64,
    // The actual rate limiter
    limiter: RateLimiter,
}

struct Context {
    limiter_map: RwLock<BTreeMap<Key, Arc<LimiterPayload>>>,
}
```
The behavior would have to be something like:
```rust
let mut limiters = vec![];
for rule in self.rules.iter() {
    let key = rule.generate_key(&request_context);
    // This probably could optimistically get a read lock, then upgrade
    // to a write lock if the key does not exist
    let limiter = service_context.get_or_create(key);
    limiters.push(limiter);
}

let mut pending_tokens = vec![];
for limiter in limiters {
    match limiter.get_token() {
        Ok(None) => {} // immediately allowed
        Ok(Some(fut)) => {
            // Not immediately allowed
            pending_tokens.push(fut);
        }
        Err(_e) => {
            // too full, immediately return error
            return Err(400);
        }
    }
}

// futures unordered or something
pending_tokens.join_all().await;
Ok(...)
```
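The "optimistically get a read lock, then upgrade" step could be sketched roughly like this (the payload type is a placeholder here, and the key is simplified to a `String` for illustration):

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

// Placeholder for the LimiterPayload sketched above.
#[derive(Default)]
struct LimiterPayload;

#[derive(Default)]
struct Context {
    limiter_map: RwLock<BTreeMap<String, Arc<LimiterPayload>>>,
}

impl Context {
    // Take a cheap read lock first; fall back to the write lock only
    // when the key is missing.
    fn get_or_create(&self, key: &str) -> Arc<LimiterPayload> {
        if let Some(existing) = self.limiter_map.read().unwrap().get(key) {
            return existing.clone();
        }
        let mut map = self.limiter_map.write().unwrap();
        // Another task may have inserted the key between dropping the
        // read lock and taking the write lock, so use entry() here.
        map.entry(key.to_string())
            .or_insert_with(|| Arc::new(LimiterPayload::default()))
            .clone()
    }
}
```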
But this is immediately problematic:

- The `leaky_bucket` crate has no concept of "immediate fail" - it always enqueues all requests. This could be okay, but it's not what I wrote above.
- It also feels like a lot of locking and allocation. This is probably unavoidable, but it makes me a little nervous about being a DoS vector.
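For reference, the "immediate fail" semantics I have in mind would look something like this hand-rolled token bucket (this is NOT the `leaky_bucket` API; the names and the refill policy here are assumptions for illustration):

```rust
// A bucket that fails fast instead of queueing the caller.
struct FailFastBucket {
    tokens: u64,
    capacity: u64,
}

impl FailFastBucket {
    fn new(capacity: u64) -> Self {
        Self { tokens: capacity, capacity }
    }

    // Ok(()) if a token was available, Err(()) otherwise - the caller
    // can map Err to an immediate error response instead of queueing.
    fn try_acquire(&mut self) -> Result<(), ()> {
        if self.tokens > 0 {
            self.tokens -= 1;
            Ok(())
        } else {
            Err(())
        }
    }

    // Called periodically (e.g. from a timer task) to refill up to capacity.
    fn refill(&mut self, n: u64) {
        self.tokens = (self.tokens + n).min(self.capacity);
    }
}
```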
Closing this as completed in #67
As part of the current milestone, we'd like to add basic rate limiting options.
https://blog.nginx.org/blog/rate-limiting-nginx is a good basic read on how NGINX does this.
There's a couple of concepts we need to define:
- A key, e.g. `$src_ip/$uri` or something
- The `rate`, `burst`, and `delay` terms from nginx!

Unlike NGINX, I don't currently think we need to consider the "zone" or "scope" of the rules - I currently intend for rate limiting to be per-service, which means that the "zone" is essentially the tuple of `(service, key)`.
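A rough sketch of what keying a shared table by that `(service, key)` tuple could look like (names are hypothetical, and a plain counter stands in where a real limiter would live):

```rust
use std::collections::BTreeMap;

type ServiceId = String;
type ZoneKey = String; // stands in for a richer key type

#[derive(Default)]
struct Zones {
    // Placeholder u64 counter in place of a real limiter. Keying by the
    // (service, key) tuple keeps two services' limits for the same key
    // fully independent.
    map: BTreeMap<(ServiceId, ZoneKey), u64>,
}

impl Zones {
    fn hit(&mut self, service: &str, key: &str) -> u64 {
        let count = self
            .map
            .entry((service.to_string(), key.to_string()))
            .or_insert(0);
        *count += 1;
        *count
    }
}
```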
I have checked with @eaufavor, and the correct way to handle delayed requests in pingora is to have the task yield (for a timer, or activation queue, etc).