kriskowal closed this issue 8 years ago
I think we already do this:
https://github.com/uber/hyperbahn/blob/master/rate_limiter.js#L33 https://github.com/uber/hyperbahn/blob/master/service-proxy.js#L549
Maybe the feature isn't implemented properly.
We do kill switch; that works.
The issue is probably the current "total rate limit": 4k * 2 === 8k.
There probably are all kinds of bad things that happen between 4k QPS & 8k QPS before the black hole kicks in... The black hole may never kick in if it cannot even get to 8k QPS.
You should probably look at a rate limiter not based on QPS but based on say "event loop lag" or some other symptom of being 100% CPU.
In other words, rate limiting at X event loop lag and black-holing at 2X will most likely kick in properly, since 2X event loop lag is "guaranteed" to happen under sustained overload, but 8k QPS may never happen.
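A minimal sketch of the idea above: measure event loop lag with a repeating timer, shed load at one threshold, and kill-switch at double that threshold. The names here (`LagLimiter`, `shedLagMs`, `killLagMs`) are illustrative, not from hyperbahn's actual rate_limiter.js.

```javascript
// Sketch: lag-based limiter. The timer is scheduled every intervalMs;
// if the event loop is busy, it fires late, and that lateness is the lag.
function LagLimiter(options) {
    this.shedLagMs = options.shedLagMs;  // X: start answering busy
    this.killLagMs = options.killLagMs;  // 2X: black-hole requests
    this.intervalMs = options.intervalMs || 100;
    this.lagMs = 0;

    var self = this;
    var last = Date.now();
    this.timer = setInterval(function onTick() {
        var now = Date.now();
        // Lag is how much later than scheduled the timer actually fired.
        self.lagMs = Math.max(0, now - last - self.intervalMs);
        last = now;
    }, this.intervalMs);
}

LagLimiter.prototype.shouldRateLimit = function shouldRateLimit() {
    return this.lagMs >= this.shedLagMs;
};

LagLimiter.prototype.shouldKillSwitch = function shouldKillSwitch() {
    return this.lagMs >= this.killLagMs;
};

LagLimiter.prototype.destroy = function destroy() {
    clearInterval(this.timer);
};
```

Unlike a QPS threshold, lag is a direct symptom of CPU saturation, so the 2X tier is reachable whenever the process is actually in trouble.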
The total kill switch isn't a multiplier; it's just totalRpsLimit + 200. So currently 4296.
I propose lowering the service kill switch factor to 1.5 but this is something we could determine experimentally with the staging ring.
We should just make the kill switch factor configurable; then we can determine the right value experimentally.
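Making the factor configurable could look something like the sketch below, replacing the fixed `totalRpsLimit + 200` with a multiplier. `killSwitchFactor` is a hypothetical config knob, not an existing hyperbahn option.

```javascript
// Sketch: derive the kill-switch threshold from a configurable factor
// instead of the hard-coded totalRpsLimit + 200.
function killSwitchThreshold(config) {
    var factor = config.killSwitchFactor || 2;
    return Math.ceil(config.totalRpsLimit * factor);
}

// With totalRpsLimit = 4096: factor 2 gives 8192, factor 1.5 gives 6144 —
// either is far above the current 4296 (4096 + 200).
```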
Fixed in #295
The TChannel protocol specifies that clients should retry if they receive a busy frame. The rate limiter produces busy frames. There are two kinds of busy: ephemeral busyness, where a node is temporarily busy but other peers are available; and systemic busyness, where all peers are backlogged. The rate limiter indicates the latter but induces the retry behavior intended for the former.
We could alternately drop requests if a service receives 2x the rate limit. In practice, the normal rate limiter induces 5x the normal volume, by triggering some number of retries under various retry policies. A 2x black hole rate limiter would mitigate retry storms caused by the 1x rate limit. This would effectively encourage folks to use a "retry once" policy to keep under the 2x rate limit, and thus get notified immediately if they hit the rate limiter.
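The two-tier behavior described above can be sketched as follows. This is illustrative, not hyperbahn's actual service-proxy logic: below the limit, forward; between 1x and 2x, answer busy (and let the client's retry policy decide); at or past 2x, black-hole the request.

```javascript
// Sketch: classify a request by the service's current request rate.
// 'drop' means black hole — no busy frame is sent, so no retry is induced.
function classifyRequest(currentRps, rpsLimit) {
    if (currentRps >= 2 * rpsLimit) {
        return 'drop';     // black hole: shed load silently, no retry storm
    }
    if (currentRps >= rpsLimit) {
        return 'busy';     // TChannel busy frame: client may retry per policy
    }
    return 'forward';      // under the limit: proxy the request normally
}
```

With a "retry once" policy, a caller at 1.5x the limit gets a busy frame, retries once, and either succeeds on a less-loaded peer or fails loudly, staying under the 2x drop tier.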