uber-archive / hyperbahn

Service discovery and routing for large scale microservice operations
MIT License

Black hole for 2x rate limit #291

Closed · kriskowal closed this issue 8 years ago

kriskowal commented 8 years ago

The TChannel protocol specifies that clients should retry if they receive a busy frame, and the rate limiter produces busy frames. There are two kinds of busy: ephemeral busyness, where a node is temporarily busy but other peers are available; and systemic busyness, where all peers are backlogged. The rate limiter is meant to indicate the latter, but a busy frame induces the retry behavior appropriate for the former.

We could alternatively drop requests once a service exceeds 2x its rate limit. In practice, the current rate limiter induces 5x the normal volume, since busy frames cause some number of retries under a variety of retry policies. A 2x black hole rate limiter would mitigate retry storms caused by the 1x rate limit. It would effectively encourage folks to use a "retry once" policy to stay under the 2x rate limit, and thus get notified immediately when they hit the rate limiter.
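For illustration, a minimal sketch of the proposed two-tier behavior, assuming a simple per-second request counter; the names (`RateLimiter`, `rpsLimit`) and the verdict strings are hypothetical, not hyperbahn's actual API:

```js
// Hypothetical sketch: send busy frames between 1x and 2x the limit,
// and silently drop ("black hole") anything beyond 2x.
function RateLimiter(rpsLimit) {
    this.rpsLimit = rpsLimit;
    this.counter = 0;
    var self = this;
    // Reset the per-second request counter once a second.
    setInterval(function onTick() {
        self.counter = 0;
    }, 1000);
}

// Decide what the proxy should do with an incoming request.
RateLimiter.prototype.check = function check() {
    this.counter++;
    if (this.counter > 2 * this.rpsLimit) {
        // Above 2x: drop silently, so the caller times out rather
        // than retrying immediately on a busy frame.
        return 'drop';
    }
    if (this.counter > this.rpsLimit) {
        // Between 1x and 2x: signal busyness with a busy frame.
        return 'busy';
    }
    return 'accept';
};
```

Under a "retry once" policy, a caller at the limit produces at most 2x traffic and so never crosses into the black hole; anything more aggressive starts losing requests outright.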

rf commented 8 years ago

I think we already do this:

https://github.com/uber/hyperbahn/blob/master/rate_limiter.js#L33
https://github.com/uber/hyperbahn/blob/master/service-proxy.js#L549

Maybe the feature isn't implemented properly.

Raynos commented 8 years ago

We do kill switch; that works.

The issue is probably the current "total rate limit". 4k * 2 === 8k;

There are probably all kinds of bad things that happen between 4k QPS and 8k QPS before the black hole kicks in... and the black hole may never kick in at all if the node cannot even reach 8k QPS.

You should probably look at a rate limiter based not on QPS but on, say, "event loop lag" or some other symptom of being at 100% CPU.

That is, rate limiting at X event loop lag and black holing at 2X will most likely kick in properly, because 2X event loop lag is "guaranteed" to happen under saturation, whereas 8k QPS may never happen.
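A rough sketch of what a lag-based limiter along these lines could look like, using Node timer drift as the lag signal; the thresholds and names here are made up for illustration:

```js
// Sample event loop lag by measuring how late a timer fires.
var LAG_INTERVAL_MS = 500;   // sampling interval
var BUSY_LAG_MS = 50;        // X: start sending busy frames
var BLACK_HOLE_LAG_MS = 100; // 2X: start dropping requests

var currentLagMs = 0;
var last = Date.now();
setInterval(function sampleLag() {
    var now = Date.now();
    // When the CPU is saturated, the timer fires late; the overshoot
    // beyond the scheduled interval approximates the loop lag.
    currentLagMs = Math.max(0, now - last - LAG_INTERVAL_MS);
    last = now;
}, LAG_INTERVAL_MS);

function verdict() {
    if (currentLagMs >= BLACK_HOLE_LAG_MS) return 'drop';
    if (currentLagMs >= BUSY_LAG_MS) return 'busy';
    return 'accept';
}
```

Unlike a QPS threshold, lag keeps climbing as the process saturates, so the 2X trigger is reachable regardless of how many requests per second the node can actually absorb.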

rf commented 8 years ago

The total kill switch isn't a multiplier; it's just totalRpsLimit + 200. So currently 4296.

I propose lowering the service kill switch factor to 1.5, but this is something we could determine experimentally with the staging ring.
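To make the two thresholds concrete (the total limit of 4096 is implied by 4296 - 200; the per-service limit below is an illustrative value, not a real config):

```js
var totalRpsLimit = 4096;                  // implied by 4296 - 200
var totalKillSwitch = totalRpsLimit + 200; // additive: 4296

var serviceRpsLimit = 1000;                // illustrative per-service limit
var serviceKillSwitchFactor = 1.5;         // proposed, down from the 2x in the title
var serviceKillSwitch = serviceRpsLimit * serviceKillSwitchFactor; // 1500
```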

jcorbin commented 8 years ago

We should just make the kill switch factor configurable; then we can see what works, and have the ability to drop the boom if needed.
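A minimal sketch of what that could look like, assuming a hyperbahn-style options object; the option name `killSwitchFactor` is hypothetical:

```js
function ServiceProxy(options) {
    options = options || {};
    // Default to the current 2x behavior; operators can lower it
    // (e.g. to 1.5) in staging and tune from there.
    this.killSwitchFactor = options.killSwitchFactor || 2;
}

// Threshold above which requests for a service are black holed.
ServiceProxy.prototype.killSwitchThreshold =
function killSwitchThreshold(rpsLimit) {
    return Math.ceil(this.killSwitchFactor * rpsLimit);
};
```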

kriskowal commented 8 years ago

Fixed in #295