uber-archive / hyperbahn

Service discovery and routing for large scale microservice operations
MIT License
396 stars, 57 forks

For ad requests, max wait for relay-ad #290

Closed kriskowal closed 8 years ago

kriskowal commented 8 years ago

Ad requests fan out a relay-ad request to each of the k affine Hyperbahn relays for the advertiser. If any of those requests fails, typically due to a timeout, the whole ad response fails. As k increases, the probability of failure grows, and in production these failures frequently cause "storms" of "relay-ad" requests that bog down the network and trigger "ad" retries.
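To make the growth with k concrete, here is a rough sketch. It assumes each relay-ad request fails independently with the same probability p, which is a simplification not stated in this issue; the function name `adFailureProbability` and the 1% figure are purely illustrative.

```javascript
'use strict';

// Hypothetical helper: probability that at least one of k relay-ad requests
// fails, ASSUMING independent failures with per-request probability p.
// Under the current all-or-nothing rule, this is the ad failure probability.
function adFailureProbability(p, k) {
    return 1 - Math.pow(1 - p, k);
}

// Print how the ad failure rate grows with k for an illustrative p of 1%.
[1, 5, 10, 20].forEach(function each(k) {
    console.log('k=' + k, adFailureProbability(0.01, k).toFixed(4));
});
```

Under that assumption, the failure rate for small p is roughly k times p, so doubling k about doubles the number of failed ads.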

To mitigate this problem, let’s consider capping the amount of time an ad will wait for relay-ad responses, and always providing a successful response containing all of the relays that successfully responded, unless there were none or too few (at the discretion of the assignee).

I started on this in my working copy, but it’s nowhere near done. The branch just points to where the code change probably ought to happen:

https://github.com/uber/hyperbahn/compare/best-effort-ad

The logic should go like this: we wait for relay-ad responses for about 200ms (out of the 500ms budget allotted to ad requests). If all of the relays respond before the 200ms deadline, respond fast with the whole list. If the 200ms expires, respond immediately with whatever has arrived. If there are too few responses (the threshold is arbitrary), maybe return an error, maybe forward the last error. Otherwise, return a successful ad response with whatever relay-ad successes were obtained.

Use timers.setTimeout instead of global setTimeout. We may need to use the time heap since we have a lot of timers, but if simple works in prod, the feature is done.
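The logic described above could be sketched roughly as follows. This is not the code from the best-effort-ad branch: `bestEffortAd`, `sendRelayAd`, and the `maxWait` and `minSuccesses` options are hypothetical names, and `timers` is assumed to be any object exposing `setTimeout` and `clearTimeout`, falling back to the globals when none is passed.

```javascript
'use strict';

// Fallback timers built on the globals; real code would inject its own.
var defaultTimers = {
    setTimeout: function set(fn, ms) { return setTimeout(fn, ms); },
    clearTimeout: function clear(id) { clearTimeout(id); }
};

// Hypothetical sketch. sendRelayAd(relay, cb) is an assumed callback-style
// sender that calls cb(err) once per relay.
function bestEffortAd(relays, sendRelayAd, opts, callback) {
    var timers = opts.timers || defaultTimers;
    var maxWait = opts.maxWait === undefined ? 200 : opts.maxWait;
    var minSuccesses = opts.minSuccesses === undefined ? 1 : opts.minSuccesses;

    var successes = [];
    var lastError = null;
    var pending = relays.length;
    var done = false;

    if (pending === 0) {
        return callback(new Error('no relays to advertise to'));
    }

    // Deadline: respond with whatever has arrived once maxWait expires.
    var timer = timers.setTimeout(finish, maxWait);

    relays.forEach(function each(relay) {
        sendRelayAd(relay, function onResponse(err) {
            if (done) return; // late response after the deadline: ignore
            if (err) {
                lastError = err;
            } else {
                successes.push(relay);
            }
            // Fast path: every relay answered before the deadline.
            if (--pending === 0) finish();
        });
    });

    function finish() {
        if (done) return;
        done = true;
        timers.clearTimeout(timer);
        if (successes.length < minSuccesses) {
            // Too few successes: forward the last error if there was one.
            return callback(lastError || new Error('too few relay-ad responses'));
        }
        callback(null, successes);
    }
}

// Usage with a stand-in sender where relay "c" never answers.
bestEffortAd(['a', 'b', 'c'], function send(relay, cb) {
    if (relay !== 'c') setTimeout(cb, 10, null);
}, {maxWait: 200}, function onAd(err, relays) {
    console.log(err ? err.message : relays);
});
```

Injecting `timers` keeps the deadline logic independent of the timer implementation, so the simple version can be swapped for a time heap later without touching the fan-out code.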

blampe commented 8 years ago

cc @jeffbean

jcorbin commented 8 years ago

Completed prototype in the best-effort-ad branch:

jcorbin commented 8 years ago

Turned the best-effort-ad branch into WIP #293

jcorbin commented 8 years ago

Done; published in v2.15.4