An ad request fans out one relay-ad request to each of the k affine Hyperbahn relays for the advertiser. If any of those requests fails, typically due to a timeout, the whole ad response fails. As k increases, the probability of failure grows, and in production these failures frequently cause "storms" of relay-ad requests that bog down the network and trigger ad retries.
To mitigate this problem, let’s consider capping the amount of time an ad request will wait for relay-ad responses, and always returning a successful response containing all of the relays that responded successfully, unless there were none or too few (the threshold is at the discretion of the assignee).
I started on this in my working copy, but it has barely begun. The branch only points to where the code change probably ought to happen:
https://github.com/uber/hyperbahn/compare/best-effort-ad
The logic should go like this: for about 200ms (out of the 500ms budget allotted to ad requests), we wait for relay-ad responses. If all of the relays respond before the 200ms deadline, respond right away with the whole list. If the 200ms expires first, respond immediately with whatever has arrived. If there are too few responses (the threshold is arbitrary), perhaps return an error, possibly forwarding the last error received. Otherwise, return a successful ad response with whatever relay-ad successes were obtained.
Use timers.setTimeout instead of the global setTimeout. We may need to use the time heap since we have a lot of timers, but if the simple approach works in prod, the feature is done.
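
For whoever picks this up, here is a minimal sketch of the logic described above. It is not taken from the branch: the function name `bestEffortFanout`, the `sendRelayAd(relay, callback)` signature, and the `options` fields (`timers`, `deadline`, `minSuccesses`) are all hypothetical, and the real change would need to fit the existing ad handler. The `timers` object is passed in so that the code never touches the global setTimeout.

```javascript
'use strict';

// Hypothetical helper; names and signatures are illustrative only,
// not Hyperbahn's actual API.
//
// relays:      list of relays to send relay-ad requests to
// sendRelayAd: function (relay, callback(err)), assumed to exist elsewhere
// options:     {timers, deadline, minSuccesses}
// callback:    function (err, successfulRelays)
function bestEffortFanout(relays, sendRelayAd, options, callback) {
    var timers = options.timers;       // injected, e.g. require('timers')
    var deadline = options.deadline;   // e.g. 200 (ms) of the 500ms budget
    // "Too few" threshold; arbitrary, defaults to at least one success.
    var minSuccesses = options.minSuccesses || 1;

    var successes = [];
    var lastError = null;
    var pending = relays.length;
    var finished = false;

    // Start the deadline before sending anything.
    var timer = timers.setTimeout(finish, deadline);

    if (pending === 0) {
        finish();
        return;
    }

    relays.forEach(function each(relay) {
        sendRelayAd(relay, function onResponse(err) {
            if (finished) {
                return; // response arrived after the deadline; ignore it
            }
            if (err) {
                lastError = err;
            } else {
                successes.push(relay);
            }
            pending -= 1;
            if (pending === 0) {
                finish(); // everyone answered early: respond fast
            }
        });
    });

    function finish() {
        if (finished) {
            return;
        }
        finished = true;
        timers.clearTimeout(timer);

        if (successes.length < minSuccesses) {
            // One of the "maybe" options above: forward the last error.
            callback(lastError || new Error('too few relay-ad responses'));
            return;
        }
        callback(null, successes);
    }
}
```

With Node's built-in module this could be called as `bestEffortFanout(relays, sendRelayAd, {timers: require('timers'), deadline: 200}, onDone)`. Swapping in a time heap later would only mean passing a different `timers` object with the same `setTimeout`/`clearTimeout` shape.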