Thundering herd at expiration time

sysrqb commented 1 year ago

Consider adding a deployment consideration comment about how to avoid the thundering herd problem.

sysrqb commented 1 year ago

Double-check contains the following recommendation:

> All clients of the same Proxy and Desired Resource will have locally cached copies with the same expiration time. When this copy expires, all active clients will send refresh GET requests to the Proxy at their next request. Proxies SHOULD use "request coalescing" to avoid duplicate cache-refresh requests to the Origin.
>
> If the Desired Resource has changed, these clients will all initiate GET requests to the Origin (via transport proxy if applicable) to double-check the new contents. Proxies and Origins MAY use an HTTP 503 response with a "Retry-After" header to manage load spikes.
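For illustration only, here is a minimal sketch of what "request coalescing" at a proxy could look like: concurrent refresh requests for the same URL share a single in-flight fetch to the origin. The class name `CoalescingFetcher` and the `fetch_origin` callable are hypothetical and are not taken from Double-check or from any mirror implementation.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class CoalescingFetcher:
    """Share one in-flight origin fetch among concurrent requests for a URL.

    Hypothetical helper: `fetch_origin` stands in for whatever the proxy
    uses to contact the origin.
    """

    def __init__(self, fetch_origin: Callable[[str], Awaitable[bytes]]) -> None:
        self._fetch_origin = fetch_origin
        self._in_flight: Dict[str, asyncio.Future] = {}

    async def get(self, url: str) -> bytes:
        task = self._in_flight.get(url)
        if task is None:
            # First requester for this URL: start the only origin fetch.
            task = asyncio.ensure_future(self._fetch_origin(url))
            self._in_flight[url] = task
            # Forget the fetch once it finishes, so that a later
            # expiration triggers a fresh request.
            task.add_done_callback(lambda _t: self._in_flight.pop(url, None))
        # shield() keeps one cancelled client from cancelling the
        # fetch that other clients are still waiting on.
        return await asyncio.shield(task)
```

With this shape, a burst of clients refreshing at the same expiration time results in one origin request per resource rather than one per client. It does nothing about the load the proxy itself receives.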

Alternatively, clients could re-check/refresh/update their cached keys at some (relatively small) offset, chosen at random, from the expiration time. This wouldn't prevent a spike around the expiration time, but this would at least smooth the spike a bit. Ben Schwartz has an interesting idea of how to do this safely.
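As a rough sketch of the random-offset idea, a client could compute its refresh time as below. The function name and the `max_offset` parameter are hypothetical. The sketch also assumes the offset falls before the expiration time, which the comment above leaves open.

```python
import random
from typing import Optional


def jittered_refresh_time(
    expires_at: float,
    max_offset: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a refresh time in [expires_at - max_offset, expires_at].

    Assumption: the offset is applied before expiration; times are
    seconds on whatever clock the caller uses.
    """
    if max_offset < 0:
        raise ValueError("max_offset must be non-negative")
    rng = rng or random.Random()
    # Uniform offset, so refreshes spread evenly over the window.
    return expires_at - rng.uniform(0.0, max_offset)
```

For example, `jittered_refresh_time(expires_at, 60.0)` spreads refreshes over the final minute of the cache lifetime instead of concentrating them at a single instant.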

bemasc commented 1 year ago

The best solution I've thought of is to define each profile as providing a "ladder" comprising a pair of resources (perhaps distinguished by a request header). The client fetches both, which are cached independently, and verifies that they indicate the same configurations during the overlap in their cache lifetimes. After the first expiration, the client picks a random time before the second one expires to refresh the first one. This ensures that refresh requests are not clustered near expiration times, and allows us to continue using the Mirror Protocol as-is. (Mirrors do not need to be aware of the ladder semantics.) However, it doubles the number of requests and complicates the deployment for origins.
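A minimal sketch of the client-side logic, under stated assumptions: the `CachedCopy` type and both function names are hypothetical, "same configurations" is approximated as byte equality, and each copy's lifetime is taken to run from `fetched_at` to `expires_at`.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachedCopy:
    """One rung of the ladder as cached by the client (hypothetical type)."""
    config: bytes
    fetched_at: float
    expires_at: float


def overlap_is_consistent(a: CachedCopy, b: CachedCopy) -> bool:
    """True if the two lifetimes overlap and both copies carry the same config.

    Assumption: equality of the raw bytes stands in for "same configuration".
    """
    overlap_start = max(a.fetched_at, b.fetched_at)
    overlap_end = min(a.expires_at, b.expires_at)
    return overlap_start < overlap_end and a.config == b.config


def pick_refresh_time(
    first_expiry: float,
    second_expiry: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a random time after the first copy expires and before the second does."""
    if not first_expiry < second_expiry:
        raise ValueError("first_expiry must precede second_expiry")
    rng = rng or random.Random()
    return first_expiry + rng.random() * (second_expiry - first_expiry)
```

Because each client draws its own refresh time from the whole interval between the two expirations, refreshes of the first resource are spread across that interval rather than clustered at its start.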

tfpauly commented 11 months ago

Another approach is to generally recommend that clients fetch resources from the mirror as they need to use the resource (if I don't need to fetch a token, then I don't check the key when the key expires). For these less frequent cases, they shouldn't have a thundering herd.

For resources that are used very frequently, clients should avoid fetching the mirror resource at the exact time of expiration, and should have some jitter around that (potentially aligned to when they'd fetch a resource based on use anyway).

bemasc commented 11 months ago

> For these less frequent cases, they shouldn't have a thundering herd.

Yes: if the use frequency is long relative to the cache lifetime, then we don't expect clustering of refreshes at expiration points. (Assuming on-demand validation, which seems like the reasonable default despite some concerns about timing correlations.)

> For resources that are used very frequently, clients should avoid fetching the mirror resource at the exact time of expiration, and should have some jitter around that

I don't think this works. Fetches prior to the expiration time don't help: they just return a mirror resource that is about to expire. Using the fetched resource after the expiration time is unsafe in two ways: it may be invalid (leading to failure) and you may be the only client still using it (leading to deanonymization).

sysrqb commented 11 months ago

Thanks for raising this issue, Ben, and for giving some guidance on how to solve it. I wrote some text around this. It takes a simpler approach that puts some responsibility on the service/origin, instead of describing a more complicated client-side chaining/ladder solution. Some implementations may benefit from chaining multiple resources with different expiration periods, but that seems out of scope for the core protocol at this time. If the group is interested in specifying this, then we will discuss adding it in a later version of the draft.

sysrqb commented 10 months ago

Happy to continue the discussion from https://github.com/chris-wood/draft-group-privacypass-consistency-mirror/pull/21#discussion_r1365720273 if there is more that we need to consider.

bemasc commented 10 months ago

Ultimately, the options I see are:

1. There is only one copy of the resource in use. Thundering herd at replacement time.
2. There are two or more copies of the resource in use with staggered lifetimes. No thundering herd, but someone has to check that both copies are equivalent during the overlap in their lifetimes.
   - Clients fetch both and check: "ladder" strategy as above (https://github.com/chris-wood/draft-group-privacypass-consistency-mirror/issues/8#issuecomment-1711918025)
   - The mirror understands the resource and performs the equivalence check before updating its cache (a "smart mirror" strategy).

The "ladder" strategy seems preferable to me because the "smart mirror" approach has some issues:

- It creates ossification risk (mirrors need to add support for new formats before they can be used).
- It doesn't support novel uses of mirrors beyond consistency checking.
- It creates a risk of consistency failures if the mirror and the clients have different interpretations of the resource.

ietf-wg-privacypass / draft-ietf-privacypass-consistency-mirror

Thundering herd at expiration time #8