Define "retry" logic - Githubissues

sciros commented 3 months ago

There is some language in the requirements around retrying a corroboration, but it is so far limited to the following (paraphrasing):

A retry of validation on a single perspective is not allowed. If a retry is required, all corroborations from previous attempts must be invalidated, and all perspectives need to retry.

The current API specification puts a further constraint on how a retry would function, by forcing a non-predictable-yet-deterministic selection of perspectives from the pool of known perspectives. This will select the same set of perspectives every time a check is run. It effectively mitigates the risk of an adversary retrying many a check times until enough hijacked routing tables are checked to result in a false quorum (with the assumption that the check selects a truly random set of perspectives each time).

This mitigation technique has the disadvantage of constraining what legitimate retry logic can be implemented, as it effectively forces the same perspectives to be tried over and over.

An alternative approach to mitigating the same risk, while enabling retry logic that interrogates distinct sets of perspectives, is to constrain the number of retries. This carries its own implementation complexity, however, since tracking the number of corroboration attempts per domain or IP target is a stateful concern and serverless functions should be stateless. And leaving tracking of retries as a burden for the API client to manage (i.e. not having an out-of-the-box risk mitigation against brute-force retries) is undesirable.

So, what designs are available to us if we want to enable effective retries while limiting the number of retry attempts?

I propose the following idea (@birgelee this may be more-or-less what you already have in mind):

Given a set P of all known perspectives, an interrogation subset size (perspective count) C <= |P| and ⌊|P|/C⌋ = N (integer division):

If N = 1, i.e., two disjoint subsets of perspectives of size C cannot be defined, then for any given domain or IP target T, a subset S ⊆ P should be defined based on a function f where S = f(P,C,T) and the same values for P, C, and T always yield the same S. Furthermore, S should not be easily predictable. Example: P = {p0, p1, p2, p3, p4, p5, p6, p7}, |P| = 8, C = 6, and ⌊|P|/C⌋ = 1. Then one valid subset may be S = {p1,p2,p3,p4,p5,p6}. And it must be the same subset for the same target T each time.
Else if N > 1, i.e., N disjoint subsets of perspectives of size C can be defined, then it is valid to define N subsets and for corroboration to interrogate any of these subsets. This list of subsets should be defined based on a hashing function h where [S1..SN] = h(P,C,T) and the same values for P, C, and T always yield the same list of subsets [S1..SN]. Furthermore, the elements of each subset should not be easily predictable. Example: P = {p0..p13}, |P| = 14, C = 6, and ⌊|P|/C⌋ = 2. two valid subsets are S1 = {p1,p2,p3,p4,p5,p6} and S2 = {p7,p8,p9,p10,p11,p12}, with S1 ∩ S2 = { }. And they must be the same subsets for the same target T each time.

Scenario 1 above is how the code currently works for all values of N. Scenario 2 will require defining a function h that can deterministically sort the elements of P into N buckets of size C. It may also be a good idea to have the subsets ordered deterministically relative to each other.

Example: P = {p0..p19, |P| = 20, C = 6, ⌊|P|/C⌋ = 3. Given T = example.com, h(P,C,T) always returns [{p1,p4,p7,p10,p13,p16}, {p2,p5,p8,p11,p14,p17}, {p3,p6,p9,p12,p15,p18}].

Such a setup allows for a max-attempts field to be made available in the API spec.

The MPIC service can first attempt corroboration with the first subset from h.
If it fails, then the service can carry out up to max_attempts - 1 retries as part of a single API call, since it can track what subset Sx is used for each remote corroboration attempt.
The service can have baked-in retry logic to use a previously-unused subset until no unused subsets remain.
If no max-attempts parameter is set, it can just use the first subset for its single corroboration attempt.

This would give the API client a certain level of flexibility. The client can retry as many separate API calls as they like, however each separate call will use the same set of remote perspectives. And the client can execute several corroboration attempts per API call, configured by max-attempts, with the above constraints on how perspective "cohorts" are determined.

It would also not force an upper bound on the number of retries, something that is good about the current implementation as well, while providing some more value to multiple retries, up to a point (effectively limited by how many remote perspectives the API owner is willing and able to deploy).

This setup also makes it as simple as it is today to meet the BR specification that any previous corroboration results must be discarded when executing a retry.

Edit: Adding that in all cases the perspective groups selected would adhere to BR requirements specifying a minimum of 2+ corroborating RIRs, and a minimum distance of (currently) 500km between perspectives.

birgelee commented 3 months ago

I like this approach. I would add one point. There may still be some value in retries if N = 1 so long as P > C. Essentially this is the case where there are more available perspectives than the number being chosen but there are not twice as many so that two distinct cohorts can be produced. Imagine deployed perspectives P = 10 and chosen perspectives C = 6.

Here the perspective counts cannot be divided into multiple cohorts. However, there is still substantial potential for use of alternate perspectives. To improve performance in these cases, I propose an additional form of retry logic to be used when N = 1 and P > C.

Assume the chosen perspectives C are deterministically chosen given a random seed. On attempt one generate the seed from the hash of h(domain | 0) where | represents concatenation. On attempt two generate the seed from h(domain | 1) etc. Limit this behavior based on the max attempts value.

I think when N>=2 the proposed solution with cohorts is preferable because it ensures that for every retry no previous vantage points will be reused. However, I think there may be value in offering some retry logic even when N=1

My recommendation for the code would probably be a hybrid approach, the system attempts to derive cohorts if possible, but otherwise relies on this type of hash salting.

sciros commented 3 months ago

Seems like a good addition to me. So, overall I'd say this is fairly complex logic. As far as the API spec is concerned, I suppose the only real implication is the new max-attempts parameter. Or should we describe the entirety of this logic within the API spec? It feels complex enough that it may even warrant some kind of documentation based on formal notation (if not in the API spec, certainly in the open source implementation's docs). I'm not really practiced at that. Know anyone who is?

birgelee commented 3 months ago

I would like to keep the logic out of the API spec. other implementations might conform to the spec but not implement logic this complex. I would add a max-attempts param to the API spec and keep it at that.

There should be some documentation. I'm not sure if there is a particular formal notation you had in mind. In a paper we would probably represent something like this in pseudocode. I could begin producing some documentation along with the implementation.

Another point: I would probably also recommend two separate parameters to control retry logic. The first is max-attempts which is a very simple control on how many times the API should retry. The second I would think of as something like max-perspective-sets which defines a hard cap on how many distinct sets of perspectives will be used for any particular identifier. If max-attempts > max-perspective-sets some number of retries with the same sets of perspectives may be used.

sciros commented 3 months ago

I agree that this logic doesn't have to be in the spec. (There's some documentation in the current spec regarding logic to prevent retry-based attacks within the perspective-count description, but we can revisit that later if it needs tweaking.)

The tests I plan to write will document the logic to some extent. Beyond that, I'll probably look at using some sort of formal notation, including set notation, just to keep things concise, and add some specification by example.

As for the max-perspective-sets suggestion, unless I'm not thinking about it right, that only makes sense for the N = 1 use case and only if the re-seeding retry logic is used as you suggested, and only if at some point you want the retries to start re-using the same perspective set before you run into the otherwise theoretical limit of perspective combinations which is P choose C. (Though that limit could still be reached if max-perspective-sets is set high enough.)

For N > 1, the number of perspective sets used is N since the idea is the sets are disjoint. Unless, that is, you are thinking of generating a new list of disjoint sets once the first one is exhausted with retries.

In both cases, unless my thinking about the algorithm is wrong (and that's very possible on a Sunday afternoon) this feels to me like a pathway to defeating the idea of deterministic perspective selection (in fact I'd say that about the N = 1 with re-seeding logic just on its own as well), unless there's an internal limit to max-retries that keeps you from combining, say, 6 perspectives out of 10 in 210 different ways. The current logic (and the fixed disjoint sets) are provably resilient against a sufficiently high number of retries.

And, I kind of feel this additional parameter adds an element of complexity to what the client should even think about doing with these parameters, with rather little intuition around how these two parameters might interplay. But that's just my initial thinking on it.

birgelee commented 3 months ago

I was thinking about this a bit more and there were two points I would bring up.

I think it probably makes sense to remove any reseeding logic if we opt for the cohort behavior simply to avoid having two relatively complex retry systems implemented. Towards this end I am currently leaning towards only using cohorts when N >= 2 and using fully deterministic selection when N = 1
I would still recommend a max-perspective-sets parameter not in API calls but in some sort of configuration. It could even be a .py config file that is not intended to be edited by the person doing the deployment. I would think of the number of perspective sets being max(N, max-perspective-sets) , It is simply a way of putting an upper bound on N because high N can cause a security degradation. I also do not like the idea of the number of sets used increasing if the API is deployed in more regions. I know some CAs have discussed having a ton of perspectives to cause a high degree of variability in terms of where the MPIC requests come from. However, if N alone determines perspective set counts, this high region deployment will counterintuitively degrade security because it will increase N.

I would personally go for max-perspective-sets = 2 as this is the strongest security stance that still permits retries. In other words, even if 15 perspectives are deployed and 5 are used (potentially allowing for 3 cohorts), only two will ever be used for any given identifier even though this deployment has N=3.

Also, to clarify I think cohorts should be selected deterministically based on a seed derived from the identifier queried.

sciros commented 3 months ago

Ok, I see what you're getting at. So for the API spec, it's just a max-attempts parameter at the top level and the rest is implementation-specific, with perhaps some "must"/"should" language like the spec currently has about implementations.

There is one other implication here for the spec: What about the perspectives argument, which lets the caller explicitly list the specific perspectives to interrogate? This opens up a way to get around everything we've been discussing, so I think that either this parameter gets eliminated altogether, or is retained as a convenience parameter for diagnostics/development. In that case I would propose that we have a diagnostic-mode flag parameter which, if set to true (or we can have like a mode (= "production" or "diagnostic") parameter which defaults to "production", whatever approach you prefer) lets the caller do a couple of things that a "production" mode configuration would not: one is to set specific perspectives (disabling the deterministic perspective set selection) and the other is to set request validation to "structure-only" (disabling value-based validation that would look at things like quorum count or whether the explicitly named perspectives are in the list of known perspectives and so forth).

(I do want to discuss the max-perspective-sets question further, but I also think it's implementation-specific, so that part of the conversation I'll continue by opening an issue in the reference implementation repo.)

birgelee commented 3 months ago

I am in favor of perspectives being moved to diagnostic only. Its quite helpful for some internal API call work we have done at Princeton so I would prefer that over it getting removed.

I would propose an optional max-attempts parameter under system-params since it affects both validation and caa behavior.

birgelee commented 3 months ago

Max-attempts is now implemented in the retry-logic branch . If this looks good I will merge it in. I propose additional retry discussion move to the implementation repo.

sciros commented 3 months ago

Thanks @birgelee, that looks good.

birgelee commented 3 months ago

Thanks @sciros . I merged in the branch and updated the version to 1.0.2 . I am closing this discussion here and additional discussion I propose move to the implementation repo.

open-mpic / open-mpic-specification

Define "retry" logic #3