trustoverip / tswg-keri-specification

Key Event Receipt Infrastructure (KERI)
https://trustoverip.github.io/tswg-keri-specification/

refine agreement table in 9.4.5, Immunity and Availability #204

Closed dhh1128 closed 1 month ago

dhh1128 commented 3 months ago

As far as I can tell, the table that appears in this section contains columns adapted from figure 11.4 on page 98 of https://arxiv.org/pdf/1907.02143. However, the discussion that precedes the table, in this spec, does not explain the significance of 3F+1 or of (N+F+1)/2 in the analysis; readers would have to go to the PDF to understand them. If the text isn't going to explain those columns, I suggest that we remove them because they are confusing.

The table doesn't have a row for N=5, which confuses me. Perhaps this is because the introduction of F* into the analysis makes this table diverge from the one in the whitepaper for some reason I don't understand?

The discussion in the spec introduces something that I don't see in the section on agreement in the PDF, which is the idea of availability. There is no mention of F* in the whitepaper. If I understand the whitepaper correctly, "agreement" requires fully receipted proof by a sufficient number of witnesses, and requires the availability of the controller, so availability at the time the event was promulgated isn't the availability this section of the spec must be talking about -- right? But if it is talking about availability at the time the event is later queried, don't we just need to query any single witness that can prove prior agreement among at least M witnesses? If so, why is F* even mentioned?

I am happy to submit a rewrite, but I want to be sure I understand the intent here, before I do.

SmithSamuelM commented 3 months ago

@dhh1128 The relevant section starts with... It can be shown that for any set of N witnesses, (see [[ref: KERI-WP ]]) there is a threshold M < N that guarantees that at most one sufficient agreement occurs or none at all despite a dishonest controller but where at most F* = N-M of the witnesses are potentially unavailable and at most F < M is duplicitous.

The result comes from a formula that uses F, F*, M, and N. This formula is not included in the spec because it is defined in the reference. So this section of the spec is citing a way for a controller to ensure sufficient agreement given faults. It is up to the controller to decide the policy by which it will be held accountable, via the threshold and witness list. A validator can then decide if it's sufficient for them to trust. In this sense it is informative.

Likewise the table only lists some of the values one could generate with the formula and is not exhaustive.

The phrase "It can be shown" tells the reader that the result is being cited but not explained or proven in detail as that would take several pages. It is assumed that if the reader wants to verify the result they need to be familiar with the cited reference. Similarly when the spec uses the word Internet it is assumed that the reader is familiar with the term.

The question I struggled to answer when writing the spec was how much was enough for the spec. Specs aren't tutorials; they are specs and have to make assumptions about the terminology. I can see that another paragraph that goes into more detail on the difference between F and F* may be helpful, but then so would a tutorial on Byzantine Agreement algorithms.

For example the white paper has this paragraph:

With the inequality above, there may be more than one solution for M for any given N and F. In that case, we can have an effective number of unavailable witnesses denoted F* which is given by F* = N - M (0.1). There are two types of faults we care about with regard to F and F*. One type is unavailability, in which an otherwise honest witness is not available to publish a receipt for a first-seen event. The other type is duplicity, in which a witness publishes a receipt for more than one version of an event given more than one version of an event has been produced by a duplicitous or compromised controller. The inequality means that while no more than F witnesses are allowed to be duplicitous, in some cases F* ≥ F honest witnesses may be unavailable.

But to understand what an HONEST witness is, versus a DUPLICITOUS witness requires yet more paragraphs.

An alternative would be to simply cite the reference and state in the spec that a Controller MAY choose to use the KAWA algorithm for threshold selection, and leave everything else out. It is in an appendix as informative but not exhaustively so.

SmithSamuelM commented 3 months ago

FYI: A duplicitous (maliciously dishonest) witness could witness more than one version of the same event. This means that effectively the size of N is expanded. Suppose that N is 7, M is 5, and F is 2. When "voting", the two dishonest witnesses each vote for two different versions of the event, so a total of 9 votes would be cast: 5 votes by the honest witnesses and 4 votes by the two dishonest witnesses. This means that 5 is still sufficient to guarantee that only one version can satisfy the threshold given at most two dishonest witnesses. But I could expand my witness pool N to 8 with a threshold of 5 by allowing at most a total of 3 faulty witnesses, where the 3 faulty witnesses were unavailable, not dishonest. This is F*. So the controller determines what level and types of faults it wants to be responsible or held accountable for. The algorithm tells the controller and validators what the possibilities are given the N and M. It doesn't guarantee that there won't be more than F or F* faults, merely that at most F or F* faults could occur and still guarantee agreement.

We can also consider combinations of F and F*. Suppose at any given point in time we have at most F = 1 dishonest witness and at most F* = 2 unavailable witnesses, but we expand to N = 8 while keeping M = 5. Now, the worst cases would be

1) The one dishonest witness (F = 1) votes twice and all 7 honest witnesses vote. This gives a total of 9 votes. The threshold of 5 is still a majority, so we have a guarantee of agreement.

2) The dishonest witness votes twice and two of the honest witnesses are unavailable. So we have 7 votes (5 from the available honest witnesses and 2 from the one dishonest). So we have either 5 votes that are in agreement or no agreement at all.

Recall that the guarantee for a given N, M, F, or F* is that either there is one agreement or no agreement at all, never multiple agreements.

Now, with a pool of 8 and a threshold of 5, I can't have two dishonest witnesses at the same time and still satisfy the guarantee because the number of votes, in the worst case, would be 10 with 6 honest + 4 (2 dishonest voting twice each) and 5 is no longer a majority.
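The worst-case vote counting in these examples can be sketched as a quick check (a hypothetical helper, not the whitepaper's exact formulation): each of the F duplicitous witnesses may receipt two versions, so up to N + F receipts exist, and a threshold M that is a strict majority of N + F ensures at most one version can reach it.

```python
def agreement_guaranteed(n, m, f):
    """Check that threshold m yields at most one agreement in the worst case.

    Each of the f duplicitous witnesses may receipt two versions of an
    event, so up to n + f receipts can exist. If m is a strict majority
    of n + f, no two versions can both gather m receipts.
    """
    return 2 * m > n + f

# N=7, M=5 tolerates F=2; N=8, M=5 tolerates only F=1.
print(agreement_guaranteed(7, 5, 2))  # True
print(agreement_guaranteed(8, 5, 2))  # False
```

This matches the arithmetic above: 10 votes > 9 worst-case votes for N=7, F=2, but 10 is not a majority of the 10 worst-case votes for N=8, F=2.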

The formula just informs the controller and validators what types and levels of faults F or F* are allowable given N and M.
Although a Validator may hold the controller ultimately accountable for any duplicity of its witnesses, a Validator of a high-stakes event may choose to watch enough witnesses to guarantee that the actual level of duplicity satisfies the Validator's trust level before making a trust decision.

dhh1128 commented 3 months ago

In the version of the spec published at https://trustoverip.github.io/tswg-keri-specification, the reference to KERI-WP points at a term, not at an entry in the bibliography. When you hover, its URL is https://trustoverip.github.io/tswg-keri-specification/#term:keri-wp. I see now that what's checked in in the markdown has a biblref instead of a termref, so I would assume the intent was to point to item 4 in the bibliography? The string "KERI-WP" only occurs in the markdown at that one referencing location; there is no apparent referent for it. It seems that either the transformation process is generating URLs wrong, or we have stale content published.

The version of the KERI whitepaper that is referenced in both the published HTML and currently checked in markdown of the spec is v15 from arxiv.org: https://arxiv.org/pdf/1907.02143. In that version, some of what you quoted as whitepaper content doesn't appear. For example, I could not find the phrase "there may be more than one solution", and I could not find the string "F*". I assume that you are citing a newer version of your paper, perhaps the one published in your github repo? If so, should we update the link in the spec?

I get that the spec and the whitepaper have different purposes, and that the spec references the whitepaper to provide more context. However, I still think the spec should minimize references to terms or mathematical expressions that cannot be understood from the spec itself. Perhaps that goal is not always possible, but in this case, I think it's easy to achieve. Either:

  1. as you said in your final paragraph: simply cite the reference and state that a controller MAY choose to use the KAWA algorithm. OR
  2. remove the columns from your table that have a significance not explained in the spec: 3F+1 and (N+F+1)/2.

I think less is more, so I have a preference for solution 1. But I'm happy to submit a PR for either solution.

dhh1128 commented 3 months ago

FYI: A duplicitous (maliciously dishonest ) witness

What is confusing to me in this explanation is the element of time. Does "voting" happen as witnesses are coming to agreement, or does it happen later, when a witness decides which version of reality to report to a validator? Is F*, which accounts for unavailable witnesses, accounting for witnesses that are unavailable at the time the controller promulgates the events and waits for agreement -- or to witnesses that are unavailable later, when a validator asks for the KEL?

SmithSamuelM commented 3 months ago

The spec should be referencing version 2.62 of the WhitePaper which is the one on my github repo.
I can go either way on the table.

dhh1128 commented 3 months ago

Okay, I'll raise a PR.

dhh1128 commented 3 months ago

Both https://github.com/SmithSamuelM/Papers/blob/master/whitepapers/KERI_WP.web.pdf and https://github.com/SmithSamuelM/Papers/blob/master/whitepapers/KERI_WP_2.x.web.pdf are at version 2.60 rather than 2.62. Should I be looking elsewhere?

SmithSamuelM commented 3 months ago

@dhh1128

What is confusing to me in this explanation is the element of time.

This is a good question. As the white paper points out, KAWA does not have a liveness guarantee, which means there is no time bound driving agreement. A validator either trusts or does not yet trust, and "not yet" could go on forever. A controller may recover by doing a rotation that replaces witnesses.

An honest witness can only see one version of a key event, so it will only report that version and attach receipts from other witnesses to that event. If an event first seen by a witness never gets enough receipts to reach the threshold, then there is no agreement as far as that witness's reporting of the event is concerned. But the witness only validates that the controller signatures are valid, and if there is an authentication factor for receiving an event from a controller, the witness also verifies that before "seeing" the event. Once seen, the witness witnesses it regardless of what other witnesses have done or may do. The "consensus" is only with respect to what an external validator of the event sees. Should an unavailable witness later become available and provide a receipt, then that event may at that later time show agreement.

So for example, suppose a validator receives an event from a witness with attached receipts. The validator can verify that: 1) the controller signatures satisfy the controller threshold for signing against the current key list at the sn of the event; 2) the witness receipts verify against the witness keys; and 3) the number of receipts does or does not satisfy the threshold for witnessing.

If the receipts do not yet satisfy the TOAD (threshold of accountable duplicity, or witness pool threshold), then the validator does not yet trust the event and does not accept it into its copy of the controller's KEL. The event is held in escrow. The validator is thereby prevented from making a trust decision with respect to any sealed data in that event.

Several possibilities for the future can then occur.

  1. At some time in the future the event becomes fully receipted, i.e. the validator may eventually receive enough receipts that the event satisfies the threshold and the validator accepts it into its copy of the controller's KEL thereby enabling a trust decision.
  2. The event is never fully receipted so the validator never makes a trust decision
  3. The validator detects duplicity with respect to the event and therefore decides not to trust until the duplicity is reconciled to the validator's satisfaction.
  4. The controller determines that the witness pool is sufficiently faulty that it does a rotation recovery, disputing the event as well as reconstituting the witness pool. So a new event that anchors the same trust decision data eventually is fully receipted and propagates to the validator, so the trust decision is able to proceed, albeit with a different event, since the original event never was accepted.
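The accept-or-escrow decision described in these steps could be sketched like this (hypothetical structures and helper names, not the keripy API):

```python
def verify_receipt(event, receipt):
    # Placeholder check: a real validator verifies the witness signature
    # in the receipt against the event's witness key list.
    return receipt.get("sig_ok", False)

def on_event(event, receipts, toad, kel, escrow):
    # Count valid receipts; accept the event into the KEL only when the
    # TOAD (witness threshold) is satisfied, otherwise hold it in escrow.
    valid = [r for r in receipts if verify_receipt(event, r)]
    if len(valid) >= toad:
        kel.append(event)                   # fully receipted: accept
        escrow.pop(event["said"], None)     # release any escrowed copy
        return "accepted"
    escrow[event["said"]] = (event, valid)  # not yet trusted: escrow
    return "escrowed"
```

An escrowed event can be re-evaluated whenever more receipts arrive, which is how possibility 1 above (eventual full receipting) plays out.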

So how might duplicity exhibit itself?

First, to keep it simple, let's assume the controller is honest. The controller picks a witness threshold. Recall that the validator never "sees" the event until it is fully receipted, i.e., the number of witness receipts satisfies the threshold. So there is no "seeable" duplicity (in the strictest sense) wrt the validator for any event where receipts are below the threshold. But as soon as a validator first sees an event, it can't see any other version of that event. So it can't "see" any duplicity. Therefore, for a validator to detect (not "see") duplicity, it would have to use, or act as, a watcher that is watching for duplicity, i.e., keeping track of duplicitous events in addition to those it first saw. A watcher who is doing this is called a Juror.

So a juror can detect (not "see") duplicity and then report that duplicity.

In the case of a witness pool, minimal duplicity is exhibited via an alternate version of the event that is witnessed by any witness. I.e., some subset of witnesses witness a version of the event that is different from the version witnessed by any other subset of witnesses.
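A juror/watcher's detection of such an alternate version can be sketched as tracking every event digest seen at a given key-event location (hypothetical structures, not the keripy API):

```python
def detect_duplicity(seen, event):
    # Record every event digest observed at a given (identifier prefix, sn)
    # location. Two different digests at the same location is duplicity.
    key = (event["pre"], event["sn"])
    digests = seen.setdefault(key, set())
    digests.add(event["said"])
    return len(digests) > 1
```

The first version seen at a location is never duplicity by itself; only a second, different version at the same location triggers detection.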

When duplicity is below the threshold, validators and controllers are protected because no validator will "see" duplicity below the threshold. This makes witnessing fault-tolerant. However, a controller can still detect it and take action to mitigate it even when it is below the threshold. Likewise, a validator may decide not to ignore duplicity and be proactive even when it is below the threshold.

When the duplicity is above the threshold then bad stuff can happen. This means that two versions of the same event are fully witnessed from a validator's standpoint.

In a two-way transaction between a controller and validator, the version first seen by the validator wins. If the first version seen by the validator is the uncompromised version, then no harm, no fault, but the controller can detect it and repair the fault for the future without impinging on the validator.

If the first seen is the compromised version then the controller is accountable for any trust decision the validator makes as a result. The controller can still repair the fault by doing a rotation recovery but would have to get the validator to also redo the trust decision with a new transaction.

Also problematic is a three-way transaction in which two different validators "see" two different versions of sealed data. They can still protect each other by exchanging copies of KELs before making trust decisions. But if they don't, they could be induced to make errant trust decisions.

The threshold performs several functions:

  1. TOAD (see above)
  2. Security Threshold
  3. Level of fault tolerance to dishonest (malicious) witnesses
  4. Level of fault tolerance to honest but unavailable witnesses

So picking N, M, F, and F* impacts to varying degrees those four functions.

For example a witness pool of size N=4 with TOAD M=3 has a fault margin of 1 for dishonest witnesses and a fault margin of 1 for unavailable witnesses. This does not mean that duplicity is prevented it just defines the margins, how many faults could occur and still make guarantees.

Suppose instead that for a pool size N = 4 the TOAD M = 1. This means that there is no margin of safety for a malicious witness. Even one malicious witness can produce an alternate event that is fully witnessed and satisfies the threshold. But the availability margin is 3. The pool can have up to 3 honest but unavailable witnesses and events can still be fully witnessed. So high availability but zero fault tolerance to compromised witnesses. A validator making a trust decision may not want to engage in that case. Or the risk of the type of transaction may be such that it is fine. For example, if the controller is using multisig and has witnesses with independent authentication factors for accepting new events from controllers, then the risk of a compromised witness may be so low that zero fault tolerance is acceptable.

An EGF might specify constraints on allowable combinations of N, M, F, and F*.
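As a rough sketch of how such constraints interrelate, the margins in the examples above can be derived from N and M alone, assuming F must satisfy the two inequalities the spec's table columns reflect (N ≥ 3F + 1 and M ≥ (N + F + 1)/2); this is a hypothetical helper, not the whitepaper's exact formulation:

```python
def fault_margins(n, m):
    """Return (F, F*) for a pool of n witnesses with threshold m.

    Assumes F is the largest value satisfying both n >= 3F + 1 and
    2m > n + F (i.e. m >= (n + F + 1)/2), and that F* = N - M.
    """
    f = max(0, min(2 * m - n - 1, (n - 1) // 3))
    return f, n - m
```

This reproduces the examples in this thread: (N=4, M=3) gives (F=1, F*=1), (N=8, M=5) gives (F=1, F*=3), and (N=4, M=1) gives (F=0, F*=3).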

SmithSamuelM commented 3 months ago

I need to push 2.62, sorry.

SmithSamuelM commented 3 months ago

A typical distributed consensus algorithm with a liveness guarantee is duplicity hiding. KAWA preserves the critical property of being duplicity evident.

What I mean by duplicity hiding is that one version of an event that is correctly signed is always guaranteed, within a time bound, to be agreed upon (subject to the fault constraint). Two correctly signed but different events are duplicity. But the ledger only ever provides one.

This means a Validator never sees that there ever were alternate versions of that same event that were being voted upon but were discarded. Likewise, a controller posting an event may never see that its keys have been compromised unless and until the compromised version of the event wins the voting and is the one accepted. By then it is too late.

Now from a systems design perspective, how likely is F versus F*? F is highly unlikely while F* is highly likely. So NOT informing the users that F* is a design variable might be unsatisfying. Nonetheless, F includes F* as a subset, just with a more stringent threshold, so assuming that F* never exceeds F is a conservative design constraint, but it may be too conservative.

In order for a duplicitous witness fault to occur, two compromises must have happened: a sufficient majority of the Controller's signing keys must be compromised, and a given witness's authentication factor with respect to accepting events from its controller must have also been compromised. Only then can a malicious attacker publish a verifiable receipt for an alternate version of an event.

In order for an unavailable witness fault to occur, no key compromise is needed, the witness just has to be offline.

There are two types of unavailability faults: the first is intermittent and the second is permanent. In the first case, when the witness comes back online it repairs its own fault; recovery doesn't require any action by the controller. But in the second case, the witness never comes back online, or is offline long enough that the controller must take action and replace the witness with a new witness that works. The second fault could occur because of location, legal, or fiscal issues. For example, a witness host goes out of business and the private key goes with it, so there is no way to bring it back online. Or the private key gets lost because it was only ever in memory, so when the witness is rebooted the witness becomes permanently unable to witness any receipts.

So a transaction between a controller and some validator that depends on the validator accepting the latest key state will be delayed indefinitely if too many witnesses are permanently offline, or delayed long enough that the validator gives up if too many witnesses are temporarily offline during the time frame of the transaction. So F* may matter as a design constraint. A pool of 5 with a threshold of 4 has an F of 1 and an F* of 1. But a pool of 4 with a threshold of 3 also has an F of 1 and an F* of 1, so the extra witness buys no more fault tolerance. So 5 is not a good example of a pool size. For a pool of 5, 3 is not a good threshold, as F must be zero to guarantee agreement.

Whereas a pool size of 6 can have an F of 1 and an F* of 2 with a threshold of 4. So it gains some more fault tolerance in return for more nodes.

So the table I put in the spec was simplified to the more optimal combinations as I saw them.

SmithSamuelM commented 3 months ago

What I found when reviewing the state of the art with respect to distributed consensus algorithms and shared ledgers is little to no consideration of key management faults and how to mitigate and protect against them. This is why I invented KERI and duplicity-evident systems and algorithms (KAWA): so that key management faults are elevated to a primary fault we care about, and duplicity evidence is the primary trigger for mitigation via rotation recovery and the primary trigger for protection by not trusting, i.e., withholding a trust decision until the duplicity has been reconciled.

You can find some good discussion by Mazieres in his write-ups and presentations on the Stellar algorithm of why one may want to have different safety vs. liveness margins and guarantees in consensus, i.e., making tradeoffs between the two. From that I realized that for trust decisions, often the best policy is to not trust, i.e., the margin is 100% safety and 0% liveness. Never make a trade for liveness at the cost of safety.

SmithSamuelM commented 3 months ago

@dhh1128 I pushed the 2.62 version of the white paper. It was updated in January; I just forgot to push it.