mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0

FlyClient support via a modified header "Variable Difficulty MMR" #3479


antiochp commented 4 years ago

FlyClient paper (revised Aug 2020): https://eprint.iacr.org/2019/226.pdf

To enable handling difficulty-based sampling, we need to make two adjustments. First, we need a data structure which efficiently and securely enables the verifier to sample blocks based on their relative difficulty positions, rather than their absolute positions as in the standard MMR. Second, Assumption 2, which states that the adversary’s forks have only a fraction of the honest chain’s weight, requires that all difficulty transitions are correct. In fact, as described in Section 3.1, the assumption is broken if the adversary can arbitrarily manipulate block difficulties. We show how an adapted variable-difficulty MMR which aggregates information about the total difficulty as well as difficulty transitions can resolve both issues at once.

Definition 7 (Difficulty MMR). A difficulty MMR is a variant of the MMR with identical properties, but such that each node contains additional data. Every node can be written as {h, w, t, Dstart, Dnext, n, data}, where h is a λ-bit output of a hash function, w is an integer representing a total difficulty, t is a timestamp, Dstart and Dnext are difficulty targets, n is an integer representing the size of the subtree rooted at the node, and data is an arbitrary string. In a block, i.e. a leaf node, n = 1 and t is the time it took to mine the current block (the block’s timestamp minus the previous block’s timestamp), w = Dstart is the current difficulty target, and Dnext is the difficulty of the next block computed using the difficulty adjustment function defined in Definition 1. Each non-leaf node is defined as {H(lc, rc), lc.w + rc.w, lc.t + rc.t, lc.Dstart, rc.Dnext, lc.n + rc.n, ⊥}, where lc = LeftChild and rc = RightChild.

Difficulty MMR Verification. We need to modify the MMR verification algorithm in several ways. The prover algorithm remains the same, generating a proof consisting of a Merkle path. In general the verifier will check the Merkle path and that the targets assigned to each node are feasible. For simplicity, we assume that the epoch length m and the total number of leaves n are powers of 2. Given a left child (lc), a right child (rc) and a parent node (p), the verifier performs the following checks:

  1. Compute p using lc and rc following Definition 7.
  2. Verify that lc.Dnext = rc.Dstart.
  3. For both lc and rc, verify that they are internally consistent. That is, ensure that there is a possible set of legal difficulty transitions given the aggregate timing and difficulty parameters of these nodes:
     • If the node is within an epoch, i.e. below level log2(m), ensure that the difficulty and weight are consistent with the epoch’s difficulty.
     • If the node captures a difficulty transition, ensure that Dnext is computed correctly using the difficulty transition function from Definition 1 and t.
     • Given tstart, tend, w, Dstart, Dnext: there is a possible assignment to the difficulty transitions yielding these parameters. See discussion below for details.

Our header MMR contains non-leaf nodes that are defined simply as -

H(lc, rc, pos) where lc = LeftChild and rc = RightChild
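For illustration, a minimal sketch of that construction (DefaultHasher is a stand-in for the Blake2b hashing we actually do via hash_with_index):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Current non-leaf node: commits only to the child hashes and the MMR
// position. No aggregate difficulty data is committed to, so w, Dstart
// and Dnext cannot be verified from a Merkle path alone.
fn parent_hash(lc: u64, rc: u64, pos: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (lc, rc, pos).hash(&mut hasher); // stand-in for H(lc, rc, pos)
    hasher.finish()
}
```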

To implement FlyClient such that we can efficiently verify difficulty transitions, the paper states that we need to redefine non-leaf nodes as -

{H(lc, rc), lc.w + rc.w, lc.t + rc.t, lc.Dstart, rc.Dnext, lc.n + rc.n, ⊥}

i.e. rather than simply storing intermediate hashes for non-leaf nodes, we store difficulty-related information representing contiguous lists of headers (periods of time) beneath subtrees of the header MMR. Otherwise there is no efficient way to determine, for example, w (the total difficulty of the set of headers beneath the subtree root), or Dstart and Dnext (the initial and next difficulty targets within the period beneath the subtree root).
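As a rough sketch, the redefined node and the first two verifier checks from the paper could look something like this (the struct, field names and the u64 stand-in hash are illustrative only, not Grin code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash combining two child hashes (Blake2b in practice).
fn hash2(a: u64, b: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (a, b).hash(&mut hasher);
    hasher.finish()
}

/// A difficulty MMR node per Definition 7 (`data` omitted, since it is
/// ⊥ for non-leaf nodes anyway).
#[derive(Clone, Copy, Debug, PartialEq)]
struct DiffNode {
    h: u64,       // node hash (λ-bit in the paper; u64 stand-in here)
    w: u64,       // total difficulty of all headers beneath this node
    t: u64,       // total time taken to mine those headers
    d_start: u64, // difficulty target at the start of the period
    d_next: u64,  // difficulty target immediately after the period
    n: u64,       // number of leaves (headers) beneath this node
}

/// Non-leaf construction:
/// {H(lc, rc), lc.w + rc.w, lc.t + rc.t, lc.Dstart, rc.Dnext, lc.n + rc.n, ⊥}
fn parent(lc: &DiffNode, rc: &DiffNode) -> DiffNode {
    DiffNode {
        h: hash2(lc.h, rc.h),
        w: lc.w + rc.w,
        t: lc.t + rc.t,
        d_start: lc.d_start,
        d_next: rc.d_next,
        n: lc.n + rc.n,
    }
}

/// Checks 1 and 2 of the verification quoted above. Check 3 (internal
/// consistency of the difficulty transitions) depends on the DAA and
/// is elided here.
fn verify_parent(lc: &DiffNode, rc: &DiffNode, p: &DiffNode) -> bool {
    parent(lc, rc) == *p && lc.d_next == rc.d_start
}
```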


We used to have the concept of a sumtree in the early days of Grin. This maintained both the hash and the sum of all commitments beneath a given subtree root in the output MMR. I believe the intent was to leverage the sum for UTXO-related calculations, but we stopped using this once it became apparent that the output MMR was only ever going to represent the full TXO set and not the UTXO (or STXO) set.

The approach discussed in the FlyClient paper is similar conceptually to our sumtree, but for difficulty related information.


lc.t + rc.t is confusing, given that t is a timestamp in each leaf node's block header. Presumably it would make more sense to store the delta (the time taken to mine a set of blocks) in non-leaf nodes?

n does not appear to be necessary given our MMR construction. We can determine n based on the MMR pos for any given node.
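For example (a sketch assuming the usual 1-based postorder position numbering in our MMR):

```rust
/// Height of the node at 1-based postorder MMR position `pos` (pos >= 1).
/// Leftmost nodes at height h sit at positions of the form 2^(h+1) - 1
/// (all ones in binary); any other node can jump to its leftmost peer at
/// the same height by subtracting 2^(bit_len - 1) - 1.
fn height(pos: u64) -> u64 {
    let mut p = pos;
    while (p + 1) & p != 0 {
        let bit_len = 64 - p.leading_zeros() as u64;
        p -= (1u64 << (bit_len - 1)) - 1;
    }
    (64 - p.leading_zeros() as u64) - 1
}

/// Number of leaves (headers) beneath the node at `pos`: a node at
/// height h roots a full binary subtree containing 2^h leaves.
fn leaves_under(pos: u64) -> u64 {
    1u64 << height(pos)
}
```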

Q) What information is required at each non-leaf node to allow verification of difficulty transitions for a non-leaf node (i.e. across an aggregate time period spanning multiple block headers, without access to the individual headers)?


TODO - If we want to support FlyClient in the future, and if this requires a consensus-breaking change to the way we calculate the header MMR root, do we want to consider making these changes sooner rather than later?

It is not sufficient to simply commit to difficulty information at the individual header leaf nodes. We need to be able to verify difficulty transitions at the aggregate non-leaf node level.


Related: https://github.com/mimblewimble/grin-rfcs/pull/61

tromp commented 4 years ago

The paper claims that "All of these checks ensure that the difficulty transitions of queried blocks are valid." This seems incorrect to me, unless they check every single difficulty adjustment, which they don't. How can they check that even two consecutive difficulty adjustments are correct without knowing the intermediate data? It looks like their verification only checks whether the difficulty adjustments are possible, given certain limits on how fast difficulty can change (at most a factor of 4 over 2016 blocks in Bitcoin). Furthermore, the proofs don't show why this need only be checked at doubling intervals. If that suffices, then only doing it between consecutive sampled points could suffice as well, which wouldn't need any additional data to be hashed at internal nodes. I contacted Benedikt to ask for clarification.
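To illustrate the distinction, a possibility check of this kind only bounds the endpoints of a period; a sketch using the Bitcoin example above (the function and numbers are illustrative only):

```rust
/// Bitcoin-style bound: each 2016-block epoch can change difficulty by
/// at most a factor of 4 in either direction. This checks only that
/// d_end is *reachable* from d_start across `epochs` retargets; it says
/// nothing about whether each individual retarget was computed
/// correctly, which would require the intermediate data.
fn transition_possible(d_start: f64, d_end: f64, epochs: i32) -> bool {
    let max_factor = 4.0_f64.powi(epochs);
    d_end <= d_start * max_factor && d_end >= d_start / max_factor
}
```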

antiochp commented 4 years ago

only checks whether the difficulty adjustments are possible, given certain limits on how fast difficulty can change

Yes, that was my interpretation also: that a path in a Merkle proof would include increasing sizes of periods, from a pair of headers at height 0 all the way up to the entire chain at the root, and that this difficulty adjustment check would be performed at each intermediate "aggregate" subtree.

Furthermore, the proofs don't show why this need only be checked at doubling intervals.

I was assuming it is not so much the doubling that is important, but inclusion under all ancestors above it in the MMR along the path to the root: that no ancestor on the Merkle proof path exhibits an (in aggregate) invalid difficulty adjustment.

Hopefully Benedikt gets back to us with some clarification/confirmation soon. In the meantime I have spent a bit of time looking at what would be involved in supporting "sumtree" ("difftree" here I guess) functionality in our MMR structures - maintaining summary data in addition to the hash at each parent in the MMR.

This may end up being useful for other things, regardless of the outcome here.

[WIP] PR coming shortly that explores this.

antiochp commented 4 years ago

at most a factor of 4 over 2016 blocks in Bitcoin

If we have the start time and the end time of the aggregate period (based on the first and last header timestamps), is this further constrained?

antiochp commented 4 years ago

WIP PR is here - https://github.com/mimblewimble/grin/pull/3480

Still working on getting this into a state where it compiles, but the basics are there and this approach appears to be doable. Might be a PITA to deal with pre- and post-HF for eternity though.

tromp commented 4 years ago

Please hold off on rewriting hash_with_index. I'm entirely unconvinced we need that. In my last email to Benedikt I wrote:

Wouldn't it suffice to check that the cumulative difficulty change between any pair of successive sample points is consistent with the time interval between them? With all your internal node checks, you seem to be checking pretty much the same thing, since there is always a largest (aligned) 2-power sized interval in between successive samples of which you only check the boundary condition.

(still no response)

antiochp commented 4 years ago

Related: Zcash ZIP 221 "FlyClient" - https://zips.z.cash/zip-0221

https://github.com/therealyingtong/zips/blob/master/zip-0221.rst https://github.com/zcash/zcash/pull/4264

antiochp commented 4 years ago

@tromp The more I think about this the more convinced I am that you are right to question this.

(I'm just repeating what we know here for my benefit and reference later).

In terms of sampling headers based on difficulty, inclusion proofs would suffice and we don't need aggregate data in the proof itself. i.e. To prove we are sampling a header at some difficulty (rather than at some specific pos) we can provide the header immediately above difficulty x and the header immediately below difficulty x, and show they are consecutive in terms of pos (so no other header exists closer to x). Proofs will be larger, as we need 2 consecutive headers per sampled difficulty.
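A sketch of that check (types and field names hypothetical; the inclusion proofs of both headers against the header MMR root are elided):

```rust
/// Hypothetical minimal view of a sampled header: its block height
/// (leaf index in the header MMR) and the cumulative total difficulty
/// of the chain up to and including it.
struct SampledHeader {
    height: u64,
    total_difficulty: u64,
}

/// Verify that the sampled difficulty `x` falls between two consecutive
/// headers, i.e. that no other header lies closer to x. Inclusion of
/// both headers under the header MMR root must be verified separately.
fn verify_difficulty_sample(x: u64, below: &SampledHeader, above: &SampledHeader) -> bool {
    above.height == below.height + 1
        && below.total_difficulty <= x
        && above.total_difficulty > x
}
```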

From the DAA RFC -

wtema is simpler to state, requires only the last 2 blocks, ...

Am I correct in assuming the new DAA will allow us to validate each sampled header's difficulty based on the 2 previous headers? So if we were to provide 3 headers per sampled difficulty, we could validate the difficulty of each individual sampled header?
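For reference, a wtema-style update in its general form depends only on the previous difficulty and the previous solvetime (the constants and exact form below are illustrative, not Grin consensus values):

```rust
const BLOCK_TIME: u64 = 60; // T: target block spacing in seconds
const WINDOW: u64 = 60;     // N: smoothing window in blocks (illustrative)

/// Sketch of a wtema-style update: d' = d * N*T / (N*T + t - T), where
/// t is the previous solvetime. Only 2 consecutive headers (previous
/// difficulty plus the timestamps giving t) feed the update.
fn wtema_next_difficulty(prev_diff: u64, prev_solvetime: u64) -> u64 {
    let nt = WINDOW * BLOCK_TIME;
    // nt + t - T stays positive for any t >= 0 since nt > BLOCK_TIME;
    // widen to u128 to avoid overflow in the multiplication.
    ((prev_diff as u128 * nt as u128) / (nt + prev_solvetime - BLOCK_TIME) as u128) as u64
}
```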

And then, as you mention above, consecutive sampled headers could be used to construct time periods across which we can validate boundary conditions on the overall difficulty adjustment for each period. Presumably, if we wanted more granularity, we could sample more headers? Or randomly sample headers based on difficulty as described in the FlyClient paper and then additionally sample midpoints between these (or some other subsampling approach)? Although any such approach can probably be simplified by just sampling more headers randomly?

The approach in the FlyClient paper, using nodes in the Merkle proof as doublings of the period, looks only at MMR position, and this feels like it conflicts with the need to look at total difficulty rather than position (due to varying difficulty over time). Looking at positions here feels somewhat arbitrary, simply because we happen to have them in the Merkle proofs.