GRANDPA invariant violated (target > best)

davxy commented 1 year ago

May happen that THIS GRANDPA invariant constraining a vote target to be less than the current best is violated.

The bug triggers quite often so is worth fixing it ASAP

Analysis:

GRANDPA protocol needs to select what is the target block to vote on here
That block is selected to be above a given block (aka "base") passed as parameter
target is selected using select_chain.finality_target(base, None) here. Starting from the leaves this will pick the block that is farther from the base (optionally less than a max number, None is passed)
best is selected using select_chain.best_chain() that , according to the LongestChain implementation here then here it returns the best descendant of the best block (sorry for the recursion... 😃) according to the META info stored by the backend. In practice this contains a descendant of the best according to BABE primary/secondary block rule (not always corresponding to the longest chain).

I have a locked up DB (i.e. every time I restart the validator this terminates immediately) like the following:

    7---8---9---10---11
    ^       |         ^selected as target by select chain
    base    |
            +---10'
                 ^best acording to backend meta info

When I start the node the following error is then returned and grandpa terminates:

XXX DEBUG LOG target: 11, best: 10, base: 7
2023-01-27 15:15:54 GRANDPA voter error: safety invariant has been violated: SelectChain returned a finality target higher than its best block
2023-01-27 15:15:54 Essential task `grandpa-voter` failed. Shutting down service.

Potential solution

Obviously we have to prevent GRANDPA to vote for a chain that is not the best according to our node.

The finality_target should choose directly a chain that includes our best, thus should pick the longest chain that: contains the base AND the best

rphmeier commented 1 year ago

Polkadot uses a custom finality voting rule and select_chain implementation which is quite crucial not to break: https://github.com/paritytech/polkadot/blob/master/node/service/src/relay_chain_selection.rs and https://github.com/paritytech/polkadot/blob/master/node/core/chain-selection/src/lib.rs

https://github.com/paritytech/substrate/blob/248fdf0d4b5e3758cfdadb283b5eca5f0731e466/client/finality-grandpa/src/environment.rs#L1230 I don't fully understand why this check is required. If there is a maximum allowed block number then the SelectChain trait should accept this block number as a parameter to finality_target.

rphmeier commented 1 year ago

Obviously we have to prevent GRANDPA to vote for a chain that is not the best according to our node.

Not exactly sure what you mean by this. The SelectChain implementation is also part of the node. Which source of information is actually correct within the node and how do you know that for certain?

rphmeier commented 1 year ago

The finality_target should choose directly a chain that includes our best, thus should pick the longest chain that: contains the base AND the best

I think this in particular would break Polkadot. Forcing a vote on anything >= than the minimum possible block is just dead wrong and a bad API for the SelectChain. In particular, it gives us no option to actually select a chain...

In Polkadot we are happy to build on top of forks which contain un-approved parachain candidates but we definitely 100% don't want to finalize them. This is just one example. But the API needs to support such use-cases.

SelectChain is a very consensus critical interface.

It is critical that GRANDPA follows the outputs it gets from SelectChain
It is critical that the interface does not constrain behavior any more than the absolute minimum required by GRANDPA

davxy commented 1 year ago

Maybe I was not super clear describing the issue.

If I have to synthesize the issue in one sentence I would say: "the current way to choose the target (SelectChain::finality_target()) doesn't take in consideration what is the chain chosen by SelectChain::best_chain()"

The two functions may return blocks in two separate forks, we compare their heights and bail out if best > target.

IMO the hash returned by finality_target should be in the same chain of best_chain (or can we vote for a chain that is not the best according to best_chain()?) As far as I understood this is not correct.

davxy commented 1 year ago

Obviously we have to prevent GRANDPA to vote for a chain that is not the best according to our node.

Not exactly sure what you mean by this. The SelectChain implementation is also part of the node. Which source of information is actually correct within the node and how do you know that for certain?

With "best according to our node" I was referring to the value returned by SelectChain::best_chain(),

andresilva commented 1 year ago

The reason we call SelectChain::best_chain in GRANDPA is just to pass it along to any VotingRule that might want to further restrict the votes (e.g. we have a voting rule to constrain votes by N blocks from best). I think we should change the API such that in a single call we can get both the finality target and the best block (this also helps with a possible race condition of making two asynchronous calls separately). The semantics should stay the same and the voter should follow the decision to finalize the block that was yielded by SelectChain, but I think it's fair for the API to expect that both blocks are congruent, i.e. the finality target should be an ancestor of the best block (or the same block).

davxy commented 1 year ago

@andresilva I elaborated our discussion and if here our requirements are:

that target.number <= best.number
(this is the new requirement) target is in the same chain of best (according to the SelectChain::best_chain)

Can't we just start our finality target search from SelectChain::best_chain and eventually get one of its ancestors if there is a limit to respect? Obviously the lower bound is the passed block (aka the base) as before...

In short, requisite 2 above limits our choices to this chain only: base -> ..... -> best if I've not overlooked some detail, probably makes no sense to iterate over all the leaves.

rphmeier commented 1 year ago

target is in the same chain of best (according to the SelectChain::best_chain)

Is this actually achievable?

AFAIK target is set either as the last finalized block or is an overestimate of what could be finalized. This by definition can be conflicting with BABE or other chain selection rules.

We may require that target is in the same chain as best, but it should be done by changing the opinion of best (i.e. "breaking" BABE) rather than changing the definition of target (i.e. "breaking" GRANDPA)

davxy commented 1 year ago

It is achievable when the user decides to use LongestChain implementation of SelectChain (i.e. the "default" one that ships with Substrate).

Current authoring algorithms are using the same SelectChain that GRANDPA is using here.

They write what is their best in the backend (META column), this info is then fetched using self.backend.blockchain().info() in the LongestChain methods implementation.

This strategy is not something written in stone, but depends on the SelectChain implementation the user decides to use (e.g. Polkadot, as you know doesn't use LongestChain as is).

PR https://github.com/paritytech/substrate/pull/13289 is actually using this information to get a finalization target in the same chain of the best

bkchr commented 1 year ago

Can't we just start our finality target search from SelectChain::best_chain and eventually get one of its ancestors if there is a limit to respect? Obviously the lower bound is the passed block (aka the base) as before...

The linked pr is still missing the removal of best_chain then. Otherwise we can run into issues when there happens a reorg between calling finality_target and best_chain as already explained by rob.

paritytech / substrate

GRANDPA invariant violated (target > best) #13254

Potential solution