Finalization stalling issues 2024-07

MOZGIII commented 1 month ago

This is a tracking issue for the recurring finalization stalling incidents.

So far, here are the pointers we have.

Commits that are affecting the finalization directly:

https://github.com/paritytech/substrate/pull/13289 Should not the the cause though, as the issue is with the inability of grandpa to follow a reorg, which is cause by pruned blocks. Or, rather, it could be the cause, but this one is not really where the bug we'd like to fix is.
https://github.com/paritytech/substrate/pull/13323 - ~~could this be the reason?~~ no, this one works as intended
https://github.com/paritytech/substrate/pull/13364 - this looks like it directly messes with what's causing the bug, but more like it fixes the aftermath of the one above

Notable historical commits that might've affected this incident:

https://github.com/paritytech/substrate/pull/12567
https://github.com/paritytech/substrate/pull/11727 (unlikely though)
https://github.com/paritytech/substrate/pull/13216 (actually looks somewhat interesting)
https://github.com/paritytech/substrate/pull/13157 (maybe pinning the block that is used as a finality based is a good idea)

Probably irrelevant:

Related things from the future:

https://github.com/paritytech/polkadot-sdk/pull/3962 - new fork pruning algorithm - might work better, but not sure
https://github.com/paritytech/polkadot-sdk/issues/4903 - irrelevant for our issue, as this issue is with the new fork pruning algorithm mentioned above being too slow

~~The thoughts we have so far is that the issue is caused by the broken block pruning (which is meaningful when an unfortunate best block selection happens).~~ (UPD: the blocks are not really pruned, we can still request them from the API; this means this is a selection issue) Grandpa tries to finalize the block that is pruned and not the part of the best chain anymore; upon receiving the precommits it fails to recognize them because the block can't be resolved (it's pruned). This is quite odd, since the nodes operate with --blocks-pruning archive, so it must mean they haven't seen the block in question as is it no never to be pruned if seen - but in reality we see a the reorg happening from the block in question to a new one, so the old block is definitely a known one.

We are currently on substrate 0.9.40.

Mainnet encountered this at least three times lately:

MOZGIII commented 1 month ago

2024-07-26 02:31:13.467  WARN tokio-runtime-worker grandpa: Encountered error finding best chain containing 0xab3e065e87c2c2e3ac34cb07a0e05c1fa7dd7df2d782fee66a8c284bbd430fd0: couldn't find target block: Chain lookup failed: Requested a finality target using a base 0xab3e065e87c2c2e3ac34cb07a0e05c1fa7dd7df2d782fee66a8c284bbd430fd0 not in the best chain 0x325ab807d115074518d0c5bd764de875b0945d0166d0937753968f98b4c2805d

Explore-logs-2024-07-26 10_41_41.txt Explore-logs-2024-07-26 10_53_12.txt The logs from the node to hint what's going on

MOZGIII commented 1 month ago

Current plan for a workaround is to delay the finality for more than 2 blocks. Currently we are planning to delay the finality after by 5 7 blocks beyond the best block. This should help with reducing the chance of the reorg-caused race condition we are facing.

dmitrylavrenov commented 1 month ago

Added the test to reproduce the issue logic - https://github.com/humanode-network/polkadot-sdk/compare/master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

It shows the root cause that we face.

MOZGIII commented 1 month ago

We have analyzed the code, and it looks like the issue is with those two PRs:

We think that the fix that is added at https://github.com/paritytech/substrate/pull/13364 does not apply after the code added at https://github.com/paritytech/substrate/pull/13289 returns an error. We'll investigate further.

We are still on course to apply the workaround mentioned above as a hotfix, but the long term solution is also in sight.

MOZGIII commented 1 month ago

To elaborate more on https://github.com/paritytech/substrate/pull/13364:

LongestChain::finality_target will error out when the passed target block is not in the best chain (https://github.com/paritytech/substrate/pull/13289/files#diff-97e7e3622ec6f0b455d18c1bf8055ccf5cfd210efd50348df6b07fce4531baeeR115-R119). This also applies to the case where the block number of the target and best-chain block are the same.
The fixup code at https://github.com/paritytech/substrate/pull/13364/files#diff-5ecb4cf31ed389a0d12482137c8b234485b199c200b09446cf8876321527446aR1242 tries to use the best block as target block (best block to target block) but in practice that part of the fn is not reached due to how an error from LongestChain::finality_target above is handled.
As a result, the intended fix is not applied, and we see the error logged instead - but also the consensus stalls.

bkchr commented 1 month ago

Added the test to reproduce the issue logic - humanode-network/polkadot-sdk@master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

The test is succeeding?

dmitrylavrenov commented 1 month ago

Added the test to reproduce the issue logic - humanode-network/polkadot-sdk@master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

The test is succeeding?

Yep, we've implemented it in the way to verify our thoughts about the happened case.

dmitrylavrenov commented 1 month ago

We met the same error

    // `hashof8_a` should be finalized next based on last completed round data,
    // but it's not an valid finality target
    assert_matches!(
        env.select_chain.finality_target(hashof8_a, None).await.unwrap_err(),
        ConsensusError::ChainLookup(_)
    );

andresilva commented 1 month ago

I have only looked into this superficially, but I think @MOZGIII comment on the LongestChain::finality_target issue is correct. The root cause is that we shouldn't error when the given vote target is not in the best chain, otherwise we make it impossible for finality to proceed in case we have a round estimate that gets re-orged out. In principle what should happen is grandpa would finalize the fork that got re-orged out, and then the node would re-org again (after grandpa finalizes). I didn't think through all the consequences of such a change.

dmitrylavrenov commented 1 month ago

I have only looked into this superficially, but I think @MOZGIII comment on the LongestChain::finality_target issue is correct. The root cause is that we shouldn't error when the given vote target is not in the best chain, otherwise we make it impossible for finality to proceed in case we have a round estimate that gets re-orged out. In principle what should happen is grandpa would finalize the fork that got re-orged out, and then the node would re-org again (after grandpa finalizes). I didn't think through all the consequences of such a change.

The same thoughts we discussed with @MOZGIII

MOZGIII commented 1 month ago

@bkchr @andresilva should we expect any hotfix for this implemented at the parity side or rather work on this ourselves?

dmitrylavrenov commented 1 month ago

@bkchr @andresilva let me create PR of this test into our fork and describe it in more details

andresilva commented 1 month ago

https://github.com/paritytech/polkadot-sdk/pull/5153

I think this should fix it.

MOZGIII commented 1 month ago

https://github.com/paritytech/polkadot-sdk/pull/5153 looks good, thx!

This would still be risky to ship for us as it means the whole layer of logic dedicated to the grandpa-issued reorg was not used for a while now and may contain bugs or just be broken cause of being stale...

We'll think about how to test this. So far we have thought about replicating the scenario of having 1k+ validators on our load-test facilities - but in our prior load tests we haven't seen the finality stalls like we see in the live chain; that might be due to not running for enough time, or due to other conditions - so it unlikely we'll have an accurate test for this in the load testing environment.

Any thoughts on how to trigger this bug in the live-like environment?

andresilva commented 1 month ago

I think besides any significant changes to make sure the issue happens (e.g. changing vote logic to make sure validators vote on abandoned forks explicitly), the easiest way would be to make forks more likely to happen and hence make it more likely for the issue to surface. I think changing https://github.com/humanode-network/humanode/blob/master/crates/humanode-runtime/src/constants.rs#L49 to (1, 1) should make forks more prevalent, reducing block time might also help.

This shouldn't be changed on any live chain though, since these parameters can't be changed dynamically. You'd need to bootstrap a new test chain.

MOZGIII commented 1 month ago

Great idea! We will also change the block time from 6s to 1s for more stress...

Pretty sure it is fine, but just to double check - is it safe to ship https://github.com/humanode-network/humanode/commit/f5bccf0d13193e070aeef1472a2a4726e836cd93 ? Should not cause anything problematic afaik.

andresilva commented 1 month ago

Yes, don't see any problem with that (and should make the issue less likely to happen).

For the test above you can also remove the BeforeBestBlockBy rule to make sure the validators always vote for the latest block, which should make issues with reorgs even more likely!

MOZGIII commented 1 month ago

Great!

To summarize, we'll test your fix with the following:

(1, 1) primary at babe
1s block production time
removed BeforeBestBlockBy at voting rules

UPD: implemented at https://github.com/humanode-network/humanode/pull/1106

MOZGIII commented 1 month ago

Some new input, as the chain stalled again just now:

last finalized block 8745707 0x3c97148e16cc04c5df90b96f64f37d9dddf083468a6c4df2dd447f5396df58fd
epoch start 8745697 0xce0fac61989be9f69a27c5671e71b40f2adc579c231b933a613c5ab103fd5f7a (10 blocks away)
Reorg on #8745708,0x191a…c58d to #8745711,0xfb9c…21bc, common ancestor #8745707,0x3c97…58fd - to the future

MOZGIII commented 1 month ago

We did not push the fix just yet, only the bigger finalization delay

But we see this:

The issue seems to be that once the prevote for a round goes out it then can't progress and the grandpa stalls.

@andresilva @bkchr let me know if you need access to our infra

MOZGIII commented 1 month ago

Maybe some peers are not voting in the round because of the bug (and some do - like the one we have above), and thus the whole consensus stalls, as if the amount of online nodes is less that quorum...

Only 39% of the nodes have applied the finality delay update according to https://telemetry.humanode.io

andresilva commented 1 month ago

Yes, you'd need >66% of the nodes on the new version for it to effectively take effect.

MOZGIII commented 1 month ago

We are backporting the fix to our substrate fork; we are still working on reproducing the issue. @dmitrylavrenov will send an update when we have news on that front.

dmitrylavrenov commented 3 weeks ago

Status update

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Current plan is to run different load tests this week trying to reproduce faced issue.

bkchr commented 3 weeks ago

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Why not zombienet?

dmitrylavrenov commented 3 weeks ago

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Why not zombienet?

Will prepare the detailed answer a little bit later.

dmitrylavrenov commented 3 weeks ago

Status update

We were able to prepare a such config of our chain-load-test k8s-based tool that reproduces the exact issue we faced even with just 9 nodes.

Usually it requires less than 1 hour.

reproduced-grandpa-issue-logs-1.txt reproduced-grandpa-issue-logs-2.txt reproduced-grandpa-issue-logs.txt

Next step is preparing an image containing the fix and run tests again.

MOZGIII commented 3 weeks ago

On zombienet: in short, we tired it, and had issues with it.

The practical issues that we had was it rolled out too slow in our k8s testing environment, and took forever to deploy 1000 nodes (which we ran for testing some time ago). Also, we dig deeper and found that it ~~does not provide features besides reading the logs and running assertions on those logs~~, and even that was heavily unoptimized for the scale of testing we needed.

We also did look under the hood of zombienet, and found it substantially overengineering to our taste: supporting multiple execution environments (besides k8s) is just not necessary for a tool like this, in our collective opinion.

Ultimately, we decided to go agains fixing zombienet issues and make our own thing. It is very basic so far, but much more convenient for us to use, and with it we were actually able to reproduce the finality stalls. Not from the get go, and we'd probably be able alter a zombienet setup in a similar way - but with zombienet it would be unlikely that we'd be able to integrate our k8s-specific setup into the upstream. We are quite satisfied with the results so far, as we can proceed with the work on fixing this bug knowing we have means to reproduce the issue.

Compared to zombienet, our solution does not have the evens model just yet, and doesn't parse the logs - but we don't need it, although it will be not too difficult to add if we decide to.

We'll can do a showcase if you're interested, as it is in a private repo for now. We'll make it public soon (the soon tm kind of soon - no eta).

UPD: our initial motivation for trying zombienet was to run a test cluster with a lot of nodes (1000, 2000, 5000, ...) as we are preparing to grow our validators headcount.

dmitrylavrenov commented 2 weeks ago

On zombienet: in short, we tired it, and had issues with it.

The practical issues that we had was it rolled out too slow in our k8s testing environment, and took forever to deploy 1000 nodes (which we ran for testing some time ago). Also, we dig deeper and found that it ~does not provide features besides reading the logs and running assertions on those logs~, and even that was heavily unoptimized for the scale of testing we needed.

We also did look under the hood of zombienet, and found it substantially overengineering to our taste: supporting multiple execution environments (besides k8s) is just not necessary for a tool like this, in our collective opinion.

Ultimately, we decided to go agains fixing zombienet issues and make our own thing. It is very basic so far, but much more convenient for us to use, and with it we were actually able to reproduce the finality stalls. Not from the get go, and we'd probably be able alter a zombienet setup in a similar way - but with zombienet it would be unlikely that we'd be able to integrate our k8s-specific setup into the upstream. We are quite satisfied with the results so far, as we can proceed with the work on fixing this bug knowing we have means to reproduce the issue.

Compared to zombienet, our solution does not have the evens model just yet, and doesn't parse the logs - but we don't need it, although it will be not too difficult to add if we decide to.

We'll can do a showcase if you're interested, as it is in a private repo for now. We'll make it public soon (the soon tm kind of soon - no eta).

UPD: our initial motivation for trying zombienet was to run a test cluster with a lot of nodes (1000, 2000, 5000, ...) as we are preparing to grow our validators headcount.

I can add some important items from practical usage:

it was difficult to customize settings to make it working for networks with different SS58-prefix. Maybe it's already improved, but was needed to do some changes (the version I used in April-May) in the source code and rebuild by myself the tool.
It's not possible to rerun failed pods keeping running good ones. All time you need to rerun the network from clean. As a result, sometimes the test with 1000 nodes take almost forever as setup is terminated if an image for some pod were not able to be pulled for some networks reason.

dmitrylavrenov commented 2 weeks ago

Status update

@bkchr @andresilva we have great news! We were able to test fixed version where we faced the same issue. But the consensus wasn't stucked and continued working as usual! Tested twice. You can check the logs :)

grandpa-fix-logs.txt grandpa-fix-logs-1.txt

dmitrylavrenov commented 2 weeks ago

@MOZGIII I think we can close the issue and open the new one to backport the fix and ship to our mainnet?

andresilva commented 2 weeks ago

Thanks for the thorough testing 👌

MOZGIII commented 2 weeks ago

@MOZGIII I think we can close the issue and open the new one to backport the fix and ship to our mainnet?

Yes.

The investigation is now finished, and we were able to identify the bug, theorize and implement a fix, reproduce the bug reliably in a controlled environment, apply the fix and ensure the bug does not reproduce under the same conditions anymore, and that there are no other surprising effects from applying the fix.

We are now working on releasing the fix on the Humanode network.

Thanks everyone for taking part in this work!

humanode-network / humanode

Finalization stalling issues 2024-07 #1104