paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/
1.81k stars 660 forks source link

Finalisation stopped : Encountered error finding best chain #3487

Closed 1xstj closed 2 months ago

1xstj commented 7 months ago

Finalisation stopped for chain : https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Ftestnet-rpc-archive.tangle.tools#/explorer

The errors we see on validator nodes after restart is :

2024-02-27 10:30:49.726  WARN tokio-runtime-worker grandpa: Encountered error finding best chain containing 0x4c68d1ba9612b87c63d9786547ce706e07080ba7905b6928d8080f0cdd787af4: couldn't find target block: Chain lookup failed: Requested a finality target using a base 0x4c68d1ba9612b87c63d9786547ce706e07080ba7905b6928d8080f0cdd787af4 not in the best chain 0x91c6e5a9de0926428d850503cb76014270190804656fda6d66eb644dbad009e4    
2024-02-27 10:30:49.727  WARN tokio-runtime-worker grandpa: Could not cast prevote: previously known block 0x4c68d1ba9612b87c63d9786547ce706e07080ba7905b6928d8080f0cdd787af4 has disappeared  

Telemetry : https://telemetry.polkadot.io/#list/0x3d22af97d919611e03bbcbda96f65988758865423e89b2d99547a6bb61452db3

Substrate version : v1.1.0

Also tried logs with -lafg as suggested in other issues, but same output

Logs before restart :

2024-02-27 05:42:28.344  INFO tokio-runtime-worker sc_informant: ♻️  Reorg on #651080,0x418d…1743 to #651078,0x5d4a…e6fa, common ancestor #651077,0x9a2d…09f2    
2024-02-27 05:42:28.344  INFO tokio-runtime-worker substrate: ✨ Imported #651078 (0x5d4a…e6fa) 

Update: Restarted validators with pruning archive and these are the logs now

2024-02-27 11:44:17.838 DEBUG tokio-runtime-worker grandpa: Completed round 2397, state = State { prevote_ghost: Some((0x4c68d1ba9612b87c63d9786547ce706e07080ba7905b6928d8080f0cdd787af4, 651078)), finalized: Some((0x9a2d0c115d0c91876d087a1bf0dc52913b59f1ca9c9f76490d4e26d93a3609f2, 651077)), estimate: Some((0x4c68d1ba9612b87c63d9786547ce706e07080ba7905b6928d8080f0cdd787af4, 651078)), completable: true }, step = None    
2024-02-27 11:44:17.838 DEBUG tokio-runtime-worker grandpa: Round 2397: prevotes: 534/500/749 weight, 534/749 actual    
2024-02-27 11:44:17.838 DEBUG tokio-runtime-worker grandpa: Round 2397: precommits: 534/500/749 weight, 534/749 actual  
1xstj commented 7 months ago

Fixed by calling grandpa.noteStalled

dmitrylavrenov commented 3 months ago

Hey @1xstj,

Have you got the root cause of this behaviour ? Which substrate version has been used in your testnet ?

We've faced the same issue few days ago on our humanode.io mainnet and would like to get the root cause. grandpa.noteStalled has helped to continue finalization. But I don't think it's the fix.

@bkchr @davxy What do you think about this case ? Can you suggest smth to investigate the issue ? We would like to hear some thought from you, please.

2024-06-26 12:35:07 ♻️  Reorg on #8327330,0xa5ee…9c61 to #8327328,0x2669…84f8, common ancestor #8327327,0x9b9c…019b    
2024-06-26 12:35:07 ✨ Imported #8327328 (0x2669…84f8)    
2024-06-26 12:35:08 Encountered error finding best chain containing 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b: couldn't find target block: Chain lookup failed: Requested a finality target using a base 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b not in the best chain 0x2669f7c45d66835eac78f5e90c4a1a4e093c761d7f786bfb50c8920537c984f8    
2024-06-26 12:35:08 Could not cast prevote: previously known block 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b has disappeared    
2024-06-26 12:35:09 💤 Idle (1359 peers), best: #8327328 (0x2669…84f8), finalized #8327327 (0x9b9c…019b), ⬇ 3.5MiB/s ⬆ 438.3kiB/s    
2024-06-26 12:35:12 ✨ Imported #8327329 (0x6dfe…d531)    
2024-06-26 12:35:14 💤 Idle (1359 peers), best: #8327329 (0x6dfe…d531), finalized #8327327 (0x9b9c…019b), ⬇ 1.0MiB/s ⬆ 973.2kiB/s    
2024-06-26 12:35:18 ✨ Imported #8327330 (0x56aa…28a2)    
2024-06-26 12:35:19 💤 Idle (1359 peers), best: #8327330 (0x56aa…28a2), finalized #8327327 (0x9b9c…019b), ⬇ 158.8kiB/s ⬆ 33.4kiB/s    
2024-06-26 12:35:24 💤 Idle (1359 peers), best: #8327330 (0x56aa…28a2), finalized #8327327 (0x9b9c…019b), ⬇ 35.2kiB/s ⬆ 9.3kiB/s
bkchr commented 3 months ago

There is probably no easy fix for this. I can not recommend anything. If this is happening more often, it would be nice to find out how and why grandpa is voting this way, maybe it reveals the bug or not.

dmitrylavrenov commented 3 months ago

There is probably no easy fix for this. I can not recommend anything. If this is happening more often, it would be nice to find out how and why grandpa is voting this way, maybe it reveals the bug or not.

Got it, thanks

Anyway, I've analyzed reorg logs and found smth interesting, would like to know if it makes sense. The guys and our node has reorg to 2 blocks back. As a result, we got the situation of voting error that block has been disappeared.

The rest reorg logs contains the info, that reorg happens on the same block or 1 block forward. But not to the past.

MOZGIII commented 2 months ago

The problem now occurs daily. We run the grandpa.noteStalled to fix every time, but this is not a proper solution. https://github.com/humanode-network/humanode/issues/1104 - tracking issue with more details Please reopen, or should we start another issue here for this?

I invite parity team to look into this stuff on our end now - cause it seems like it will affect everyone at some point

Note that we have passed 1k validators recently, so it might be a high-end case that is not visible when on lower validator counts

dmitrylavrenov commented 2 months ago

@bkchr @davxy implemented an test that reproduce faced issue - https://github.com/humanode-network/humanode/issues/1104#issuecomment-2252375673.