Closed 1xstj closed 2 months ago
Fixed by calling grandpa.noteStalled
Hey @1xstj,
Have you got the root cause of this behaviour ? Which substrate version has been used in your testnet ?
We've faced the same issue few days ago on our humanode.io mainnet and would like to get the root cause. grandpa.noteStalled
has helped to continue finalization. But I don't think it's the fix.
@bkchr @davxy What do you think about this case ? Can you suggest smth to investigate the issue ? We would like to hear some thought from you, please.
2024-06-26 12:35:07 ♻️ Reorg on #8327330,0xa5ee…9c61 to #8327328,0x2669…84f8, common ancestor #8327327,0x9b9c…019b
2024-06-26 12:35:07 ✨ Imported #8327328 (0x2669…84f8)
2024-06-26 12:35:08 Encountered error finding best chain containing 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b: couldn't find target block: Chain lookup failed: Requested a finality target using a base 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b not in the best chain 0x2669f7c45d66835eac78f5e90c4a1a4e093c761d7f786bfb50c8920537c984f8
2024-06-26 12:35:08 Could not cast prevote: previously known block 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b has disappeared
2024-06-26 12:35:09 💤 Idle (1359 peers), best: #8327328 (0x2669…84f8), finalized #8327327 (0x9b9c…019b), ⬇ 3.5MiB/s ⬆ 438.3kiB/s
2024-06-26 12:35:12 ✨ Imported #8327329 (0x6dfe…d531)
2024-06-26 12:35:14 💤 Idle (1359 peers), best: #8327329 (0x6dfe…d531), finalized #8327327 (0x9b9c…019b), ⬇ 1.0MiB/s ⬆ 973.2kiB/s
2024-06-26 12:35:18 ✨ Imported #8327330 (0x56aa…28a2)
2024-06-26 12:35:19 💤 Idle (1359 peers), best: #8327330 (0x56aa…28a2), finalized #8327327 (0x9b9c…019b), ⬇ 158.8kiB/s ⬆ 33.4kiB/s
2024-06-26 12:35:24 💤 Idle (1359 peers), best: #8327330 (0x56aa…28a2), finalized #8327327 (0x9b9c…019b), ⬇ 35.2kiB/s ⬆ 9.3kiB/s
There is probably no easy fix for this. I can not recommend anything. If this is happening more often, it would be nice to find out how and why grandpa is voting this way, maybe it reveals the bug or not.
There is probably no easy fix for this. I can not recommend anything. If this is happening more often, it would be nice to find out how and why grandpa is voting this way, maybe it reveals the bug or not.
Got it, thanks
Anyway, I've analyzed reorg logs and found smth interesting, would like to know if it makes sense. The guys and our node has reorg to 2 blocks back. As a result, we got the situation of voting error that block has been disappeared.
The rest reorg logs contains the info, that reorg happens on the same block or 1 block forward. But not to the past.
The problem now occurs daily. We run the grandpa.noteStalled
to fix every time, but this is not a proper solution.
https://github.com/humanode-network/humanode/issues/1104 - tracking issue with more details
Please reopen, or should we start another issue here for this?
I invite parity team to look into this stuff on our end now - cause it seems like it will affect everyone at some point
Note that we have passed 1k validators recently, so it might be a high-end case that is not visible when on lower validator counts
@bkchr @davxy implemented an test that reproduce faced issue - https://github.com/humanode-network/humanode/issues/1104#issuecomment-2252375673.
Finalisation stopped for chain : https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Ftestnet-rpc-archive.tangle.tools#/explorer
The errors we see on validator nodes after restart is :
Telemetry : https://telemetry.polkadot.io/#list/0x3d22af97d919611e03bbcbda96f65988758865423e89b2d99547a6bb61452db3
Substrate version : v1.1.0
Also tried logs with
-lafg
as suggested in other issues, but same outputLogs before restart :
Update: Restarted validators with
pruning archive
and these are the logs now