Closed by xlc 5 years ago
Could you try to collect some logs with -lafg?
CC @andresilva @rphmeier
4 validators (3 validators at 14046, 1 validator at 14071); maybe the 3 validators' afg votes are not > 2/3 of the weight.
Maybe restarting all 4 validators would resume GRANDPA.
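The "> 2/3 weight" condition can be sketched as follows. This is a simplified model assuming equal voter weights, not the actual Substrate implementation: GRANDPA needs votes from strictly more than 2/3 of the total weight, in the same round, to finalize.

```python
# Sketch: with n equal-weight voters, GRANDPA tolerates f = (n - 1) // 3
# faulty voters and needs n - f votes (strictly more than 2/3 of n)
# in the SAME round to finalize.
def supermajority_threshold(n_voters: int) -> int:
    f = (n_voters - 1) // 3
    return n_voters - f

# With 4 validators, 3 votes are required in one round; 3 voters stuck
# in round 550 plus 1 voter in round 551 cannot finalize together.
print(supermajority_threshold(4))  # 3
```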
@gguoss I'm assuming the authorities are stuck on different rounds; if you collect logs for the afg target you should see some messages saying which GRANDPA round the nodes are in. Can you also check if the validator that finalized 14071 is connected to the other validators? I think that validator is on a later GRANDPA round than the other validators.
Logs from validators and some other nodes https://gist.github.com/xlc/82e9c35d95f9e400134de047d6dfea67
logs-from-cennznet-validators-validator-0-in-cennznet-validators-validator-0-0.txt:2019-04-18 01:17:25.549 main DEBUG afg Voter VALIDATOR_0 noting beginning of round (Round(551), SetId(0)) to network.
logs-from-cennznet-validators-validator-1-in-cennznet-validators-validator-1-0:2019-04-18 01:17:32.213 main DEBUG afg Voter VALIDATOR_1 noting beginning of round (Round(550), SetId(0)) to network.
logs-from-cennznet-validators-validator-2-in-cennznet-validators-validator-2-0.txt:2019-04-18 01:17:16.724 main DEBUG afg Voter VALIDATOR_2 noting beginning of round (Round(550), SetId(0)) to network.
logs-from-cennznet-validators-validator-3-in-cennznet-validators-validator-3-0:2019-04-18 01:17:25.558 main DEBUG afg Voter VALIDATOR_3 noting beginning of round (Round(550), SetId(0)) to network.
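A quick way to compare rounds across validators is to extract the "noting beginning of round" lines. A minimal sketch, with the message format taken from the log lines above (not an official Substrate tool):

```python
import re

# Matches the afg "noting beginning of round" debug lines shown above.
ROUND_RE = re.compile(
    r"Voter (\S+) noting beginning of round \(Round\((\d+)\), SetId\((\d+)\)\)"
)

def rounds_by_voter(log_lines):
    """Return a dict mapping voter name -> latest observed GRANDPA round."""
    rounds = {}
    for line in log_lines:
        m = ROUND_RE.search(line)
        if m:
            rounds[m.group(1)] = int(m.group(2))
    return rounds

logs = [
    "DEBUG afg Voter VALIDATOR_0 noting beginning of round (Round(551), SetId(0)) to network.",
    "DEBUG afg Voter VALIDATOR_1 noting beginning of round (Round(550), SetId(0)) to network.",
]
print(rounds_by_voter(logs))  # {'VALIDATOR_0': 551, 'VALIDATOR_1': 550}
```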
So it seems that one of the validators progressed to the next round (maybe because the other authorities didn't see its vote), while the other authorities are stuck in round 550 and probably don't have threshold stake to finalize. What you should do to get finality started again is disable all validators and copy the database from validator 0 into the other validators' nodes, this way when you restart the nodes they'll all be at round 551.
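The copy step could be scripted roughly like this. The paths are assumptions for illustration; the actual database location depends on your --base-path and chain id, and all nodes must be stopped first:

```python
import shutil
from pathlib import Path

# Hypothetical paths: adjust to your --base-path and chain id.
SOURCE_DB = Path("validator-0/chains/dev/db")
TARGETS = [Path(f"validator-{i}/chains/dev/db") for i in (1, 2, 3)]

def sync_db(source: Path, targets) -> None:
    """Replace each target node's database with a copy of the source db.
    Stop every node before running this."""
    for target in targets:
        if target.exists():
            shutil.rmtree(target)        # drop the stale database
        shutil.copytree(source, target)  # copy validator 0's database
```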
We are working on improvements to fix these situations where it can get stuck with a small number of validators (https://github.com/paritytech/substrate/commit/9631622fca89709914cec8a3680e5034c51b2519 was recently merged, which should help as well).
Thanks. I will upgrade the substrate version and do the fix next week and report the results here.
Tried copying the db of validator 0 to the other validators and resetting all other nodes, but it breaks the connection somehow. Maybe related to #2335.
I am going to pull the latest substrate, reset the testnet, and see if this happens again.
Not happening anymore.
It happened again. Please let me know if there is anything you need to diagnose this issue. The testnet is public now. Our telemetry server is not public, but we are going to migrate to the Polkadot one soon.
Our web UI: https://cennznet.js.org/cennznet-ui/
Our repo: https://github.com/cennznet/cennznet
Use --chain=rimu to join the Rimu testnet. It is also the default network, so not specifying a chain will join it as well.
Let me know if you need anything, like logs from our validators, or a validator seat.
Most likely fixed in the new version.
Will this be workable? I am using rc5 and it happened again.
Based on 7c6474663cdba40422760d21ae0119bfad425e40
All nodes stopped finalization at 14046 / 14071. A new node failed to finalize at all.
Restarting doesn't help.
Is there anything we can do to diagnose the issue and resume the finalization process?