paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0
Finalization issue #2304

Status: Closed (xlc closed this issue 5 years ago)

xlc commented 5 years ago

Based on commit 7c6474663cdba40422760d21ae0119bfad425e40.

All nodes stopped finalizing at block 14046 or 14071. A new node fails to finalize anything at all.

Restarting doesn't help.

Is there anything we can do to diagnose the issue and resume finalization?

bkchr commented 5 years ago

Could you try to collect some logs with -lafg?

CC @andresilva @rphmeier

gguoss commented 5 years ago

There are 4 validators (3 at 14046, 1 at 14071). Maybe the afg votes of the 3 validators do not add up to more than 2/3 of the weight.

Maybe restarting all 4 validators would resume GRANDPA.

andresilva commented 5 years ago

@gguoss I'm assuming the authorities are stuck on different rounds. If you collect logs for the afg target, you should see messages stating which GRANDPA round each node is in. Can you also check whether the validator that finalized 14071 is connected to the other validators? I think that validator is on a later GRANDPA round than the others.

xlc commented 5 years ago

Logs from validators and some other nodes https://gist.github.com/xlc/82e9c35d95f9e400134de047d6dfea67

andresilva commented 5 years ago
logs-from-cennznet-validators-validator-0-in-cennznet-validators-validator-0-0.txt:2019-04-18 01:17:25.549 main DEBUG afg  Voter VALIDATOR_0 noting beginning of round (Round(551), SetId(0)) to network.
logs-from-cennznet-validators-validator-1-in-cennznet-validators-validator-1-0:2019-04-18 01:17:32.213 main DEBUG afg  Voter VALIDATOR_1 noting beginning of round (Round(550), SetId(0)) to network.
logs-from-cennznet-validators-validator-2-in-cennznet-validators-validator-2-0.txt:2019-04-18 01:17:16.724 main DEBUG afg  Voter VALIDATOR_2 noting beginning of round (Round(550), SetId(0)) to network.
logs-from-cennznet-validators-validator-3-in-cennznet-validators-validator-3-0:2019-04-18 01:17:25.558 main DEBUG afg  Voter VALIDATOR_3 noting beginning of round (Round(550), SetId(0)) to network.

So it seems that one of the validators progressed to the next round (maybe because the other authorities didn't see its vote), while the other authorities are stuck in round 550 and probably don't have threshold stake to finalize. To get finality started again, shut down all validators and copy the database from validator 0 into the other validators' nodes. That way, when you restart the nodes, they'll all be at round 551.
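For anyone who wants to repeat this round comparison on their own logs, here is a minimal sketch. It assumes the log lines contain the "noting beginning of round" message exactly as shown above. The names `latest_rounds`, `group_by_round` and `SAMPLE` are made up for illustration and are not part of Substrate.

```python
import re
from collections import defaultdict

# Matches the afg debug message quoted in this thread, e.g.
#   Voter VALIDATOR_0 noting beginning of round (Round(551), SetId(0)) to network.
ROUND_RE = re.compile(
    r"Voter (?P<voter>\S+) noting beginning of round "
    r"\(Round\((?P<round>\d+)\), SetId\((?P<set_id>\d+)\)\)"
)


def latest_rounds(lines):
    """Return {voter: (set_id, round)}, keeping the highest pair seen per voter."""
    latest = {}
    for line in lines:
        match = ROUND_RE.search(line)
        if match is None:
            continue  # not a round-start message
        key = (int(match["set_id"]), int(match["round"]))
        voter = match["voter"]
        if voter not in latest or key > latest[voter]:
            latest[voter] = key
    return latest


def group_by_round(latest):
    """Invert the mapping: {(set_id, round): sorted list of voters}."""
    groups = defaultdict(list)
    for voter, key in latest.items():
        groups[key].append(voter)
    return {key: sorted(voters) for key, voters in groups.items()}


# Sample input: the four messages from the logs above (file-name prefixes dropped).
SAMPLE = [
    "2019-04-18 01:17:25.549 main DEBUG afg  Voter VALIDATOR_0 noting beginning of round (Round(551), SetId(0)) to network.",
    "2019-04-18 01:17:32.213 main DEBUG afg  Voter VALIDATOR_1 noting beginning of round (Round(550), SetId(0)) to network.",
    "2019-04-18 01:17:16.724 main DEBUG afg  Voter VALIDATOR_2 noting beginning of round (Round(550), SetId(0)) to network.",
    "2019-04-18 01:17:25.558 main DEBUG afg  Voter VALIDATOR_3 noting beginning of round (Round(550), SetId(0)) to network.",
]

if __name__ == "__main__":
    groups = group_by_round(latest_rounds(SAMPLE))
    for (set_id, rnd), voters in sorted(groups.items()):
        print(f"set {set_id}, round {rnd}: {', '.join(voters)}")
    if len(groups) > 1:
        print("Voters are on different rounds.")
```

If the script reports more than one group, the voters disagree on the current round, which is the situation described above.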

We are working on improvements for situations like this, where finality can get stuck with a small number of validators (https://github.com/paritytech/substrate/commit/9631622fca89709914cec8a3680e5034c51b2519 was recently merged, which should help as well).

xlc commented 5 years ago

Thanks. I will upgrade the substrate version and do the fix next week and report the results here.

xlc commented 5 years ago

I tried copying the DB of validator 0 to the other validators and resetting all other nodes, and it breaks the connection somehow. Maybe this is related to #2335.

I am going to pull the latest Substrate, reset the testnet, and see if this happens again.

xlc commented 5 years ago

Not happening anymore.

xlc commented 5 years ago

It happened again. Please let me know if there is anything you need to diagnose this issue. The testnet is public now. Our telemetry server is not public, but we are going to migrate to the Polkadot one soon.

Our web UI: https://cennznet.js.org/cennznet-ui/
Our repo: https://github.com/cennznet/cennznet
Use --chain=rimu to join the Rimu testnet. It is also the default network, so not specifying a chain will join it as well.

Let me know if you need anything, like logs from our validators, or a validator seat.

xlc commented 5 years ago

Most likely fixed in new version.

badkk commented 4 years ago

Tried copy the db of validator 0 to other validators and reset all other nodes and it breaks the connection somehow. Maybe relates to #2335.

I am going to pull latest substrate and reset the testnet and see if this happens again.

Will this work? I am using rc5 and it happens again.