paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0

Validator Nodes crash with Essential task grandpa-voter failed. #6727

Closed wpank closed 4 years ago

wpank commented 4 years ago

Several validators in the Kusama Validator Lounge reported their nodes failing with logs like this:

jul 24 11:27:07 XXXXXX polkadot[14640]: 2020-07-24 11:27:07 💤 Idle (53 peers), best: #3308841 (0xbca0…81f4), finalized #3308838 (0x3b5d…54e4), ⬇ 1.5MiB/s
jul 24 11:27:08 XXXXXX polkadot[14640]: 2020-07-24 11:27:08 🙇 You are using the peer ID QmQZEm9395wgF3ztkfqwNgx3yrqsLEECYB3cd7mZDavyxX. This peer ID uses
jul 24 11:27:08 XXXXXX polkadot[14640]: 2020-07-24 11:27:08 🙇 You are using the peer ID QmPVbzttk7DbyMcAoo95PxNtsGAVq98uLivGZhTaB7Yy7M. This peer ID uses
jul 24 11:27:08 XXXXXX polkadot[14640]: 2020-07-24 11:27:08 🙇 You are using the peer ID QmQZEm9395wgF3ztkfqwNgx3yrqsLEECYB3cd7mZDavyxX. This peer ID uses
jul 24 11:27:08 XXXXXX polkadot[14640]: 2020-07-24 11:27:08 🙇 You are using the peer ID QmPVbzttk7DbyMcAoo95PxNtsGAVq98uLivGZhTaB7Yy7M. This peer ID uses
jul 24 11:27:10 XXXXXX polkadot[14640]: 2020-07-24 11:27:10 🔍 Discovered new external address for our node: /ip4/10.0.0.2/tcp/30333/p2p/12D3KooWSqJ3jxMv1K
jul 24 11:27:11 XXXXXX polkadot[14640]: 2020-07-24 11:27:11 🔍 Discovered new external address for our node: /ip4/10.156.1.51/tcp/30333/p2p/12D3KooWSqJ3jxM
jul 24 11:27:11 XXXXXX polkadot[14640]: 2020-07-24 11:27:11 Essential task grandpa-voter failed. Shutting down service.
jul 24 11:27:11 XXXXXX polkadot[14640]: Error: Input("Essential task failed.")

andresilva commented 4 years ago

Pasting my comments from riot:

I just confirmed that this can happen due to the changes introduced in the keystore API for remote signing (https://github.com/paritytech/substrate/pull/6178/files#diff-233fff7c89a6b4ca96ca4ee31c62088cL385). This would never fail previously. The GRANDPA voter didn't crash; it failed to sign a message and deliberately took the system down. I think this is the correct behavior: not being able to sign messages is a critical error, and it's better to take the node down than to pretend it is working.

To trigger the issue, I set up an invalid session key through the RPC API (I generated an sr25519 keypair and set it as the GRANDPA key, which is ed25519). IMO the root issue is that the key was accepted by the server when it should have been rejected; as it stands, we only run into the problem when we try to sign something.
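
A minimal sketch of the kind of up-front check described above: reject a session key at insertion time if its signature scheme does not match what the key type expects, instead of failing later at signing. The `gran`/ed25519 pairing comes from this thread; the other key type IDs in the table, the assumption that the caller declares the scheme, and every function and class name here are hypothetical and are not Substrate's keystore API.

```python
# Hypothetical table of key type IDs and the signature scheme each expects.
# Only the "gran" -> ed25519 pairing is stated in this thread; the other
# entries are assumptions added for illustration.
EXPECTED_SCHEME = {
    "gran": "ed25519",  # GRANDPA
    "babe": "sr25519",  # assumed
    "imon": "sr25519",  # assumed
    "audi": "sr25519",  # assumed
}


class KeySchemeMismatch(ValueError):
    """Raised when a key's scheme does not match its key type."""


def check_session_key(key_type: str, scheme: str) -> None:
    """Reject a key before it is stored, rather than at signing time.

    `scheme` is the scheme the key was generated with, as declared by
    the caller (an assumption of this sketch).
    """
    expected = EXPECTED_SCHEME.get(key_type)
    if expected is None:
        raise KeyError(f"unknown key type: {key_type!r}")
    if scheme.lower() != expected:
        raise KeySchemeMismatch(
            f"key type {key_type!r} expects {expected}, got {scheme.lower()}"
        )
```

With a check like this, the reproduction above (an sr25519 keypair set as the GRANDPA key) would be refused when the key is inserted, and the node would never reach the failing signing step.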

I have no clue how, or if, this is related to authority discovery. We need more info from the people who have had this error.

wpank commented 4 years ago

Also some more anecdotes:

How likely the last two are is very much up in the air.

andresilva commented 4 years ago

I think we can close this now. Seems to be fixed in v0.8.22 and we don't have new reports.

AuroraLantean commented 2 years ago

2022-02-15 23:39:58.380 INFO main sc_cli::runner: Substrate Node
2022-02-15 23:39:58.380 INFO main sc_cli::runner: ✌️ version 4.0.0-dev-b53da9f-x86_64-linux-gnu
2022-02-15 23:39:58.380 INFO main sc_cli::runner: ❤️ by Substrate DevHub https://github.com/substrate-developer-hub, 2017-2022
2022-02-15 23:39:58.380 INFO main sc_cli::runner: 📋 Chain specification: My Custom Testnet
2022-02-15 23:39:58.380 INFO main sc_cli::runner: 🏷 Node name: MyNode01
2022-02-15 23:39:58.380 INFO main sc_cli::runner: 👤 Role: AUTHORITY
2022-02-15 23:39:58.380 INFO main sc_cli::runner: 💾 Database: RocksDb at /tmp/node01/chains/local_testnet/db/full
2022-02-15 23:39:58.380 INFO main sc_cli::runner: ⛓ Native runtime: node-template-100 (node-template-1.tx1.au1)
2022-02-15 23:39:58.645 WARN main sc_service::config: Using default protocol ID "sup" because none is configured in the chain specs
2022-02-15 23:39:58.646 INFO main sub-libp2p: 🏷 Local node identity is: 12D3KooWB7x4EruG852DyxwigtTDsJ82wMV7c1uG5waJE2VxMcpo
2022-02-15 23:39:58.647 INFO main sc_service::builder: 📦 Highest known block at #0
2022-02-15 23:39:58.648 INFO tokio-runtime-worker substrate_prometheus_endpoint: 〽️ Prometheus exporter started at 127.0.0.1:9615
2022-02-15 23:39:58.650 INFO main parity_ws: Listening for new connections on 127.0.0.1:9945.
2022-02-15 23:39:59.326 ERROR tokio-runtime-worker afg: GRANDPA voter error: Signing("Failed to sign GRANDPA vote for round 1 targetting 0x3d3b1dae280f17e0f419618adee764457579da9dc2050250f4ec914cc3e27323")
2022-02-15 23:39:59.326 ERROR tokio-runtime-worker sc_service::task_manager: Essential task grandpa-voter failed. Shutting down service.
Error: Service(Other("Essential task failed."))

I was following this tutorial: https://docs.substrate.io/tutorials/v3/private-network/ The private network failed to run after inserting a GRANDPA key...
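
For triage, the signing failure in a log like the one above can be picked out mechanically. This is a rough sketch that assumes the message wording matches the log line pasted here ("Failed to sign GRANDPA vote for round N targetting 0x..."); the helper and type names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern based on the error line pasted in this thread; the spelling
# "targetting" is copied from the log output as-is.
SIGNING_FAILURE = re.compile(
    r"Failed to sign GRANDPA vote for round (?P<round>\d+) "
    r"targetting (?P<target>0x[0-9a-fA-F]+)"
)


class SigningFailure(NamedTuple):
    round_number: int
    target: str


def find_signing_failure(line: str) -> Optional[SigningFailure]:
    """Return the round and target hash if `line` reports a failed vote signature."""
    match = SIGNING_FAILURE.search(line)
    if match is None:
        return None
    return SigningFailure(int(match.group("round")), match.group("target"))
```

A failure reported for round 1 right after startup, as in this log, means the node never managed to sign a vote with the key it holds, which fits the key-related causes discussed in this thread.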

techwizard210 commented 2 years ago

Everyone. Please help me.

When I run my Substrate node with the command "./target/release/node-template --dev --ws-external --rpc-external", the server works well and starts generating blocks at first. But after a few hours, the server stops and block generation stops with it. How can I solve this problem? Here is the terminal error. Best regards.

artlotz commented 2 years ago

I also had a similar error.

After I deleted the chain data under --base-path, the node ran normally.

The command below may help.

./target/release/node-template purge-chain --base-path /tmp/node01 --chain local -y

/tmp/node01 <-- --base-path value used when starting the node

In addition, if the password that was used when generating the Sr25519 and Ed25519 keys is entered incorrectly when starting the node, the following error occurs:

GRANDPA voter error: could not sign outgoing message: Failed to sign GRANDPA vote for round 1 targetting

bellaj commented 2 years ago

If you use Sr25519 instead of Ed25519 to generate the GRANDPA keys, you will get this error.