paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0
8.39k stars 2.65k forks source link

Validators dropping peers and failing to gossip block production #11320

Open adamdossa opened 2 years ago

adamdossa commented 2 years ago

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

Hi Substrate Team,

We've been trying to debug the issue below for the last few days, but are hitting a wall. Any advice or tips you can provide would be very much appreciated.

There is a stack exchange issue which offers a high-level description below, but adding more detail here: https://substrate.stackexchange.com/questions/2166/validator-node-dropping-peers-after-producing-a-block-followed-by-a-reorg

For the last few days, around 1/3 of our validator nodes are scoring quite poorly in terms of reward points, missing many, but not all, or their block production slots (we use BABE + GRANDPA): https://mainnet-app.polymesh.network/#/staking

Looking at an individual poorly performing validator, we observe that they seem to produce a block, but that the block fails to propagate to the rest of the network, despite a reasonable initial peer count. Alongside this failure, peer counts tend to drop materially just after the block is produced and gossiped:

Apr 25 07:16:50 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:16:50 💤 Idle (37 peers), best: #2556897 (0x04de…82bb), finalized #2556895 (0xcf6d…6a77), ⬇ 123.2kiB/s ⬆ 151.5kiB/s
Apr 25 07:16:54 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:16:54 🙌 Starting consensus session on top of parent 0x04de4479769fb895cdaac629985ba1161ca1e3f326798973009ebd11579082bb
Apr 25 07:16:54 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:16:54 🎁 Prepared block for proposing at 2556898 [hash: 0x20769becac0a66900a5414ba276f82495f1c096a2af3037c5e5d86ef11ddfd5e; parent_hash: 0x04de…82bb; extrinsics (1): [0x3098…7607]]
Apr 25 07:16:54 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:16:54 🔖 Pre-sealed block for proposal at 2556898. Hash now 0x8cb6aab5906ce62841dc8c18bd3fd1ceb856eb967524987937a37cdc520590a8, previously 0x20769becac0a66900a5414ba276f82495f1c096a2af3037c5e5d86ef11ddfd5e.
Apr 25 07:16:54 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:16:54 ✨ Imported #2556898 (0x8cb6…90a8)
Apr 25 07:16:55 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:16:55 💤 Idle (37 peers), best: #2556898 (0x8cb6…90a8), finalized #2556895 (0xcf6d…6a77), ⬇ 100.5kiB/s ⬆ 132.0kiB/s
Apr 25 07:17:00 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:17:00 💤 Idle (37 peers), best: #2556898 (0x8cb6…90a8), finalized #2556895 (0xcf6d…6a77), ⬇ 20.1kiB/s ⬆ 64.7kiB/s
Apr 25 07:17:00 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:17:00 ✨ Imported #2556898 (0xc087…2639)
Apr 25 07:17:05 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:17:05 💤 Idle (9 peers), best: #2556898 (0x8cb6…90a8), finalized #2556895 (0xcf6d…6a77), ⬇ 10.9kiB/s ⬆ 57.6kiB/s
Apr 25 07:17:06 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:17:06 ♻️  Reorg on #2556898,0x8cb6…90a8 to #2556899,0xf011…2102, common ancestor #2556897,0x04de…82bb
Apr 25 07:17:06 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:17:06 ✨ Imported #2556899 (0xf011…2102)
Apr 25 07:17:10 polymesh-4.1.1-linux-amd64[12043]: 2022-04-25 07:17:10 💤 Idle (12 peers), best: #2556899 (0xf011…2102), finalized #2556897 (0x04de…82bb), ⬇ 14.0kiB/s ⬆ 54.2kiB/s

We turned on -l peerset=trace on one of the impacted nodes, and I am attaching a log file from that node. As an example, if you search for Pre-sealed block for proposal at 2603940 you can see an example of a block being produced, failing to propagate, and peer count dropping (not sure if this is a cause or effect).

Log file: https://drive.google.com/file/d/1YBOQccA7qmJWDMdi5oUu3lKh3W01KzDY/view?usp=sharing

Some puzzles:

Thanks, Adam

Steps to reproduce

This doesn't seem to be reproducible in isolation unfortunately.

Neopallium commented 2 years ago

We have seen that when the bad operator produces a block and broadcasts the peers. The peers begin validating the block and send a request to the operator for the block details, but the operator rejects the request Reason: Request refused.

Here are some log messages from another node (with -lsync flag). This is for a different block that the one from Adam's logs.

The node receives the block from the operator (19:14:30 is the correct time for this block):

2022-04-28 19:14:30.138 DEBUG tokio-runtime-worker sync: Pre-validating received block announcement 0xa1ddd181d516444f20dd86d5e45a2eb749d65fa7da57fb0f5de64e2657039073 with number 2593343 from 12D3KooWBTD1MjUQjb5wbuXR5zacGaWksrf4K4FK8SwbqT9Edmdx    
2022-04-28 19:14:30.139 TRACE tokio-runtime-worker sync: Finished block announce validation: Process { is_new_best: true, who: PeerId("12D3KooWBTD1MjUQjb5wbuXR5zacGaWksrf4K4FK8SwbqT9Edmdx"), announce: BlockAnnounce { header: Header { parent_hash: 0x77db0f424571ec5fdf5cc8ae554b3a8d9b15bbf89e0021bb735d2f86e369e9e7, number: 2593343, state_root: 0x14647ad2105a6303ef37e56a9bbee788c362da5d4865633072f58a8187564ec7, extrinsics_root: 0xee7cbe3434e596bd578ed0db1d4d33e69d43dc15901f70562b045fc00fbb0006, digest: Digest { logs: [DigestItem::PreRuntime([66, 65, 66, 69], [2, 5, 0, 0, 0, 217, 19, 103, 16, 0, 0, 0, 0]), DigestItem::Seal([66, 65, 66, 69], [210, 36, 72, 88, 37, 180, 67, 217, 25, 64, 24, 148, 142, 103, 60, 9, 133, 134, 91, 247, 164, 36, 140, 150, 235, 4, 147, 130, 2, 85, 136, 46, 27, 96, 207, 84, 244, 38, 144, 167, 180, 87, 174, 162, 202, 66, 175, 206, 75, 106, 66, 247, 180, 161, 157, 102, 219, 27, 242, 129, 27, 149, 181, 137])] } }, state: Some(Best), data: Some([]) } }    
2022-04-28 19:14:30.144 TRACE tokio-runtime-worker sync: Announce validation result is nothing

But when it sends a New block request to the operator, the request is refused and the connection is dropped:

2022-04-28 19:14:30.144 TRACE tokio-runtime-worker sync: New block request for 12D3KooWBTD1MjUQjb5wbuXR5zacGaWksrf4K4FK8SwbqT9Edmdx, (best:2593343, common:2593342) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Hash(0xa1ddd181d516444f20dd86d5e45a2eb749d65fa7da57fb0f5de64e2657039073), to: None, direction: Descending, max: Some(1) }    
2022-04-28 19:14:30.529 TRACE tokio-runtime-worker sync: Block request to peer PeerId("12D3KooWBTD1MjUQjb5wbuXR5zacGaWksrf4K4FK8SwbqT9Edmdx") failed: Refused.    
2022-04-28 19:14:30.529 TRACE tokio-runtime-worker sync: 12D3KooWBTD1MjUQjb5wbuXR5zacGaWksrf4K4FK8SwbqT9Edmdx disconnected