Is this the same or a similar error to https://github.com/paritytech/parity-ethereum/issues/9320 ?
@phahulin is this the same error as #9320 ? How do you see that "validator nodes would sometimes miss their turn" ?
I don't know if it's the same error as #9320 or not.
Since we use the AuRa PoA engine, validators are expected to produce blocks in round-robin fashion, in our case every 5 seconds. What we observe is that they sometimes miss their turn, so that after the Nth validator the next block is mined 10 seconds later by the (N+2)th validator. For example, take this block https://core-explorer.poa.network/block/4096058. It was mined at 2018-08-10T10:24:20 UTC, the previous block was correctly mined at 2018-08-10T10:24:15 (a 5 sec gap), but the next one was mined at 2018-08-10T10:24:30 (a 10 sec gap).
From what we can see in netstats, all validators are online and have enough peers. Also, it doesn't happen every round.
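For reference, a minimal sketch of the AuRa slot assignment being described (this is not Parity's actual code, and the numbers are only illustrative): time is divided into `step_duration`-second steps, and the validator at index `step % validator_count` is expected to seal that step's block, so a 10-second gap between blocks means exactly one slot was missed.

```rust
// Sketch of AuRa round-robin slot assignment (illustrative, not Parity's code).
fn expected_proposer_index(unix_time: u64, step_duration: u64, validator_count: u64) -> (u64, u64) {
    let step = unix_time / step_duration; // e.g. 5-second steps on POA
    (step, step % validator_count)        // round-robin over the validator set
}

fn main() {
    // 2018-08-10 10:24:20 UTC, the timestamp of block 4096058 from the example above.
    let (step, index) = expected_proposer_index(1_533_896_660, 5, 21);
    println!("step {} -> validator index {} of 21", step, index);
}
```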
Is there anything in the logs? You can probably figure out which validator failed to produce a block; can you check whether it's always the same one (very likely)? It could be a problem with a CPU/RAM/disk bottleneck.
No. Are there any specific logs (logging="???") we should be looking at?
In our Core network, what we see is that 15/21 validators sometimes miss blocks. They don't miss them regularly, but the combined effect is very noticeable.
Using a validator set contract, you can report the authorities that miss a block. This could be a good start. I'm not sure the -l poa flag will help, but you can give it a try (list of available logs).
We use on-chain governance to manage the validator list. But reporting won't be a good solution anyway, since almost all of them would get banned.
Reporting isn't banning. You need to monitor your network in some way (with your block explorer, for instance) and find out who misses their slot.
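A small sketch of the kind of monitoring being suggested (assuming the 5-second steps used on this network; not part of Parity): given the timestamps of two consecutive blocks, count how many steps in between had no block, i.e. how many validators missed their slot. Feeding every (parent, child) timestamp pair from the block explorer through this would show who skips and how often.

```rust
// Count how many AuRa steps between two consecutive blocks had no block.
fn skipped_steps(parent_timestamp: u64, timestamp: u64, step_duration: u64) -> u64 {
    let parent_step = parent_timestamp / step_duration;
    let step = timestamp / step_duration;
    step.saturating_sub(parent_step + 1) // 0 when blocks are exactly one step apart
}

fn main() {
    // Block 4096058 (10:24:20) and the next block (10:24:30) from the earlier
    // example: a 10-second gap means exactly one validator missed its slot.
    assert_eq!(skipped_steps(1_533_896_660, 1_533_896_670, 5), 1);
}
```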
Yeah, we do monitor. The problem is that about 15/21 started to miss their blocks after the upgrade.
@phahulin Could you grab logs with -lengine=trace? Although it probably won't help much. Can you relate the skipped authorities to any kind of CPU load on the node? Does it only happen on 1.11.8? Do you know if this also happens with 2.0.1?
Ok, will do. But it will take some time because we don't have direct access to the validators' nodes, so I'll have to ask them. Please don't close this issue yet.
@andresilva I got logs from one validator from our test network (sokol), taken with logging="engine=trace,poa=trace,miner=trace,reorg=trace,own_tx=trace,perf=trace,stats=trace".
Here is a piece of the log around the block that this node didn't produce. I formatted it a little bit and excluded the perf part, because it's very verbose but doesn't seem to be too informative. I can add it if you need it.
A bit of context: this node is validator 0x5b2c..., he didn't produce a block in this round. The validator before him didn't produce a block either.

Time | Step | Expected Val | Actual Block N | Block hash |
---|---|---|---|---|
00:03:30 | ...282 | 0xdb12... | 4033697 | 0x194b... |
00:03:35 | ...283 | 0xeD2A... | 4033698 | 0xe2a3... |
00:03:40 | ...284 | 0x4Ca0... | MISSED | - |
00:03:45 | ...285 | 0x5B2c... | MISSED | - |
00:03:50 | ...286 | 0x58ba... | 4033699 | 0x9e8a... |
From what I see, something strange happens with block 4033698: although it corresponds to 00:03:35, the message that it was imported only appears at 00:03:40. I can also add that I checked a few other nodes, and on all of them this block was imported at 00:03:40.
Then there is a message that validator 0x4Ca0... produced his block with a step gap (so he did produce it!), and then a reorg happens to 4033698 (how can reorgs happen in AuRa, btw?).
A message about "Fetching proposer for step ...285" (the step in which this validator should have produced a block) only appears at 00:03:50, while it should have appeared at 00:03:45.
I asked him to check whether the time is in sync on his node; this is the best guess I have so far. It doesn't explain, though, why he isn't constantly missing blocks, or when 4033698 was actually produced.
Indeed something strange happens at block #4033698, actually it is imported with a different hash first:
2018-08-15 00:03:40 UTC Verifier #1 INFO import Imported #4033698 0x5bbd…99a2 (2 txs, 0.04 Mgas, 1 ms, 0.78 KiB)
Re-orgs in Aura can happen if there's any kind of "partition" in how the nodes see the global state, which could be:
I would try to check that all the clocks are synchronized and that there isn't more than one authority online using the same key. Also, what version were you running before the upgrade? Just to see if I can try to relate this to any change that we might have done.
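To illustrate the clock-synchronization point, here is a sketch (assuming the 5-second steps used here; this is not Parity's code): each node derives the current step from its local clock, so a node running a few seconds behind still believes it is in the previous step and will start sealing its block late, while its peers already consider that slot over.

```rust
// A node's view of the current AuRa step, derived from its local clock.
fn current_step(local_unix_time: u64, step_duration: u64) -> u64 {
    local_unix_time / step_duration
}

fn main() {
    let real_time = 1_534_291_425;     // 2018-08-15 00:03:45 UTC, step ...285
    let lagging_clock = real_time - 3; // the same moment on a clock running 3 s behind
    // The lagging node still computes the previous step and won't seal yet.
    assert_eq!(current_step(real_time, 5), current_step(lagging_clock, 5) + 1);
}
```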
The previous parity version was 1.10.6
@andresilva, you mention three reasons that a "re-org" can happen. Could you give us a clue where in the code these decisions are being made, please?
Also, we are collecting logs that are very verbose, using
logging="engine=trace,poa=trace,miner=trace,reorg=trace,own_tx=trace,perf=trace,stats=trace"
Which ones can we eliminate while pursuing the debugging of these three scenarios?
Hi @andresilva,
I got logs from the validator node (0xed2a…1c36) that produced the weird blocks.
Here is the piece of the log that I think corresponds to the scenario above. This piece of log is, again, without perf logs.
However, in this case there are TONS of messages from perf, similar to:
2018-08-18 15:17:07 UTC IO Worker #3 TRACE perf queue_ancient_block: 0.04ms
Actually, about 96% of the log file consists of these messages. Also, this is weird:
$ grep 'import #' 0xed2a.log | head -n15
2018-08-18 15:17:23 UTC IO Worker #0 INFO import #50799 26/50 peers 4 MiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 744 µs
2018-08-18 15:17:59 UTC IO Worker #2 INFO import #64261 26/50 peers 5 MiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 565 µs
2018-08-18 15:18:33 UTC IO Worker #1 INFO import #20193 26/50 peers 5 MiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 4 req/s, 719 µs
2018-08-18 15:19:08 UTC IO Worker #1 INFO import #22098 26/50 peers 5 MiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 2 req/s, 513 µs
2018-08-18 15:19:38 UTC IO Worker #1 INFO import #22098 26/50 peers 5 MiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 529 µs
2018-08-18 15:20:13 UTC IO Worker #0 INFO import #22098 26/50 peers 9 MiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 2 req/s, 507 µs
2018-08-18 15:20:43 UTC IO Worker #3 INFO import #47483 26/50 peers 6 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 2911 µs
2018-08-18 15:21:18 UTC IO Worker #1 INFO import #55436 26/50 peers 78 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 547 µs
2018-08-18 15:21:48 UTC IO Worker #1 INFO import #55436 26/50 peers 79 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 524 µs
2018-08-18 15:22:18 UTC IO Worker #2 INFO import #55436 26/50 peers 173 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 542 µs
2018-08-18 15:22:53 UTC IO Worker #3 INFO import #48640 26/50 peers 176 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 23245 µs
2018-08-18 15:23:28 UTC IO Worker #2 INFO import #39367 26/50 peers 198 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 3 req/s, 67774 µs
2018-08-18 15:24:03 UTC IO Worker #1 INFO import #30719 26/50 peers 199 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 4 req/s, 82361 µs
2018-08-18 15:24:39 UTC IO Worker #1 INFO import #57897 26/50 peers 201 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 2 req/s, 42810 µs
2018-08-18 15:25:13 UTC IO Worker #2 INFO import #40736 26/50 peers 202 KiB chain 67 MiB db 0 bytes queue 14 MiB sync RPC: 0 conn, 2 req/s, 388 µs
Ancient block verification never completes. So the whole issue is probably related to another issue.
The expected order of validators is:

Time | Step | Validator | Actual Block |
---|---|---|---|
15:23:05 | 306921157 | 0xdb12…ea85 | 4093446 |
15:23:10 | 306921158 | 0xed2a…1c36 | 4093447 |
15:23:15 | 306921159 | 0x4ca0…46a0 | Discarded fork 4093447 ? |
15:23:20 | 306921160 | 0x5b2c…11e5 | Missed |
15:23:25 | 306921161 | 0x58ba…34cd | 4093448 |
The way I see it, 0xed2a…1c36 starts to produce a new block at the correct time:
2018-08-18 15:23:10 UTC IO Worker #3 TRACE engine Fetched proposer for step 306921158: 0xed2a…1c36
but it doesn't complete on time (probably because the node is busy constantly verifying ancient blocks).
As a consequence, at 15:23:16 it receives a new block #4093447, built by the next validator 0x4ca0…46a0 with a step gap:
2018-08-18 15:23:16 UTC Verifier #0 DEBUG engine Author 0x4ca0…46a0 built block with step gap. current step: 306921159, parent step: 306921157
Then, finally, it produces its own version of block #4093447:
2018-08-18 15:23:17 UTC IO Worker #3 TRACE engine generate_seal: Issuing a block for step 306921158.
2018-08-18 15:23:17 UTC IO Worker #3 TRACE miner Received a Regular seal.
which causes the network to abandon the previous version of #4093447 built by 0x4ca0…46a0 and adopt the one from 0xed2a…1c36:
2018-08-18 15:23:17 UTC IO Worker #3 INFO reorg Reorg to #4093447 0x3105…8229 (0xcb66…4903 #4093446 0xcbeb…269f )
2018-08-18 15:23:17 UTC IO Worker #3 TRACE engine Multi ValidatorSet retrieved for block 509355.
2018-08-18 15:23:17 UTC IO Worker #3 DEBUG engine Zooming to epoch for block 0x3105…8229
This makes validator 0x4ca0…46a0 effectively miss a block (is this a bug? shouldn't the network abandon the "late" version of #4093447?).
Bumping priority here.
@wkarshat Only double votes are handled explicitly (when using the Validator Set here), everything else is handled implicitly by chain difficulty.
@phahulin I think your assessment is correct: ancient block import is blocking the import of new blocks. Both processes require a write lock, at least on the database; I'll see if we can reduce the contention. I still don't understand why it keeps backtracking. Could you also try fetching trace logs for network and sync? It will be a lot of logs, but you can just dump them here (or contact me on gitter/riot) and I'll filter them on my side. I'm sorry, but this one I can't replicate easily since I can't seal blocks on the POA chain ;).
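A minimal illustration of the contention described above (generic Rust, not Parity's code): if ancient-block import holds a write lock on a shared resource for long stretches, the thread that imports new blocks (and seals our own) has to wait, which is enough to overshoot a 5-second AuRa slot.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let db = Arc::new(Mutex::new(Vec::<u64>::new()));

    let ancient_db = Arc::clone(&db);
    let ancient = thread::spawn(move || {
        let mut guard = ancient_db.lock().unwrap(); // long-held "write lock"
        guard.push(0);                              // simulate ancient-block writes...
        thread::sleep(Duration::from_millis(200));  // ...while still holding the lock
    });

    thread::sleep(Duration::from_millis(10));       // let the ancient import grab the lock
    let start = Instant::now();
    db.lock().unwrap().push(4_093_447);             // "new block" import has to wait
    println!("new-block import waited {:?}", start.elapsed());

    ancient.join().unwrap();
}
```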
The reason the chain re-orgs to the "late" block is that it was mined by the expected validator (no steps were missed); the chain difficulty formula gives a higher score to chains with fewer missed steps. If in the meantime a block #4093448 had been built on top of the "old" #4093447, it would have been kept and no re-org would have happened.
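A simplified sketch of the scoring idea just described (the constant and exact formula are illustrative, not Parity's): a block's difficulty is higher when fewer steps were skipped since its parent, so the fork sealed by the in-turn validator ends up with the higher score and wins the re-org, even if it arrives later.

```rust
// Illustrative AuRa-style chain score: fewer skipped steps => higher score.
fn block_score(parent_step: u64, step: u64) -> u64 {
    let skipped = step - parent_step - 1; // empty steps between parent and this block
    1_000_000 - skipped
}

fn main() {
    // Both candidate #4093447 blocks share parent #4093446 at step 306921157.
    let late_in_turn = block_score(306_921_157, 306_921_158);   // 0xed2a…, no gap
    let early_with_gap = block_score(306_921_157, 306_921_159); // 0x4ca0…, one-step gap
    assert!(late_in_turn > early_with_gap); // so the chain re-orgs to the in-turn block
}
```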
@andresilva I've sent you some logs on gitter; I'll try to gather some more and send them there too.
Hi @andresilva, I've sent you a link (on gitter) to logs with sync and network tracing enabled. It's about 6 hours long, and syncing restarts from the 0th block 44 times during that period.
Is this related to https://github.com/paritytech/parity-ethereum/pull/9531 ?
We're also seeing this issue on a chain configured similarly to poa.network (Aura with 5 sec blocktime): https://explorer.sigma1.artis.network/reorgs
Most nodes run parity 2.1.6-stable
Has this been fixed yet?
Yes, this seems to be addressed by https://github.com/paritytech/parity-ethereum/pull/9531
We run a PoA-based network (POA Core and POA Sokol). After upgrading to parity 1.11.8 we started to notice that validator nodes would sometimes miss their turn and not produce a block, which slows down the network. This seems to happen randomly to different validators. We tried restarting parity and upgrading to a larger node (more CPU/RAM), but the issue still persists.
Could someone provide any guidance on how to troubleshoot/debug this and identify the root cause?