openethereum / parity-ethereum

The fast, light, and robust client for Ethereum-like networks.

After upgrade to 1.11.8 in PoA-network validators miss blocks #9323

Closed: phahulin closed this issue 5 years ago

phahulin commented 6 years ago

I'm running:

  • Which Parity version?: 1.11.8
  • Which operating system?: Linux
  • How installed?: via binaries
  • Are you fully synchronized?: yes
  • Which network are you connected to?: core
  • Did you try to restart the node?: yes

We run a PoA-based network (POA Core and POA Sokol). After upgrading to parity 1.11.8, we started to notice that validator nodes would sometimes miss their turn and not produce a block, which slows down the network. This seems to happen randomly to different validators. We tried restarting parity and upgrading to a larger node (more CPU/RAM), but the issue persists.

Could someone provide any guidance on how to troubleshoot/debug this and identify the root cause?

wkarshat commented 6 years ago

Is this the same as, or similar to, the error in https://github.com/paritytech/parity-ethereum/issues/9320?

Tbaut commented 6 years ago

@phahulin is this the same error as #9320? How do you see that "validator nodes would sometimes miss their turn"?

phahulin commented 6 years ago

I don't know if it's the same error as #9320 or not.

Since we use the AuRa PoA engine, validators are expected to produce blocks in round-robin fashion, in our case every 5 seconds. What we observe is that they sometimes miss their turn, so that after the Nth validator the next block is mined 10 seconds later by the (N+2)th validator. For example, take block https://core-explorer.poa.network/block/4096058: it was mined at 2018-08-10T10:24:20 UTC, the previous block was correctly mined at 2018-08-10T10:24:15 (5 sec gap), but the next was mined at 2018-08-10T10:24:30 (10 sec gap).

From what we can see from netstats, all validators are online and have enough peers. Also, it doesn't happen each round.
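
For context on the round-robin scheduling described above: AuRa derives the current step from wall-clock time and the step duration, and picks the proposer for a step by indexing into the validator list, which is why a missed turn shows up as a multiple-of-5-second gap between blocks. Below is a minimal sketch of that selection with hypothetical names (not Parity's actual code), assuming the 5-second step duration and the truncated validator addresses quoted later in this thread:

```rust
// Simplified illustration of AuRa round-robin scheduling (hypothetical names,
// not Parity's actual implementation).
use std::time::{SystemTime, UNIX_EPOCH};

const STEP_DURATION_SECS: u64 = 5; // POA Core / Sokol step duration

// The current step is derived from wall-clock time, so clock skew directly
// shifts which step (and therefore which proposer) a node believes is current.
fn current_step() -> u64 {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before UNIX epoch")
        .as_secs();
    now / STEP_DURATION_SECS
}

// The proposer for a step is the validator at index (step mod validator count).
fn expected_proposer<'a>(step: u64, validators: &'a [&'a str]) -> &'a str {
    validators[(step % validators.len() as u64) as usize]
}

fn main() {
    let validators = ["0xdb12…", "0xeD2A…", "0x4Ca0…", "0x5B2c…", "0x58ba…"];
    let step = current_step();
    println!(
        "step {} -> expected proposer {}",
        step,
        expected_proposer(step, &validators)
    );
}
```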

Tbaut commented 6 years ago

From what we can see from netstats, all validators are online and have enough peers.

Is there anything in the logs? You can probably figure out which validator failed to produce a block; can you check whether it's always the same one (very likely)? It could be a CPU/RAM/disk bottleneck.

phahulin commented 6 years ago

No. Are there any specific logs (logging="???") we should be looking at? In our Core network, what we see is that 15/21 validators sometimes miss blocks. They don't miss them regularly, but the combined effect is very noticeable.

Tbaut commented 6 years ago

Using a validator set contract, you can report the authorities that miss blocks. That could be a good start. I'm not sure the -l poa flag will help, but you can give it a try (list of available logs).

phahulin commented 6 years ago

We use on-chain governance to manage the validator list. But it won't be a good solution anyway, since almost all of them would get banned.

Tbaut commented 6 years ago

Reporting isn't banning. You need to monitor your network in some way (with your block explorer, for instance) and find out who misses their slot.

phahulin commented 6 years ago

Yeah, we do monitor. The problem is that about 15/21 validators started to miss their blocks after the upgrade.

andresilva commented 6 years ago

@phahulin Could you grab logs with -lengine=trace? Although it probably won't help much. Can you relate the skipped authorities to any kind of CPU load on the node? Does it only happen on 1.11.8? Do you know if this also happens with 2.0.1?

phahulin commented 6 years ago

OK, will do. But it will take some time because we don't have direct access to the validator nodes, so I'll have to ask them. Please don't close this issue yet.

phahulin commented 6 years ago

@andresilva I got logs from one validator from our test network (Sokol), taken with logging="engine=trace,poa=trace,miner=trace,reorg=trace,own_tx=trace,perf=trace,stats=trace". Here is a piece of the log around the block that this node didn't produce. I formatted it a little bit and excluded the perf part, because it's very verbose but doesn't seem too informative. I can add it if you need it.

A bit of context:

  1. This validator is 0x5b2c...; it didn't produce a block in this round. The validator before it didn't produce a block either.
  2. Our step duration is 5 sec, so there are 5 time slots to produce blocks in this piece of log, but only 3 blocks were produced:
Time      Step    Expected validator  Actual block N  Block hash
00:03:30  ...282  0xdb12...           4033697         0x194b...
00:03:35  ...283  0xeD2A...           4033698         0xe2a3...
00:03:40  ...284  0x4Ca0...           MISSED          -
00:03:45  ...285  0x5B2c...           MISSED          -
00:03:50  ...286  0x58ba...           4033699         0x9e8a...

From what I see, something strange happens with block 4033698: although it corresponds to 00:03:35, the message that it was imported only appears at 00:03:40. I can also add that I checked a few other nodes, and on all of them this block was imported at 00:03:40.

Then there is a message that validator 0x4Ca0... produced its block with a step gap (so it did produce it!), then a reorg to 4033698 happens (how can reorgs happen in AuRa, by the way?). The message "Fetching proposer for step ...285" (the step in which this validator should have produced a block) only appears at 00:03:50, while it should have appeared at 00:03:45.

I asked the validator to check whether the time is in sync on their node; that's the best guess I have so far. It doesn't explain, though, why they're not constantly missing blocks, or when 4033698 was actually produced.

andresilva commented 6 years ago

Indeed, something strange happens at block #4033698; it is actually imported with a different hash first:

2018-08-15 00:03:40 UTC Verifier #1 INFO import Imported #4033698 0x5bbd…99a2 (2 txs, 0.04 Mgas, 1 ms, 0.78 KiB)

Re-orgs in Aura can happen if there's any kind of "partition" in how the nodes see the global state, that could be:

I would try to check that all the clocks are synchronized and that there isn't more than one authority online using the same key. Also, what version were you running before the upgrade? Just so I can try to relate this to any change we might have made.

phahulin commented 6 years ago

The previous parity version was 1.10.6

wkarshat commented 6 years ago

@andresilva, you mention three reasons a "re-org" can happen. Could you give us a clue where in the code these decisions are made, please? Also, we are collecting very verbose logs using logging="engine=trace,poa=trace,miner=trace,reorg=trace,own_tx=trace,perf=trace,stats=trace". Which ones can we eliminate while pursuing the debugging of these three scenarios?

phahulin commented 6 years ago

Hi @andresilva

I got logs from the validator node (0xed2a…1c36) that produced the weird blocks. Here is the piece of the log that I think corresponds to the scenario above. This piece of the log is, again, without perf logs.

However, in this case there are tons of messages from perf, similar to:

2018-08-18 15:17:07 UTC IO Worker #3 TRACE perf  queue_ancient_block: 0.04ms

Actually, about 96% of the log file consists of these messages. Also, this is weird:

$ grep 'import      #' 0xed2a.log  | head -n15
2018-08-18 15:17:23 UTC IO Worker #0 INFO import      #50799   26/50 peers      4 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 744 µs
2018-08-18 15:17:59 UTC IO Worker #2 INFO import      #64261   26/50 peers      5 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 565 µs
2018-08-18 15:18:33 UTC IO Worker #1 INFO import      #20193   26/50 peers      5 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  4 req/s, 719 µs
2018-08-18 15:19:08 UTC IO Worker #1 INFO import      #22098   26/50 peers      5 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  2 req/s, 513 µs
2018-08-18 15:19:38 UTC IO Worker #1 INFO import      #22098   26/50 peers      5 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 529 µs
2018-08-18 15:20:13 UTC IO Worker #0 INFO import      #22098   26/50 peers      9 MiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  2 req/s, 507 µs
2018-08-18 15:20:43 UTC IO Worker #3 INFO import      #47483   26/50 peers      6 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 2911 µs
2018-08-18 15:21:18 UTC IO Worker #1 INFO import      #55436   26/50 peers     78 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 547 µs
2018-08-18 15:21:48 UTC IO Worker #1 INFO import      #55436   26/50 peers     79 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 524 µs
2018-08-18 15:22:18 UTC IO Worker #2 INFO import      #55436   26/50 peers    173 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 542 µs
2018-08-18 15:22:53 UTC IO Worker #3 INFO import      #48640   26/50 peers    176 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 23245 µs
2018-08-18 15:23:28 UTC IO Worker #2 INFO import      #39367   26/50 peers    198 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  3 req/s, 67774 µs
2018-08-18 15:24:03 UTC IO Worker #1 INFO import      #30719   26/50 peers    199 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  4 req/s, 82361 µs
2018-08-18 15:24:39 UTC IO Worker #1 INFO import      #57897   26/50 peers    201 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  2 req/s, 42810 µs
2018-08-18 15:25:13 UTC IO Worker #2 INFO import      #40736   26/50 peers    202 KiB chain   67 MiB db  0 bytes queue   14 MiB sync  RPC:  0 conn,  2 req/s, 388 µs

Ancient block verification never completes, so the whole issue is probably related to another issue.

The expected order of validators is:

Time      Step       Validator    Actual block
15:23:05  306921157  0xdb12…ea85  4093446
15:23:10  306921158  0xed2a…1c36  4093447
15:23:15  306921159  0x4ca0…46a0  Discarded fork 4093447?
15:23:20  306921160  0x5b2c…11e5  Missed
15:23:25  306921161  0x58ba…34cd  4093448

The way I see it, 0xed2a…1c36 starts to produce a new block at the correct time:

2018-08-18 15:23:10 UTC IO Worker #3 TRACE engine  Fetched proposer for step 306921158: 0xed2a…1c36

but it doesn't complete on time (probably because the node is busy constantly verifying ancient blocks).

As a consequence, at 15:23:16 it receives a new block #4093447 built by the next validator 0x4ca0…46a0 with a step gap:

2018-08-18 15:23:16 UTC Verifier #0 DEBUG engine  Author 0x4ca0…46a0 built block with step gap. current step: 306921159, parent step: 306921157

Then, finally, it produces its own version of block #4093447:

2018-08-18 15:23:17 UTC IO Worker #3 TRACE engine  generate_seal: Issuing a block for step 306921158.
2018-08-18 15:23:17 UTC IO Worker #3 TRACE miner  Received a Regular seal.

which causes the network to abandon the previous version of #4093447 built by 0x4ca0…46a0 and adopt the one from 0xed2a…1c36:

2018-08-18 15:23:17 UTC IO Worker #3 INFO reorg  Reorg to #4093447 0x3105…8229 (0xcb66…4903 #4093446 0xcbeb…269f )
2018-08-18 15:23:17 UTC IO Worker #3 TRACE engine  Multi ValidatorSet retrieved for block 509355.
2018-08-18 15:23:17 UTC IO Worker #3 DEBUG engine  Zooming to epoch for block 0x3105…8229

This effectively makes validator 0x4ca0…46a0 miss a block (is this a bug? shouldn't the network abandon the "late" version of #4093447?).

5chdn commented 6 years ago

bumping priority here.

andresilva commented 6 years ago

@wkarshat Only double votes are handled explicitly (when using the Validator Set here); everything else is handled implicitly by chain difficulty.

@phahulin I think your assessment is correct: ancient block import is blocking the import of new blocks. Both processes require a write lock, at least on the database; I'll see if we can reduce contention. I still don't understand why it keeps backtracking, though. Could you also try fetching trace logs for network and sync? It will be a lot of logs, but you can just dump them here (or contact me on gitter/riot) and I'll filter them on my side. I'm sorry, but I can't replicate this one easily since I can't seal blocks on the POA chain ;).
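
As a rough, self-contained illustration of that contention (hypothetical names, not Parity's actual locking code): if the ancient-block importer holds the shared database write lock for longer than one step, the sealing path cannot acquire the lock in time and the node misses its turn.

```rust
// Sketch only: two threads contending for one write lock, mimicking
// "ancient block import blocks import/sealing of new blocks".
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Shared "database" guarded by a single write lock.
    let db = Arc::new(RwLock::new(Vec::<u64>::new()));

    // Ancient block import: grabs the write lock and holds it for a long time.
    let importer_db = Arc::clone(&db);
    let importer = thread::spawn(move || {
        let mut guard = importer_db.write().unwrap();
        thread::sleep(Duration::from_secs(7)); // longer than one 5 s step
        guard.push(50_799); // pretend an ancient block was written
    });

    // Give the importer a head start so it acquires the lock first.
    thread::sleep(Duration::from_millis(100));

    // Sealing path: needs the same write lock to produce the new block.
    let sealer_db = Arc::clone(&db);
    let sealer = thread::spawn(move || {
        let start = Instant::now();
        let mut guard = sealer_db.write().unwrap(); // blocks until the importer releases
        guard.push(4_093_447);
        let waited = start.elapsed();
        if waited > Duration::from_secs(5) {
            println!("sealed too late: waited {:?}, the 5 s step was missed", waited);
        }
    });

    importer.join().unwrap();
    sealer.join().unwrap();
}
```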

The reason the chain re-orgs to the "late" block is that it was mined by the expected validator (no steps were missed); the chain difficulty formula gives a higher score to chains with fewer missed steps. If in the meantime a block #4093448 had been built on top of the "old" #4093447, then it would have been kept and no re-org would have happened.
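
A toy version of that scoring rule (a simplified stand-in, not the exact Parity formula): the block sealed by the expected validator skips no steps and therefore scores higher than the competing block sealed one step later, so the chain re-orgs to it.

```rust
// Stand-in for the large per-block constant; a real formula would use a much
// bigger base so that chain length dominates over skipped steps.
const BASE_SCORE: u64 = 1_000_000;

fn block_score(parent_step: u64, step: u64) -> u64 {
    // Each step skipped between parent and child lowers the block's score.
    BASE_SCORE - (step - parent_step - 1)
}

fn main() {
    let parent_step = 306_921_157; // step of #4093446 sealed by 0xdb12…ea85

    // "Early" #4093447 by 0x4ca0…46a0, sealed at step 306_921_159 (one skipped step).
    let early = block_score(parent_step, 306_921_159);

    // "Late" #4093447 by the expected validator 0xed2a…1c36, step 306_921_158 (no skips).
    let late = block_score(parent_step, 306_921_158);

    assert!(late > early);
    println!("late {} > early {}: the chain re-orgs to the late block", late, early);
}
```

With a large per-block base score, a two-block chain already built on the "old" #4093447 would still outscore the single "late" block, which matches the remark above about #4093448.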

phahulin commented 6 years ago

@andresilva I've sent you some logs on gitter; I'll try to gather some more and send them there too.

phahulin commented 6 years ago

Hi @andresilva, I've sent you a link on gitter to the logs with sync and network tracing enabled. They cover about 6 hours, and syncing restarts from the 0th block 44 times during that period.

phahulin commented 6 years ago

Is this related to https://github.com/paritytech/parity-ethereum/pull/9531 ?

d10r commented 5 years ago

We're also seeing this issue on a chain configured similarly to poa.network (Aura with 5 sec block time): https://explorer.sigma1.artis.network/reorgs
Most nodes run parity 2.1.6-stable.

AyushyaChitransh commented 5 years ago

Has this been fixed yet?

phahulin commented 5 years ago

Yes, this seems to be addressed by https://github.com/paritytech/parity-ethereum/pull/9531