The CI pipeline was cancelled due to the failure of one of the required jobs. Job name: test-linux-stable 2/3. Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7288338
What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?
> What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?
That's a good question.
Do you mean a situation where a candidate should have been dropped under normal circumstances, but due to a BABE failure received a chance to be backed? I would say that under normal circumstances we have enough time to execute the job. If we have delayed it for so long, it means we're overwhelmed. Dropping this lucky candidate will have more impact than not dropping it, in my opinion.
> What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?
We hopefully do not care much. In fact, we should explore back pressure in block production too, meaning a relay chain block producer drops some backing statements because they believe the network to be overloaded. A dispute or implicit escalation could be "paid for" by delaying, say, 30 candidates.
For the last five days, the PR with PVF job dropping has been live on kusama-validator-bhs5-0. 10 backing jobs have been dropped:
7 jobs were dropped immediately after the node restart (2024-10-17 12:38:07). In the log excerpts, we can see that the time between when the job was in statement distribution and when it failed to execute was about 10 minutes. I assume the node was preparing artifacts during this time.
Three jobs were dropped during regular work without high load. According to the relay parent and the timestamp of the logs, the TTL and current best block were calculated correctly. Therefore, the jobs were no longer viable.
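A minimal sketch of the viability check described above, using hypothetical types rather than the actual polkadot-sdk code: the job carries a TTL expressed as a relay block number, and it is dropped once the current best block has moved past it, which is what produces "Job has a deadline" log lines like the ones quoted further down.

```rust
// Hypothetical sketch, not the actual polkadot-sdk types or API.
type BlockNumber = u32;

struct ExecuteJob {
    // Last relay block number at which the execution result is still useful.
    ttl: BlockNumber,
}

/// Returns true if the job should be dropped instead of executed.
fn is_job_expired(job: &ExecuteJob, current_best: BlockNumber) -> bool {
    current_best > job.ttl
}

fn main() {
    // Numbers taken from the first dropped job reported below.
    let job = ExecuteJob { ttl: 25_395_807 };
    let now: BlockNumber = 25_395_808;
    if is_job_expired(&job, now) {
        println!("Job has a deadline ttl={} now={}", job.ttl, now);
    }
}
```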
Looked a bit into the error modes. At restart, I think it is expected that we are late on execution because of PVF compilation, so I would say this PR does the right thing in dropping unneeded candidates.
> Three jobs were dropped during regular work without high load. According to the relay parent and the timestamp of the logs, the TTL and current best block were calculated correctly. Therefore, the jobs were no longer viable.
0x3af20af6d19307d4cc159772826bc1b456a8fa04dbb339cbd0a9f9af38e243b4
- https://grafana.teleport.parity.io/goto/lH-ro8mHR?orgId=1
- https://kusama.subscan.io/block/25395808
- 2024-10-18 14:46:08.049 pvf: Job has a deadline priority=Backing ttl=25395807 now=25395808
- A non-viable candidate was advertised.
You can see clearly from the logs below that the parent (0xfff0…ef56) of the parachain blocks was imported really late, around 3 blocks after the first block with number #25395805 was imported.
2024-10-18 14:46:06.784 INFO tokio-runtime-worker substrate: 🆕 Imported #25395805 (0x831c…6bff → 0xfff0…ef56)
2024-10-18 14:46:00.734 INFO tokio-runtime-worker substrate: 🆕 Imported #25395807 (0x9d20…5920 → 0xd456…117d)
2024-10-18 14:46:00.590 INFO tokio-runtime-worker substrate: 🏆 Imported #25395807 (0x9d20…5920 → 0x19ff…5027)
2024-10-18 14:45:54.723 INFO tokio-runtime-worker substrate: 🏆 Imported #25395806 (0x4c24…f18f → 0x9d20…5920)
2024-10-18 14:45:48.450 INFO tokio-runtime-worker substrate: 🏆 Imported #25395805 (0xc199…d5c6 → 0x4c24…f18f)
0x464e09319a61670fa6c2313e645ae46ef7d5d4276999e1d173b4fe393e4ba38e
- https://grafana.teleport.parity.io/goto/5ULqT8iHg?orgId=1
- https://kusama.subscan.io/block/25415637
- 2024-10-20 00:31:14.079 pvf: Job has a deadline priority=Backing ttl=25415636 now=25415637
- Collation was too late
0x7e792950b378bff8d48d3085c56a8d862cd02ba9b27835ab91fd71d93287ea0a
- https://grafana.teleport.parity.io/goto/D3KeT8mNg?orgId=1
- https://kusama.subscan.io/block/25415637
- 2024-10-20 00:31:14.078 pvf: Job has a deadline priority=Backing ttl=25415636 now=25415637
- Collation was too late
In both of these cases, the parachain block was built and advertised on a fork block that is 2 blocks behind the highest block.
Overall, I think the code did the right thing here and dropped these PVF execution jobs on forks that won't survive. However, this raises a small issue with the implementation of choosing the best block number in the PVF workers and the fact that it is not fork-aware. We can have this relay chain:
A -> B -> C -> E
|------------> D
So, if D's import gets delayed by 2 blocks, then when it finally gets imported, all parachain blocks produced on it will be rejected by the PVF executor, because the node already imported E, whose block number is at least 2 greater than that of D.
Of course, the D fork is most likely abandoned, so bailing early on checking parachain blocks is probably the right thing to do, but we should ask around whether there is a possibility for D to be picked as the relay chain fork other nodes choose to build on.
> Looked a bit into the error modes. At restart, I think it is expected that we are late on execution because of PVF compilation, so I would say this PR does the right thing in dropping unneeded candidates.
Thanks @alexggh, it makes sense.
> A -> B -> C -> E
> |------------> D
>
> So, if D's import gets delayed by 2 blocks, then when it finally gets imported, all parachain blocks produced on it will be rejected by the PVF executor, because the node already imported E, whose block number is at least 2 greater than that of D.
Good point.
> Of course, the D fork is most likely abandoned, so bailing early on checking parachain blocks is probably the right thing to do, but we should ask around whether there is a possibility for D to be picked as the relay chain fork other nodes choose to build on.
It can happen that B contains a candidate that gets disputed and reverted, so the chain will then continue to build on D. We need to keep track of the best block number for all active leaves.
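A minimal sketch of the fork-aware variant suggested here, assuming the validation host can be fed the block number of every active leaf; the names (`LeafTracker`, `leaf_activated`, ...) are illustrative, not the actual polkadot-sdk API. A job would only be considered expired once every active leaf, including a late-imported fork like D, is past its TTL.

```rust
// Illustrative sketch only: fork-aware expiry based on all active leaves.
use std::collections::HashMap;

type Hash = [u8; 32];
type BlockNumber = u32;

#[derive(Default)]
struct LeafTracker {
    // Block number of each active leaf, updated as leaves are (de)activated.
    active_leaves: HashMap<Hash, BlockNumber>,
}

impl LeafTracker {
    fn leaf_activated(&mut self, hash: Hash, number: BlockNumber) {
        self.active_leaves.insert(hash, number);
    }

    fn leaf_deactivated(&mut self, hash: &Hash) {
        self.active_leaves.remove(hash);
    }

    /// A job is only dropped when no active leaf (no fork, e.g. the late D
    /// above) could still make use of the result.
    fn is_job_expired(&self, ttl: BlockNumber) -> bool {
        self.active_leaves.values().all(|&number| number > ttl)
    }
}

fn main() {
    let mut tracker = LeafTracker::default();
    tracker.leaf_activated([0xE; 32], 105); // leaf E on the longer fork
    tracker.leaf_activated([0xD; 32], 103); // late-imported leaf D
    // A job with ttl=104 is still viable from D's point of view.
    assert!(!tracker.is_job_expired(104));
    tracker.leaf_deactivated(&[0xD; 32]);
    assert!(tracker.is_job_expired(104));
}
```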
Fixes https://github.com/paritytech/polkadot-sdk/issues/5530
This PR introduces the removal of backing jobs that have been back-pressured for longer than `allowedAncestryLen`, as these candidates are no longer viable. It is reasonable to expect a result for a backing job execution within `allowedAncestryLen` blocks. Therefore, we set the job TTL as a relay block number and synchronize the validation host by sending activated leaves.
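As a rough illustration of this mechanism (the names and the exact TTL formula below are assumptions, not the PR's actual code), the TTL can be derived from the relay parent's block number plus `allowedAncestryLen`, and expiry is then judged against block numbers carried by activated leaves sent to the validation host.

```rust
// Illustrative sketch; `ALLOWED_ANCESTRY_LEN` and `BackingJob` are made up
// for this example. In the real code the ancestry length comes from the
// async backing parameters (allowedAncestryLen).
type BlockNumber = u32;

const ALLOWED_ANCESTRY_LEN: BlockNumber = 3;

struct BackingJob {
    relay_parent_number: BlockNumber,
    ttl: BlockNumber,
}

impl BackingJob {
    fn new(relay_parent_number: BlockNumber) -> Self {
        Self {
            relay_parent_number,
            // The result is only useful while the relay parent is still
            // within the allowed ancestry of the best block.
            ttl: relay_parent_number + ALLOWED_ANCESTRY_LEN,
        }
    }
}

fn main() {
    let job = BackingJob::new(25_395_804);
    // When an activated leaf with number 25_395_808 reaches the validation
    // host, the job's TTL (25_395_807) has passed and the job is dropped.
    assert!(25_395_808 > job.ttl);
    println!("relay_parent={} ttl={}", job.relay_parent_number, job.ttl);
}
```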