paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

PVF: drop backing jobs if it is too late #5616

Closed. AndreiEres closed this 1 week ago

AndreiEres commented 2 months ago

Fixes https://github.com/paritytech/polkadot-sdk/issues/5530

This PR introduces the removal of backing jobs that have been back pressured for longer than allowedAncestryLen, as these candidates are no longer viable.

It is reasonable to expect a result from a backing job execution within allowedAncestryLen blocks. Therefore, we set the job TTL as a relay block number and keep the validation host in sync by sending it the activated leaves.
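
A minimal sketch of the idea, assuming a simple TTL check; the names below (BackingJob, allowed_ancestry_len, best_block_number) are illustrative and not the exact identifiers used in the PR:

// A minimal sketch, not the PR's actual code; all names are illustrative.
struct BackingJob {
    // Relay block number past which the job result is no longer useful.
    ttl: u32,
}

impl BackingJob {
    // The TTL is the candidate's relay-parent number plus the allowed ancestry length.
    fn new(relay_parent_number: u32, allowed_ancestry_len: u32) -> Self {
        Self { ttl: relay_parent_number + allowed_ancestry_len }
    }

    // On every activated leaf the validation host learns the new best block
    // number and drops jobs whose TTL has been exceeded.
    fn should_drop(&self, best_block_number: u32) -> bool {
        best_block_number > self.ttl
    }
}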

paritytech-cicd-pr commented 2 months ago

The CI pipeline was cancelled due to the failure of one of the required jobs. Job name: test-linux-stable 2/3. Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7288338

s0me0ne-unkn0wn commented 1 month ago

What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?

AndreiEres commented 1 month ago

What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?

That's a good question.

Do you mean a situation where a candidate would have been dropped under normal circumstances, but due to a BABE failure gets a chance to be backed? I would say that under normal circumstances we have enough time to execute the job; if we have delayed it for this long, it means we are overwhelmed. In my opinion, dropping this one lucky candidate hurts us less than not dropping it would.

burdges commented 1 month ago

What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?

We hopefully do not care much. In fact, we should explore back pressure in block production too, meaning a relay chain block producer drops some backing statements because it believes the network to be overloaded. A dispute or implicit escalation could be "paid for" by delaying, say, 30 candidates.
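
A purely illustrative sketch of what that could look like; none of these names exist in polkadot-sdk, and the load signal and limits are assumptions:

// Hypothetical: an overloaded relay chain block author includes fewer backed
// candidates, "paying" for each active dispute by delaying roughly 30 of them.
fn select_backed_candidates<C>(mut backable: Vec<C>, active_disputes: usize) -> Vec<C> {
    const CANDIDATES_DELAYED_PER_DISPUTE: usize = 30;
    let keep = backable
        .len()
        .saturating_sub(active_disputes * CANDIDATES_DELAYED_PER_DISPUTE);
    backable.truncate(keep);
    backable
}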

AndreiEres commented 3 weeks ago

For the last five days, the PR with PVF job dropping has been live on kusama-validator-bhs5-0. Ten backing jobs have been dropped.

Seven jobs were dropped immediately after the node restart (2024-10-17 12:38:07). In the log excerpts, we can see that the time between when the job was in statement distribution and when it failed to execute was about 10 minutes. I assume the node was preparing artifacts during this time.

Three jobs were dropped during regular work without high load. According to the relay parent and the timestamp of the logs, the TTL and current best block were calculated correctly. Therefore, the jobs were no longer viable.

alexggh commented 3 weeks ago

I looked a bit into the error modes. At restart, I think it is expected that we are late on execution because of PVF compilation, so I would say this PR does the right thing in dropping unneeded candidates.

Three jobs were dropped during regular work without high load. According to the relay parent and the timestamp of the logs, the TTL and current best block were calculated correctly. Therefore, the jobs were no longer viable.

You can see clearly from the logs below that the parent (0xfff0…ef56) of the parachain blocks is imported really late, about 3 blocks after the first block with number #25395805 was imported.


2024-10-18 14:46:06.784  INFO tokio-runtime-worker substrate: 🆕 Imported #25395805 (0x831c…6bff → 0xfff0…ef56)
2024-10-18 14:46:00.734  INFO tokio-runtime-worker substrate: 🆕 Imported #25395807 (0x9d20…5920 → 0xd456…117d)
2024-10-18 14:46:00.590  INFO tokio-runtime-worker substrate: 🏆 Imported #25395807 (0x9d20…5920 → 0x19ff…5027)
2024-10-18 14:45:54.723  INFO tokio-runtime-worker substrate: 🏆 Imported #25395806 (0x4c24…f18f → 0x9d20…5920)
2024-10-18 14:45:48.450  INFO tokio-runtime-worker substrate: 🏆 Imported #25395805 (0xc199…d5c6 → 0x4c24…f18f)

In both of these cases, the parachain blocks are built and advertised on a fork block that is 2 blocks behind the block with the highest number.

Overall, I think the code did the right thing here and dropped these PVF execution jobs on forks that won't survive. However, this raises a small issue with the implementation of choosing the best block number in PVF workers: it is not fork-aware. We can have this relay chain:

A -> B -> C -> E
|------------> D

So, if D's import gets delayed by 2 blocks, then when it is finally imported, all parachain blocks produced on it will be rejected by the PVF executor, because it has already imported E, whose block number is at least 2 greater than D's.

Of course, the D fork is most likely abandoned, so bailing out early on checking parachain blocks is probably the right thing to do, but we should ask around whether there is a possibility for D to be picked as the relay chain fork that other nodes choose to build on.
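
To make the failure mode concrete, a hypothetical sketch (names are illustrative, not from the PR): a fork-unaware view keeps a single maximum block number across all imported leaves, so importing E raises it even for jobs anchored on the D fork.

// Hypothetical, fork-unaware view of the "best block number": a single
// maximum over every activated leaf, regardless of which fork it belongs to.
struct ForkUnawareView {
    best_seen_number: u32,
}

impl ForkUnawareView {
    fn on_leaf_activated(&mut self, leaf_number: u32) {
        // Importing E (the higher-numbered fork) raises this for everyone,
        // including jobs whose relay parent is the late-imported D.
        self.best_seen_number = self.best_seen_number.max(leaf_number);
    }
}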

sandreim commented 3 weeks ago

I looked a bit into the error modes. At restart, I think it is expected that we are late on execution because of PVF compilation, so I would say this PR does the right thing in dropping unneeded candidates.

Thanks @alexggh, it makes sense.

A -> B -> C -> E
|------------> D

So, if D's import gets delayed by 2 blocks, then when it is finally imported, all parachain blocks produced on it will be rejected by the PVF executor, because it has already imported E, whose block number is at least 2 greater than D's.

Good point.

Of course, the D fork is most likely abandoned, so bailing out early on checking parachain blocks is probably the right thing to do, but we should ask around whether there is a possibility for D to be picked as the relay chain fork that other nodes choose to build on.

It can happen that B contains a candidate that gets disputed and reverted; then the chain will continue to build on D. We need to keep track of the best block number for all active leaves.
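
A rough sketch of that direction, assuming we can already tell which active leaf a job's relay parent belongs to (resolving that ancestry is omitted here); the names are illustrative, not actual polkadot-sdk identifiers:

use std::collections::HashMap;

// Illustrative only: track the block number of every active leaf and judge a
// job's staleness against the fork it was actually built on, rather than
// against a single global maximum.
struct ActiveLeaves {
    // active leaf hash -> that leaf's block number
    numbers: HashMap<[u8; 32], u32>,
}

impl ActiveLeaves {
    fn should_drop_on(&self, leaf: &[u8; 32], job_ttl: u32) -> bool {
        // Unknown leaf: treat the fork as gone and drop the job.
        self.numbers.get(leaf).map_or(true, |n| *n > job_ttl)
    }
}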