paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

PVF: drop backing jobs if it is too late #5616

Closed. AndreiEres closed this 1 week ago

AndreiEres commented 2 months ago

Fixes https://github.com/paritytech/polkadot-sdk/issues/5530

This PR introduces the removal of backing jobs that have been back pressured for longer than allowedAncestryLen, as these candidates are no longer viable.

It is reasonable to expect a result from a backing job execution within allowedAncestryLen blocks. Therefore, we set the job TTL as a relay block number and keep the validation host in sync by sending it the activated leaves.
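
A minimal sketch of the idea, assuming a simple TTL check; the names below (BackingJob, allowed_ancestry_len, best_block_number) are illustrative and not the exact identifiers used in the PR:

// A minimal sketch, not the PR's actual code; all names are illustrative.
struct BackingJob {
    // Relay block number past which the job result is no longer useful.
    ttl: u32,
}

impl BackingJob {
    // The TTL is the candidate's relay-parent number plus the allowed ancestry length.
    fn new(relay_parent_number: u32, allowed_ancestry_len: u32) -> Self {
        Self { ttl: relay_parent_number + allowed_ancestry_len }
    }

    // On every activated leaf the validation host learns the new best block
    // number and drops jobs whose TTL has been exceeded.
    fn should_drop(&self, best_block_number: u32) -> bool {
        best_block_number > self.ttl
    }
}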

paritytech-cicd-pr commented 2 months ago

The CI pipeline was cancelled due to the failure of one of the required jobs. Job name: test-linux-stable 2/3. Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7288338

s0me0ne-unkn0wn commented 1 month ago

What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?

AndreiEres commented 1 month ago

What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?

That's a good question.

Do you mean a situation where a candidate would have been dropped under normal circumstances, but due to a BABE failure gets a chance to be backed? I would say that under normal circumstances we have enough time to execute the job; if we have delayed it for this long, it means we are overwhelmed. In my opinion, dropping this one lucky candidate hurts us less than not dropping it would.

burdges commented 1 month ago

What if, during the waiting period, Babe misses a slot or two and block times elongate? Will we discard a viable candidate, then? And do we care?

We hopefully do not care much. In fact, we should explore back pressure in block production too, meaning a relay chain block producer drops some backing statements because it believes the network to be overloaded. A dispute or implicit escalation could be "paid for" by delaying, say, 30 candidates.
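
A purely illustrative sketch of what that could look like; none of these names exist in polkadot-sdk, and the load signal and limits are assumptions:

// Hypothetical: an overloaded relay chain block author includes fewer backed
// candidates, "paying" for each active dispute by delaying roughly 30 of them.
fn select_backed_candidates<C>(mut backable: Vec<C>, active_disputes: usize) -> Vec<C> {
    const CANDIDATES_DELAYED_PER_DISPUTE: usize = 30;
    let keep = backable
        .len()
        .saturating_sub(active_disputes * CANDIDATES_DELAYED_PER_DISPUTE);
    backable.truncate(keep);
    backable
}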

AndreiEres commented 3 weeks ago

For the last five days, the PR with PVF job dropping has been live on kusama-validator-bhs5-0. Ten backing jobs have been dropped.

Seven jobs were dropped immediately after the node restart (2024-10-17 12:38:07). In the log excerpts, we can see that the time between when the job was in statement distribution and when it failed to execute was about 10 minutes. I assume the node was preparing artifacts during this time.

Three jobs were dropped during regular work without high load. According to the relay parent and the timestamp of the logs, the TTL and current best block were calculated correctly. Therefore, the jobs were no longer viable.

alexggh commented 3 weeks ago

I looked a bit into the error modes. At restart, I think it is expected that we are late on execution because of PVF compilation, so I would say this PR does the right thing in dropping unneeded candidates.

Three jobs were dropped during regular work without high load. According to the relay parent and the timestamp of the logs, the TTL and current best block were calculated correctly. Therefore, the jobs were no longer viable.

You can see clearly from the logs below that the parent (0xfff0…ef56) of the parachain blocks is imported really late, about 3 blocks after the first block with number #25395805 was imported.


2024-10-18 14:46:06.784  INFO tokio-runtime-worker substrate: 🆕 Imported #25395805 (0x831c…6bff → 0xfff0…ef56)
2024-10-18 14:46:00.734  INFO tokio-runtime-worker substrate: 🆕 Imported #25395807 (0x9d20…5920 → 0xd456…117d)
2024-10-18 14:46:00.590  INFO tokio-runtime-worker substrate: 🏆 Imported #25395807 (0x9d20…5920 → 0x19ff…5027)
2024-10-18 14:45:54.723  INFO tokio-runtime-worker substrate: 🏆 Imported #25395806 (0x4c24…f18f → 0x9d20…5920)
2024-10-18 14:45:48.450  INFO tokio-runtime-worker substrate: 🏆 Imported #25395805 (0xc199…d5c6 → 0x4c24…f18f)

In both of these cases, the parachain blocks are built and advertised on a fork block that is 2 blocks behind the block with the highest number.

Overall, I think the code did the right thing here and dropped these PVF execution jobs on forks that won't survive. However, this raises a small issue with the implementation of choosing the best block number in PVF workers: it is not fork-aware. We can have this relay chain:

A -> B -> C -> E
|------------> D

So, if D's import gets delayed by 2 blocks, then when it is finally imported, all parachain blocks produced on it will be rejected by the PVF executor, because it has already imported E, whose block number is at least 2 greater than D's.

Of course, the D fork is most likely abandoned, so bailing out early on checking parachain blocks is probably the right thing to do, but we should ask around whether there is a possibility for D to be picked as the relay chain fork that other nodes choose to build on.
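
To make the failure mode concrete, a hypothetical sketch (names are illustrative, not from the PR): a fork-unaware view keeps a single maximum block number across all imported leaves, so importing E raises it even for jobs anchored on the D fork.

// Hypothetical, fork-unaware view of the "best block number": a single
// maximum over every activated leaf, regardless of which fork it belongs to.
struct ForkUnawareView {
    best_seen_number: u32,
}

impl ForkUnawareView {
    fn on_leaf_activated(&mut self, leaf_number: u32) {
        // Importing E (the higher-numbered fork) raises this for everyone,
        // including jobs whose relay parent is the late-imported D.
        self.best_seen_number = self.best_seen_number.max(leaf_number);
    }
}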

sandreim commented 3 weeks ago

I looked a bit into the error modes. At restart, I think it is expected that we are late on execution because of PVF compilation, so I would say this PR does the right thing in dropping unneeded candidates.

Thanks @alexggh, it makes sense.

A -> B -> C -> E
|------------> D

So, if D's import gets delayed by 2 blocks, then when it is finally imported, all parachain blocks produced on it will be rejected by the PVF executor, because it has already imported E, whose block number is at least 2 greater than D's.

Good point.

Of course, the D fork is most likely abandoned, so bailing out early on checking parachain blocks is probably the right thing to do, but we should ask around whether there is a possibility for D to be picked as the relay chain fork that other nodes choose to build on.

It can happen that B contains a candidate that gets disputed and reverted; then the chain will continue to build on D. We need to keep track of the best block number for all active leaves.
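
A rough sketch of that direction, assuming we can already tell which active leaf a job's relay parent belongs to (resolving that ancestry is omitted here); the names are illustrative, not actual polkadot-sdk identifiers:

use std::collections::HashMap;

// Illustrative only: track the block number of every active leaf and judge a
// job's staleness against the fork it was actually built on, rather than
// against a single global maximum.
struct ActiveLeaves {
    // active leaf hash -> that leaf's block number
    numbers: HashMap<[u8; 32], u32>,
}

impl ActiveLeaves {
    fn should_drop_on(&self, leaf: &[u8; 32], job_ttl: u32) -> bool {
        // Unknown leaf: treat the fork as gone and drop the job.
        self.numbers.get(leaf).map_or(true, |n| *n > job_ttl)
    }
}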