paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

PVF: prioritise execution depending on context #4632

Open sandreim opened 1 month ago

sandreim commented 1 month ago

From the node perspective, the goal is to trade off parachain liveness for finality. When finality lag hits a threshold, the mechanism should kick in and enforce the following priority:

In case of node overload, this change adds back pressure to candidate backing without dropping already-backed candidates. Thanks to async backing, delayed backing still allows a candidate to be backed in the next relay chain block, as long as its relay parent has not expired.
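As a rough sketch of the ordering being proposed (the enum and names below are illustrative, not the actual polkadot-sdk types): disputes outrank approvals, which outrank backing, and backing is the work that can safely be delayed under load.

```rust
// Illustrative priority ordering for PVF execution requests.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ExecPriority {
    // Lowest priority: backing can be delayed under load, since async backing
    // lets a candidate be backed in a later relay chain block as long as its
    // relay parent has not expired.
    Backing,
    // Approval (and availability) work keeps finality moving.
    Approval,
    // Disputes get the highest priority once finality lag crosses a threshold.
    Dispute,
}

fn main() {
    // `Ord` follows declaration order, so disputes outrank approvals,
    // which outrank backing.
    let mut queue = vec![ExecPriority::Backing, ExecPriority::Dispute, ExecPriority::Approval];
    queue.sort_by(|a, b| b.cmp(a)); // highest priority first
    assert_eq!(queue[0], ExecPriority::Dispute);
    println!("{:?}", queue);
}
```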

Some more details on the subject can be found in this discussion

burdges commented 1 month ago

I'm not perfectly happy about disputes being higher priority than approvals, depending upon what approvals means:

If we delay approval announcements then we risk adversaries breaking soundness when honest nodes get bogged down.

If we delay approval checks, then we risk unnecessary escalations. If we add some "I'm here but overworked" vote then we incur that code complexity.

We do not necessarily have a choice but to prioritize disputes of course, since disputes mess everything up anyways. We could discuss "soft" options like a dispute appearing on-chain turns off backing & inclusion for a few blocks. It's tricky picking how long though: A dispute could resolve fast, but not a timeout dispute, although maybe polkavm fixes that.

sandreim commented 1 month ago

I'm not perfectly happy about disputes being higher priority than approvals, depending upon what approvals means:

If we delay approval announcements then we risk adversaries breaking soundness when honest nodes get bogged down.

If we delay approval checks, then we risk unnecessary escalations. If we add some "I'm here but overworked" vote then we incur that code complexity.

This is what I had in mind. It will indeed cause escalations, but that is fine: it back-pressures backing PVF execution (lowest priority), which means less approval work for the next block, allowing the network to recover.

We do not necessarily have a choice but to prioritize disputes of course, since disputes mess everything up anyways. We could discuss "soft" options like a dispute appearing on-chain turns off backing & inclusion for a few blocks. It's tricky picking how long though: A dispute could resolve fast, but not a timeout dispute, although maybe polkavm fixes that.

I totally agree, we will have to prioritise disputes, but maybe we should not starve approvals like we would do with backing, and still allow 10% approval PVF execution vs 90% disputes or something like that.

burdges commented 1 month ago

Alright, delaying the approval checks is secure. It'll just cause more slowdowns, which'll ultimately hit backing.

If approvals remain at 100%, but backing stops completely for one block, then we've made time to process one dispute, well except for the extra reconstruction overhead. We might've more than one dispute of course, but someone is being billed for them. We cannot stop backing completely either, but..

Also, if approvals run at maybe 60% or something, then we typically still finish approval checks before we no-show, which avoids escalations.

I think the first question here is: At the time disputes occur, how hard do we back pressure our local backing? We could skip 2-3 backing slots or something every time a dispute occurs, even if we do not feel the pressure ourselves?

AndreiEres commented 2 weeks ago

@sandreim @burdges I'd like to clarify the final numbers:

Please correct me if I got it wrong.

sandreim commented 2 weeks ago

My current thinking is to still allow 10% backing. Getting at least 10% approvals is not going to have any impact on finality if it is lagging because of disputes, but it will reduce the number of no-shows, which is beneficial. However, because disputes happen very rarely, I don't think it is worth doing.

To get a clear answer we should run some dispute load tests on Versi with both scenarios.

burdges commented 2 weeks ago

We're discussing situations where disputes have occurred across many cores, or in many time slots for only a few parachains.

If we've only a few parachains causing many disputes, then maybe they could be handled in sequence, but not sure doing so warrants extra complexity. We might however sequence disputes anyways, like maybe oldest first so reversions turn the rest into remote disputes.

In principle, we could do remote disputes more lazily, but we've no consensus upon which disputes are remote, unless grandpa finalizes something remote from the dispute's perspective. Again not sure if we need code for this: Yes, this could save considerable CPU during a storm of correct disputes of invalid blocks, but that's even more rare. Imho, incorrect storms of disputes of valid blocks sound way more likely.

We should likely keep that part simple for now, but just discussing the design space there.

Anyways..

Approvals should imho never halt completely, not sure how much reserve, but 10% sounds reasonable.

We only start with backing tasks when other queues are empty.

Approvals might never be empty, so not sure this works per se.

In theory, I'd halt backing completely if we've really exhausted all other resources. We'd maybe want some exceptions though, like system parachains doing elections, or DKGs, or maybe other governance. If we'd want such special reserved system parachains, then we might start by simply never halting backing completely, and then later, if necessary, add per parachain configuration there.

Approvals should always have more reserved capacity, as otherwise approvals might never catch up vs backing.

At what points should we trigger this back pressure on backing?

That's like 5 possible overworked metrics. Any thoughts on what's simplest? If we only do one, then queue fullness sounds good, because maybe our queue fills up because others no-show. If we do two, then maybe add network no-shows or finality. I think beyond this then we should more carefully assess.

sandreim commented 2 weeks ago

We're discussing situations where disputes have occurred across many cores, or in many time slots for only a few parachains.

If we've only a few parachains causing many disputes, then maybe they could be handled in sequence, but not sure doing so warrants extra complexity. We might however sequence disputes anyways, like maybe oldest first so reversions turn the rest into remote disputes.

With validator disabling, these things are looking very good now. I don't think we should optimize for disputes, and in the current proposal they already have max priority.

We should likely keep that part simple for now, but just discussing the design space there.

Sure, we just want some simple prioritization to better control system load. Further optimizations should be done later, only if needed.

Approvals should imho never halt completely, not sure how much reserve, but 10% sounds reasonable.

What is the reasoning for this 10%? Do you believe that reducing no-shows by a small margin improves the overall outcome of a dispute storm scenario?

We only start with backing tasks when other queues are empty.

Approvals might never be empty, so not sure this works per se.

Yes, that might be true to some extent, so we need to reserve some % resources for backing.

In theory, I'd halt backing completely if we've really exhausted all other resources. We'd maybe want some exceptions though, like system parachains doing elections, or DKGs, or maybe other governance. If we'd want such special reserved system parachains, then we might start by simply never halting backing completely, and then later, if necessary, add per parachain configuration there.

Yes, we should reserve capacity for the system parachains; giving something like 10% to backing no matter the load should not make the situation worse.

Approvals should always have more reserved capacity, as otherwise approvals might never catch up vs backing.

Yeah, they have 90% if there are no disputes. We can consider lowering the backing reservation to just 5%, so that 1 in 20 executions is guaranteed to be backing. If these are light blocks it should still work fine. However, if blocks take 2s, it will take 40s for the candidate to be backed, by which time its relay parent has already expired.

At what points should we trigger this back pressure on backing?

  • Individual criteria stop us overworking ourselves:

    • Slow/delay backing based upon our CPU load --- I dislike this one because high CPU load means we're wasting less silicon.

This is hard to measure in practice, but the prioritization proposed here actually achieves this without measuring the load directly.

  • Halt/delay/slow backing if we're currently in no-show.

I think it is a good idea; this can happen, for example, because availability recovery is slow because of the network. In this case, there wouldn't be any actual CPU load, so the prioritization doesn't help at all.

  • Delay/slow backing if our queue is too full --- This might be more sensitive than no-shows, but what does it mean if our queue is not full, but we're in no-show?

The prioritization here should delay backing if the queue is full, in the sense that there is more work than we can sustain in a reasonable time. Fullness depends on how long each execution takes: you could have 10 executions queued at 2s each, or 30 executions at 100ms each. It's hard to know how full the queue really is until you execute it. We could optimize by adding a TTL for the backing executions and drop them if the relay parent has expired (a sketch follows after this list).

  • Network criteria stop us overworking others:

    • Slow/delay backing dependent upon finality lag --- Relatively simple to configure.

This must be done in the runtime. We just need to put finality proofs and we can select how many get backed.

  • Slow/delay backing dependent upon how many no-shows exist overall --- This is nicely objective, and much more reactive, but maybe too reactive, and adds more parameters.

No-shows typically raise the CPU/network load, as more tranches are needed to cover, so we would slow down backing a bit later anyway. Doing it earlier, as you propose, can probably improve the situation faster.

That's like 5 possible overworked metrics. Any thoughts on what's simplest? If we only do one, then queue fullness sounds good, because maybe our queue fills up because others no-show. If we do two, then maybe add network no-shows or finality. I think beyond this then we should more carefully assess.

We want to implement the prioritization first. Then we do some glutton testing and calibrate the backing percentage. We should also test 10% approval reservation.
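A minimal sketch of the TTL idea mentioned above for dropping queued backing executions whose relay parent has expired. The `BackingJob` type and the ancestry constant are stand-ins for illustration, not the actual polkadot-sdk types or configuration values.

```rust
use std::collections::VecDeque;

// Hypothetical queued backing job: the block number of its relay parent plus
// whatever is needed to run the execution (elided here).
struct BackingJob {
    relay_parent_number: u32,
}

// Relay parents are only valid for backing for a bounded number of relay chain
// blocks; this constant is a stand-in, not the real configuration value.
const ALLOWED_ANCESTRY_LEN: u32 = 3;

/// Drop queued backing executions whose relay parent has already expired,
/// so they never occupy an execution slot while we are back pressured.
fn prune_expired(queue: &mut VecDeque<BackingJob>, best_block_number: u32) {
    queue.retain(|job| {
        best_block_number.saturating_sub(job.relay_parent_number) <= ALLOWED_ANCESTRY_LEN
    });
}

fn main() {
    let mut queue: VecDeque<BackingJob> = VecDeque::from(vec![
        BackingJob { relay_parent_number: 90 },
        BackingJob { relay_parent_number: 99 },
    ]);
    prune_expired(&mut queue, 100);
    assert_eq!(queue.len(), 1); // the job anchored at block 90 was dropped
}
```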

burdges commented 2 weeks ago

Approvals should imho never halt completely, not sure how much reserve, but 10% sounds reasonable. What is the reasoning for this 10%?

I'd actually prefer a higher minimum for approvals, like max 70% for disputes, minimum 25% for approvals, and minimum 5% for critical system parachains, but no backing reservation for other parachains, or even non-critical system parachains really.

We could likely go below this 25%, but then we should think about critical vs non-critical system parachains. We could ignore critical vs non-critical system parachains and simply reserve more approval time for them. Or maybe we'll have so few system parachains that 10% is already enough that we can ignore this distinction?

Do you believe that reducing no-shows by a small margin improves the overall outcome of a dispute storm scenario?

Yeah, every no-show adds 2.25 more checkers.

This is hard to measure in practice, but the prioritization proposed here actually achieves this without measuring the load directly.

Alright yeah let's not directly measure CPU load. :)

Halt/delay/slow backing if we're currently in no-show. I think it is a good idea; this can happen, for example, because availability recovery is slow because of the network. In this case, there wouldn't be any actual CPU load, so the prioritization doesn't help at all.

Interesting, the queue could be empty but slow recovery might be causing no-shows. At what times do we expect this?

At least one scenario goes: Some validators get elected who have network disruptions to many other validators, like NATs or BGP attacks. Or maybe some validators try earning themselves extra points in availability by voting early that they have their chunk, assuming they'll get it eventually. If enough do this, then it hurts availability.

I guess back pressure makes sense here, but it's clearly secondary to the queue.

Delay/slow backing if our queue is too full --- This might be more sensitive than no-shows, but what does it mean if our queue is not full, but we're in no-show? The prioritization here should delay backing if the queue is full, in the sense that there is more work than we can sustain in a reasonable time. Fullness depends on how long each execution takes: you could have 10 executions queued at 2s each, or 30 executions at 100ms each. It's hard to know how full the queue really is until you execute it. We could optimize by adding a TTL for the backing executions and drop them if the relay parent has expired.

Can we just assume worst case here? We'd tune based on gluttons I guess, so maybe that's easiest anyways?

Slow/delay backing dependent upon finality lag --- Relatively simple to configure. This must be done in the runtime. We just need to put finality proofs and we can select how many get backed.

Why? I'd think the opposite: We know what our grandpa is voting on, so if it's too old then apply back pressure.

Slow/delay backing dependent upon how many no-shows exist overall --- This is nicely objective, and much more reactive, but maybe too reactive, and adds more parameters. No-shows typically raise the CPU/network load, as more tranches are needed to cover, so we would slow down backing a bit later anyway. Doing it earlier, as you propose, can probably improve the situation faster.

Alright so we've boiled down my criteria list to: (a) worst-case queue-based estimate, (b) overall no-show count, (c) our own local no-shows.

We want to implement the prioritization first.

Yeah of course, one thing at a time. We can talk later about whether we really need all of a,b,c. :)

I'd missed one factor above: We need back pressure not only in our own backing, as discussed here, but also in relay chain block production: If dishonest or confused backers produce backing statements while being back pressured, then an honest relay chain block producer should skip their statements too. We need not rush into doing this, since it only matters if nodes are being dishonest, connecting poorly, etc.

Anyways, we'd eventually wire some of those three criteria into (x) back pressuring doing & gossiping backing statements, and (y) placing backing statements into relay chain blocks.

AndreiEres commented 1 week ago

In my proposal we have 3 levels of priority. If we talk about giving 5-10% of execution to backing for system parachains, it may end up adding another level. Instead, I'd put system parachain backing in the same queue as approval work.

So we'll have that mapping:

I'd start with a 70/20/10 split and then we can tweak it as we test. What do you think @sandreim @burdges?

burdges commented 1 week ago

Are those numbers CPU time "priorities" for those tasks when back pressure gets applied? Or are they overall allocations at all times? If a distinguished state, then any idea how quickly we can enter and leave the "back pressured" state?

I presume those priorities fall through, so if there are no disputes then the 100% is divided among the remaining categories?

Approvals & system parachain backing should not be given the same weight. We do not notice this problem right now, but if someone does something silly like give longer runtimes (JAM) then this'll break.

Approvals should be given much more CPU time, which under good conditions resembles the parameter choices in https://github.com/paritytech/polkadot-sdk/issues/640. In other words, if there are no problems then approvals should've roughly relayVrfModuloSamples + 3 times the CPU time of all backing jobs. I'm still eyeballing that +3 there but anyways..

At a more intuitive level, two backing statements create work for like 35 other nodes, but possibly 1000 other nodes if we're experiencing network disruptions, so approvals should be like 14 times more CPU than all backing, even if zero no-shows occur, not so different from the relayVrfModuloSamples + 3 estimate. Ideally this would be higher when back pressure gets applied.

Anyways..

It sounds like you do not have a distinguished back pressure state yet, so then maybe the numbers should be:

70% disputes
28% approvals & availability
2% all parachains, both system and other.

If there are no disputes then we'd choose backing over approvals 6.7% of the time, which sounds reasonable. It's still possible this change results in more delayed backing votes, but that likely signals some problems elsewhere. Any thoughts @alexggh?
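A minimal sketch of that fall-through behaviour, assuming hypothetical names and the 70/28/2 weights above: empty categories simply donate their weight to the rest, which is how the 6.7% figure falls out (2 / (28 + 2)).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Work {
    Dispute,
    Approval,
    Backing,
}

// Weights follow the 70/28/2 split proposed above.
const WEIGHTS: [(Work, u32); 3] = [(Work::Dispute, 70), (Work::Approval, 28), (Work::Backing, 2)];

/// Pick the next kind of work given a pseudo-random roll. Weights of empty
/// queues fall through to the remaining categories, so with no disputes
/// pending, backing is chosen 2 / (28 + 2) ≈ 6.7% of the time.
fn pick(non_empty: impl Fn(Work) -> bool, roll: u32) -> Option<Work> {
    let total: u32 = WEIGHTS.iter().filter(|(w, _)| non_empty(*w)).map(|(_, x)| *x).sum();
    if total == 0 {
        return None;
    }
    let mut point = roll % total;
    for (work, weight) in WEIGHTS {
        if !non_empty(work) {
            continue;
        }
        if point < weight {
            return Some(work);
        }
        point -= weight;
    }
    None
}

fn main() {
    // With no disputes queued, count how often backing would be picked.
    let backing_picks = (0u32..30)
        .filter(|roll| pick(|w| w != Work::Dispute, *roll) == Some(Work::Backing))
        .count();
    println!("backing picked {} time(s) out of 30 rolls (~6.7%)", backing_picks);
}
```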

Also..

We can separate system and other parachains once we eventually add some distinguished back pressure state, but the simplest breakdown under back pressure would be

70% disputes
28% approvals & availability
2% system parachain backing
0% other parachain backing

In this, worst case includes several meanings:

  1. the disruption looks more than transitory -- If we'd a transitory disruption then we'd recover quickly anyways, but maybe applying & unapplying this quickly works too.
  2. system parachains make up maybe half of our usage -- At present, this looks likely because silly people want smart contracts on AssetHub. We could say "true" system parachains here, and exclude AssetHub & smart contract chains from that list, but this ratio simplifies things.

alexggh commented 1 week ago

I presume those priorities fall through, so if there are no disputes then the 100% is divided among the remaining categories?

Yes, we definitely need to implement it like that, otherwise we waste capacity.

If there are no disputes then we'd choose backing over approvals 6.7% of the time, which sounds reasonable. It's still possible this change results in more delayed backing votes, but that likely signals some problems elsewhere. Any thoughts @alexggh?

The way I see it, backing is more time sensitive than approving, but approving is more important than backing, so our prioritisation mechanism needs to reflect that. I don't think it is a problem to delay backing when we are overloaded, since we need to apply back-pressure to avoid creating more approval work. But it is an issue if we delay backing just because we have a steady stream of approvals and always prioritise approvals over backing, or if we add a significant delay to backing when we would actually be better served by doing the backing first and then all the other approvals (since we've got time for them).

I worry that allocating just 2% to backing might add extra delay in situations where we are not overloaded. My simple heuristic would be something like this: if the execution queue is under NEW_WORK_PER_BLOCK (MODULO_SAMPLES + 1), backing should be allocated first; if not, use this statistical prioritisation.
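A minimal sketch of that gate, with `MODULO_SAMPLES` as a stand-in constant rather than the value read from the runtime configuration:

```rust
// Stand-in for the approval-voting sampling parameter; the real value comes
// from the runtime configuration, not a hard-coded constant.
const MODULO_SAMPLES: usize = 6;
// Roughly how many parachain blocks we expect to execute/verify per relay chain block.
const NEW_WORK_PER_BLOCK: usize = MODULO_SAMPLES + 1;

/// If the execution queue is shorter than the work expected per relay chain
/// block, serve backing first; otherwise fall back to the statistical split.
fn backing_goes_first(execution_queue_len: usize) -> bool {
    execution_queue_len < NEW_WORK_PER_BLOCK
}

fn main() {
    assert!(backing_goes_first(3));
    assert!(!backing_goes_first(12));
}
```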

Bear in mind that we have at least 2 (maybe 4 in the future) parallel execution streams, so approval votes wouldn't be starved, because we should have no more than one block to back.

burdges commented 1 week ago

That 2% becomes 6.7% if there are no disputes.

What is NEW_WORK_PER_BLOCK? Yes, we could do backing first if the execution queue were under MODULO_SAMPLES+1, and no other back pressure trigger applied, but..

Bear in mind that we have at least 2 (maybe 4 in the future) parallel execution streams, so approval votes wouldn't be starved, because we should have no more than one block to back.

This makes sense only if we detect that the system needs back pressure, and then change the rules. All those backing votes create more work for the whole system.

I launched off into discussing detection above, but afaik the current discussion is really only about back pressure without explicit detection, aka only using disputes as the trigger in a simplistic way.

alexggh commented 1 week ago

What is NEW_WORK_PER_BLOCK?

It is how many parachain blocks we normally expect to execute/verify per relay chain block; it should be around MODULO_SAMPLES + 1.

I launched off into discussing detection above, but afaik the current discussion is really only about back pressure without explicit detection, aka only using disputes as the trigger in a simplistic way.

Yeah, simplistic is what I'm thinking as well, but this is not only for disputes. I'm more concerned about the situation where sharding breaks because of no-shows and nodes have to check a lot more candidates than the available CPU resources allow, so we need to slow down so we can catch up.

I'm not too opinionated about this, but I think any solution/split we end up with needs to make sure it doesn't affect the base case where the node is not overloaded.

burdges commented 1 week ago

Ain't clear "expect" is well defined there. We've the current queue length, which depends upon our assignments from parameters and no-shows, and the unknowable workload they represent.

I'll suggest roughly this logic:

If there exists a dispute that's pending, aka not yet being run, then we probabilistically run the dispute 70-80% of the time, or fall back to the non-dispute system 30-20% of the time.

In our non-dispute system..

Again it's possible this happy queue test messes things up, because now we do not back unless we're caught up on our workload from the previous relay chain block, so this'll require some care and revision. We could set HAPPY_QUEUE higher or count our own no-shows and/or others' no-shows here. We're looking at the queue first because we think it's more sensitive, but no-shows being less sensitive may actually be better, although simply setting HAPPY_QUEUE higher works too.

If you need a constant, then HAPPY_QUEUE = 12 or similar is an overestimate that'll make it less sensitive, and would let you debug the logic without plumbing in the MODULO_SAMPLES parameter. An advantage of starting with HAPPY_QUEUE larger than desired is that it'll be less disruptive, and we can figure out how much we like the system under bad network conditions first. As a further simplification, you can assume everything is a true system parachain for initial testing.
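A minimal sketch of that logic, under stated assumptions: the non-dispute branch (only partially spelled out above) is reduced to the happy-queue test, the 70-80% dispute probability is fixed at 75%, and all names, constants and the randomness source are illustrative rather than actual polkadot-sdk code.

```rust
// Overestimate that makes the happy-queue test less sensitive while debugging.
const HAPPY_QUEUE: usize = 12;

#[derive(Debug)]
enum Next {
    Dispute,
    Approval,
    Backing,
    Idle,
}

fn next_execution(
    dispute_pending: bool,
    approvals_queued: usize,
    backing_queued: usize,
    // A roll in 0..100, e.g. from a cheap PRNG; the randomness source is out of scope here.
    roll: u32,
) -> Next {
    // Pending disputes run probabilistically 75% of the time.
    if dispute_pending && roll < 75 {
        return Next::Dispute;
    }
    // Non-dispute system: only take backing work if we are caught up on the
    // approval workload from the previous relay chain block.
    if backing_queued > 0 && approvals_queued < HAPPY_QUEUE {
        return Next::Backing;
    }
    if approvals_queued > 0 {
        return Next::Approval;
    }
    if backing_queued > 0 {
        return Next::Backing;
    }
    if dispute_pending {
        // Nothing else to do, so run the dispute even if the roll fell through.
        return Next::Dispute;
    }
    Next::Idle
}

fn main() {
    // With a dispute pending and a roll under 75, the dispute runs first.
    println!("{:?}", next_execution(true, 20, 1, 10));
    // With no dispute and a short approval queue, backing goes first.
    println!("{:?}", next_execution(false, 3, 1, 10));
}
```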