eskimor opened this issue 1 year ago
I like this idea and also think that more batching is the way to go as it minimises the overhead. I will expand this some more soon
It makes sense to introduce a `CoreTime` primitive which represents the unit for measuring how much block space is available for a parachain at a given relay parent. On-demand parachains will need to specify the needed `CoreTime` when acquiring block space. The `CoreTime` is enforced in the backing phase - PVF execution is killed past the allowed time. I propose an enum for the fractions of a core - assuming 100ms slices we get something like 5 for half a core, 10 for a full core, or more than 10 when boosting. Also we should assume there is some extra overhead/underutilisation for a full core if we fraction it and then execute multiple PoVs in parallel rather than a single one.
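An illustrative sketch of such an enum, assuming 100ms slices (the exact variants and slice counts are my assumption, not a settled design):

```rust
// Hypothetical CoreTime enum, measured in 100ms slices as floated above.
// Half = 5 slices, Full = 10 slices, Boost = more than 10 (parallel PoVs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CoreTime {
    Half,      // 5 slices
    Full,      // 10 slices
    Boost(u8), // 10 + n slices, for boosted/parallel block production
}

impl CoreTime {
    fn slices(self) -> u8 {
        match self {
            CoreTime::Half => 5,
            CoreTime::Full => 10,
            CoreTime::Boost(extra) => 10 + extra,
        }
    }
}

fn main() {
    assert_eq!(CoreTime::Half.slices(), 5);
    assert_eq!(CoreTime::Full.slices(), 10);
    // Boosting always exceeds a full core's slice count.
    assert!(CoreTime::Boost(5).slices() > CoreTime::Full.slices());
}
```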
The parachains are assigned to cores. For fractions they can also be assigned to the same core, then grouped in a `CoreGroup` that can hold up to `MaxCoreTime` of entries. `MaxCoreTime` is a configuration parameter that defines the amount of execution bandwidth validators need to provision per relay chain block. `MaxCoreTime` should be BACKING_TIMEOUT multiplied by the number of PVF workers allowed in parallel. A good start would indeed be 4 PVF workers, which means 8s for sync backing and 16s for async backing (assuming a timeout of 4s for 2s of parachain block execution).
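A quick sanity check of those figures (the 2s sync timeout and 4s async timeout are assumptions read off the parenthetical above, not confirmed constants):

```rust
// MaxCoreTime = BACKING_TIMEOUT * number of parallel PVF workers.
use std::time::Duration;

fn max_core_time(backing_timeout: Duration, pvf_workers: u32) -> Duration {
    backing_timeout * pvf_workers
}

fn main() {
    let workers = 4;
    // Assumed 2s timeout per PVF without async backing -> 8s provisioned.
    assert_eq!(max_core_time(Duration::from_secs(2), workers), Duration::from_secs(8));
    // Assumed 4s timeout per PVF with async backing -> 16s provisioned.
    assert_eq!(max_core_time(Duration::from_secs(4), workers), Duration::from_secs(16));
}
```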
This is for sure useful for more exotic scheduling as described in Further Scaling Extensions above, but we can start with the current state, which is 1 full core only. Later we can add support for fractions and boost for parallel block production.
When dealing with large core groups it could be possible that a validator has more PVF work to do than is possible in a given time frame. Execution priority is disputes first, then approval voting, then backing last - this enforces faster finality while also slowing down parachain block production by backing fewer candidates in a group.
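A minimal sketch of that priority ordering, using a max-heap so disputes always run before approvals and backing (illustrative only, not the node's actual PVF scheduler):

```rust
// Derived Ord on an enum orders variants by declaration order, so listing
// Backing first makes it the lowest priority and Dispute the highest.
use std::collections::BinaryHeap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ExecutionPriority {
    Backing,        // lowest: shed first under load, fewer candidates get backed
    ApprovalVoting, // middle: keeps finality moving
    Dispute,        // highest: must resolve as fast as possible
}

fn main() {
    let mut queue = BinaryHeap::new();
    queue.push(ExecutionPriority::Backing);
    queue.push(ExecutionPriority::Dispute);
    queue.push(ExecutionPriority::ApprovalVoting);

    // The max-heap pops disputes before approvals before backing.
    assert_eq!(queue.pop(), Some(ExecutionPriority::Dispute));
    assert_eq!(queue.pop(), Some(ExecutionPriority::ApprovalVoting));
    assert_eq!(queue.pop(), Some(ExecutionPriority::Backing));
}
```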
We should do this of course. We should remain aware that workload variance per relay chain block increases somewhat by doing this, but maybe not too much, and total variance even less so.
We'll still merge paritytech/polkadot#6782 or some flavor of course. We'd ideally still sign approvals for multiple core groups ala paritytech/polkadot-sdk#732 too, but it becomes less valuable now.
Yeah, I also don't think https://github.com/paritytech/polkadot-sdk/issues/701 makes that much sense if we move to `CoreGroup`, but it still might if we cap the groups to a smaller `CoreTime`. Ideally we want to do more with fewer validators (maxing out on the current spec vs adding more validators).
We need to figure out a light algorithm for validators assigned to different cores to choose the subset(s) of possible core group candidates that they want to back, accounting for the fact that each parachain can produce as few as 0 candidates or an unbounded number of valid candidates.
I thought a bunch more about core groups, and I don't think it's quite the right solution, nor does it really align with the scheduling infrastructure proposed here https://github.com/polkadot-fellows/RFCs/pull/3 . That's still a pending RFC so I won't take it as a given, but I haven't seen any other solid proposals for dynamic scheduling.
To be clear, this SHOULD be an RFC, and I'm treating this as an initial discussion where nothing is set in stone.
The original issue stated these goals:
1. Fewer backing votes/statements.
2. Less pressure on availability - we have bigger candidates where chunking is more meaningful.
3. Smaller bitfields.
4. Fewer assignments and fewer approvals.
I would reframe this way
The main issue with core groups is that it hides a large amount of complexity in the validators of a backing group choosing which parachain candidates to back together. Any solution there would probably require running a consensus protocol among backers, and leaves out the possibility for candidates to be backed at different points in time, i.e. for backing to be a continuous and asynchronous process that feeds into the relay chain.
It also doesn't overlap nicely with the probabilistic scheduling approach in RFC-3, where backers don't know exactly which chains are scheduled on which cores at which times, but just roughly what could be scheduled. We are able to reuse backing statements from previous rounds to include stuff that didn't quite make it in.
We should apply this same approach to core groups: what we want to do is bundle many small state transitions onto a single core. We don't know exactly which chains they're going to be from, and we do want backing statements about them to be reusable in later relay-chain blocks.
I'm still in favor of doing some kind of bundling. Goals (1) and (2) are pretty obviously useful at this point. So here are all the things that stay the same as in core groups:
Here's what's different:
To be honest, the only thing we need for cores is to specify their limits:

```rust
// Per-relay-chain-block core specs
struct CoreResourceLimits {
    // Total across all candidates backed in the same relay-chain block
    total_execution_time: Duration,
    // Total across all candidates backed in the same relay-chain block
    total_data_size: u32,
    // The minimum amount of execution time a single state transition backed on this core can use.
    // State transitions on this core are always rounded up to the nearest multiple of this.
    execution_time_step: Duration,
    // The minimum amount of data a single state transition backed on this core can use.
    // State transitions on this core are always rounded up to the nearest multiple of this.
    data_size_step: u32,
    // maybe some overhead budgeting constants, i.e. each additional candidate has an expected overhead
}
```
If we update the `CandidateCommitments` to have two fields:

```rust
struct CandidateCommitments {
    // ...
    execution_time: ...,
    pov_size: ...,
}
```
then we get this really nice property that the number of candidates, as well as the distribution of resources across candidates going into a core, can vary from block to block. This is not just core groups, but dynamic market-based core groups, where we can regulate the number of candidates parachains produce with the `Region` market.
I'm thinking of modifying RFC-3 so that regions just specify the general resource consumption rate of a parachain and then chains get to decide how many blocks they make, how big they are, etc without being attached to any particular size and rate.
- A bunch of candidates get bundled together. They become available together, taking up 1 bit in the bitfield.
- Approval checkers assigned to the core recover the data for the whole bundle and validate all these candidates together.
Yeah, this sounds like the essence anyways. I like..
Interestingly, we might even solve the "weights are too aggressive" problem here, but without touching the weights, although this becomes fraught elsewhere. cc @Sophia-Gold
We do the availability bundling on the network layer, because (1) doesn't let us get an overarching erasure-root.
a. Each backer will have only backed a subset of the candidates, but should answer network requests with chunks for all the candidates they've backed as a batch
b. Signing a bit in the bitfield means you have all your chunks from all the candidates in the bundle
Yeah this seems fine. It's possible to censor blocks you never backed, but so what.
Approval checkers / dispute participants recovering from chunks need to do N erasure-decodings/recodings while validating candidates. In this respect, each candidate still carries a constant cost, but each erasure-coding will have 1/N the size of the aggregate.
Importantly, we fetch exactly the same availability chunk indices for each parablock in the availability core, so we only do the setup phase for one index set. It's thus no different from doing one big decoding, except maybe a couple cache misses.
As an aside, we do not technically need availability and approvals cores to be the same either, but not doing so adds complexity, performs slightly worse if you must rerun the setup phase, and afaik yields no benefits. It'd look like this:
Yet reconstruction now fetches chunks with three optimization preference levels:
In essence, your proposal does the grouping in 2, but you could do the grouping in 4, if you accept having bigger bitfields and this nasty reconstruction-optimization-preference-levels mess. I doubt this improves anything.
> It's possible to censor blocks you never backed, but so what.
It's possible to delay availability on blocks you never backed, but if you never backed them they've still been backed by enough other validators in the same group. Seems just as censorable as the current system - you know who backed it when it goes on-chain and can DoS them. It's faster as a validator to get your chunks in a single request but as an example:
> Importantly, we fetch exactly the same availability chunk indices for each parablock in the availability core, so we only do the setup phase for one index set. It's thus no different from doing one big decoding, except maybe a couple cache misses.
OK, this is good info. Yes - glad this amortizes nicely, though we would need some new APIs to do that in practice and it's not a blocker for an initial implementation.
> your proposal does the grouping in 2, but you could do the grouping in 4, if you accept having bigger bitfields, and this nasty reconstruction optimization preference levels mess. I doubt this improves anything.
I also considered doing the grouping in 4 but it seems like it's always best to do the grouping as early as possible and then move it later when that presents challenges.
> As an aside, we do not technically need availability and approvals cores to be the same either
Yes, and in RFC-3 I've proposed introducing some "hypercores" for making scheduling probabilistic (i.e. relay chain block authors can temporarily "widen" cores and bitfields up to some per-block limit to allow making up for missed opportunities to post stuff on-chain)
As another aside, it's worth noting that RFC-5 just specifies the proportion of resources each parachain should have on the core. That interface would work for approaches where we have one block per core per relay-chain block, but could also adapt to core-sharing under the hood. e.g. if a parachain wants to, they could just build 2x as many blocks, half the size, and some of them will go in earlier, some of them will go in later. Obviously this comes with some additional cumulative PoV usage, so it's not something you can do for free, but still may be useful.
RFC-3 also now doesn't phrase things in terms of "blocks" but rather core resource consumption rates, so it should be totally forward-compatible with these changes
Assume backing groups of 4 or 5 validators with a backing threshold of 2. If X is backed by 1 & 2 and Y is backed by 3 & 4, but not 1 & 2, then X, as part of the bundle X+Y, cannot become available if 3 & 4 withhold chunks in Y.
Imho, this is not really an immediate problem, and these backing groups sound too large anyways, but it'll maybe bother @AlistairStewart or sr labs or complicate billing. I suppose grouping in inclusion fixes this, but likely not worth those headaches.
We could have logic that mitigates this under your grouping-in-backing model too. It's easiest if backing is 2 out of 3, but there are other options:
Example 1. If you make the relay chain block after availability fails then you could retry with the chunks you received, but not with the chunks you did not receive. This is bad in Aurababe, since you're not anonymous 3/4ths of the time, but you're anonymous in Sassafras, so there it'll work reasonably well.
Example 2. If availability is really slow for some availability core, then we dynamically add more fine-grained votes to the bitfield, before abandoning the whole availability run.
As a bike shed, I'd think names like "hypercores" or even "core groups" confuse people, so I'd suggest something more inspired by the protocol.
Also..
If cores contain multiple blocks then we'll likely validate them in parallel. We discussed flavors of SPREE being truly a separate block, but this sounded excessive to me, and is not how Basti envisions SPREE. We might anyways have truly separate-ish blocks from the same parachain being validated in parallel here, so one could imagine multiple blocks from the same parachain having merged PoV witnesses.
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/altering-polkadots-fork-choice-to-reduce-da-load/3389/13
Any idea how hard @rphmeier's flavor of core groups is going to be?
> Any idea how hard @rphmeier's flavor of core groups is going to be?
Essentially this requires changes across the entire stack, so it will not be easy at all. I'm going to start working on a PoC after I have a very good understanding of the details.
Ahh, I suppose the initial idea changed little after backing, plus whatever cumulus does, but the improved version by Rob demands deep changes in availability too.
I've drafted a PoC plan based on the discussion here. Nothing here is set in stone, so feel free to comment. I expect the PoC to be very well documented and incrementally built. @eskimor @rphmeier @burdges If this sounds good, I will start to create individual tickets for the work.
Minimizing the constant overheads in approval checking and availability recovery. Compatibility with RFC-1, RFC-3, RFC-5.
- backers can be assigned to more than 1 core, bounded by a configuration value. This allows to still have a backing group of 5 when `n_validators / n_cores` is less than 5.
- Runtime computes a bounded set of `ExecutionGroup` based on the backed candidates in the inherent data.
- A configuration variable N will specify the maximum size of a group.
- availability distribution starts up to N chunk fetches per `ExecutionGroup` from shuffled per-candidate backers. A bit of 1 now means availability of all candidates in an `ExecutionGroup`.
- Approval checkers get assigned to `ExecutionGroupIndex` instead of `CoreIndex`.
- An opposite vote on a candidate from an `ExecutionGroup` will trigger the check of that individual candidate, not the whole group.
- `max_candidates_per_execution_core: u32`
- `max_cores_per_backer: u32` // maximum number of backing groups the validator can be part of
- 6s parablocks
- 1000 paravalidators (`n_validators`)
- 300 `n_cores`
- 6s block times
- 30 `needed_approvals`
- 4 `max_candidates_per_execution_core`
- core groups: 75 (`n_cores / max_candidates_per_execution_core`)
- total assignments + approvals: ~4500 (`2 * needed_approvals * core_groups`), i.e. ~4.5 per validator (`/ n_validators`)
- validations per validator per relay chain block = 2.5 * `max_candidates_per_execution_core` = 10
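A quick back-of-envelope check of these target numbers (values taken from the list above; the ~2.5 approvals-per-validator figure appears to be a rounding of 2.25):

```rust
// Sanity-check the PoC target numbers quoted in the thread.
fn main() {
    let n_validators = 1000u32;
    let n_cores = 300u32;
    let needed_approvals = 30u32;
    let max_candidates_per_execution_core = 4u32;

    // Grouping 4 cores' worth of candidates shrinks the number of approval units.
    let core_groups = n_cores / max_candidates_per_execution_core;
    assert_eq!(core_groups, 75);

    // Roughly one assignment plus one approval per needed approval per group.
    let total_assignments_and_approvals = 2 * needed_approvals * core_groups;
    assert_eq!(total_assignments_and_approvals, 4500);

    // Approvals spread uniformly over validators; each covers a whole group.
    let approvals_per_validator =
        (needed_approvals * core_groups) as f64 / n_validators as f64;
    assert!((approvals_per_validator - 2.25).abs() < 1e-9);

    // ~2.5 approvals, each validating up to 4 candidates -> ~10 validations.
    let validations_per_validator = 2.5 * max_candidates_per_execution_core as f64;
    assert_eq!(validations_per_validator, 10.0);
}
```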
From a bird's-eye view it becomes a necessity to bump CPUs from 8 to 10 or more to accommodate this scale and be able to absorb any surge in block space utilization (longer PVF runs).
The testing will use a mix of PVF execution times and PoV sizes using Glutton, which should amount to a target of 50% utilization of block space and time. We can easily adjust the configuration of the Gluttons to increase/decrease that target.
At least one new API will be needed to expose the `ExecutionGroup` mappings, plus `create_inherent` and `enter` changes for creating the `ExecutionGroups` on each relay chain block.
`CandidateCommitments` could include additional fields like `execution_time` and `pov_size` which can be used as criteria when the runtime creates the `ExecutionGroups` mappings.
As we fetch the same availability chunk indices for each candidate in the `ExecutionGroup`, we only do the setup phase for one chunk index set. CPU usage for doing N such reconstructs should be roughly the same as doing a single reconstruct.
We’d still prefer to fetch the PoV from the backing group up to some limit (currently 128KB), but if that fails we’ll do a normal chunk recovery. There are additional optimizations left on the table, which can be implemented later, such as:
- recovering all candidates in an `ExecutionGroup` together (setup phase can be reused)

For the PoC it would be best if we could reuse `CandidateIndex` and `CoreIndex` from assignment certificates and make them point to `ExecutionGroup` indices.
The target numbers assume the worst-case scenario where all cores in a group use 100% of their maximum execution time, which means we need to bump PVF workers to 5 - 50% of total node CPU bandwidth. The other 50% must be enough for the other two top CPU consumers, networking and availability recovery.
The erasure recovery and network optimizations could possibly give back 1 CPU that can be used in validation.
Important: it's relay chain block producers who allocate candidates to cores, but it's still backers on the hook for the execution time. We thus need backers to place the resource consumption, in time and memory, into their backing statements. In principle, the backers' statements about memory usage determine if nodes are running parachain blocks in parallel.
> We thus need backers to place the resource consumption, in time and memory, into their backing statements.
I imagined this could go in the candidate receipt, where collators actually determine it and backers just check if the resource requirements hold. PoV size should be committed to in the candidate receipt anyway, for a bunch of reasons, but committing to execution time accurately is probably a bit harder so it'd tend to be an overestimate.
> backers can be assigned to more than 1 core, bounded by a configuration value. This allows to still have a backing group of 5 when `n_validators / n_cores` is less than 5.
I was imagining we'd still keep it as a single "core", but just have multiple things go onto it. If we stick with 1 core per candidate, it'll lead to the number of cores per relay-chain block fluctuating based on what gets posted. That seems odd. My mental model is based around regions, so I'm just imagining that backers work on all the regions that are assigned to the core, however many of them there may be. For the PoC we have to do something different, so maybe that's what this is. But we could achieve the same thing by just having the `availability_cores` runtime API give a list of scheduled para-ids, not a single one.
> - Runtime computes a bounded set of `ExecutionGroup` based on the backed candidates in the inherent data
> - A configuration variable N will specify the maximum size of a group
Wouldn't we rather have the resource constraints specified on a per-core rather than a per-group basis, as suggested above? I don't see why we need to compute anything or what the `ExecutionGroup` struct would even look like. We just need to make sure that the `Vec<Candidate>` posted passes some predicate `fn resource_pred(TotalCoreLimits, &[Candidate]) -> bool`. That predicate would take into account things like the number of candidates, minimum resource usage per candidate, maximum across all, etc.
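A hedged sketch of what such a predicate could look like; `TotalCoreLimits` and `Candidate` are stand-in types reusing the `CoreResourceLimits` fields from earlier in the thread, not real polkadot-sdk definitions:

```rust
// Illustrative resource predicate: accept a bundle only if, after rounding
// each candidate up to the core's resource steps, totals stay within limits.
use std::time::Duration;

struct TotalCoreLimits {
    total_execution_time: Duration,
    total_data_size: u32,
    execution_time_step: Duration,
    data_size_step: u32,
}

struct Candidate {
    execution_time: Duration,
    pov_size: u32,
}

fn resource_pred(limits: &TotalCoreLimits, candidates: &[Candidate]) -> bool {
    // Round `v` up to the nearest multiple of `step` (no-op if step is 0).
    let round_up = |v: u128, step: u128| if step == 0 { v } else { v.div_ceil(step) * step };

    let total_time: u128 = candidates
        .iter()
        .map(|c| round_up(c.execution_time.as_millis(), limits.execution_time_step.as_millis()))
        .sum();
    let total_size: u128 = candidates
        .iter()
        .map(|c| round_up(c.pov_size as u128, limits.data_size_step as u128))
        .sum();

    total_time <= limits.total_execution_time.as_millis()
        && total_size <= limits.total_data_size as u128
}

fn main() {
    let limits = TotalCoreLimits {
        total_execution_time: Duration::from_secs(2),
        total_data_size: 5 * 1024 * 1024,
        execution_time_step: Duration::from_millis(100),
        data_size_step: 1024,
    };
    // Two 950ms candidates round up to 1000ms each: exactly at the 2s cap.
    let ok = [
        Candidate { execution_time: Duration::from_millis(950), pov_size: 1_000_000 },
        Candidate { execution_time: Duration::from_millis(950), pov_size: 1_000_000 },
    ];
    assert!(resource_pred(&limits, &ok));
    // 1050ms rounds up to 1100ms: 1100 + 1000 exceeds the 2s cap.
    let too_big = [
        Candidate { execution_time: Duration::from_millis(1050), pov_size: 0 },
        Candidate { execution_time: Duration::from_millis(1000), pov_size: 0 },
    ];
    assert!(!resource_pred(&limits, &too_big));
}
```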
Or is an `ExecutionGroup` essentially an "approval core"?
> availability distribution starts up to N chunk fetches per `ExecutionGroup` from shuffled per-candidate backers. A bit of 1 now means availability of all candidates in an `ExecutionGroup`
Yeah, this sounds right
> Approval checkers get assigned to `ExecutionGroupIndex` instead of `CoreIndex`.
We could still call these cores. Not sure we need a new term, just that we need to change the semantic meaning of a core. Everything else looks right here.
> An opposite vote on a candidate from an `ExecutionGroup` will trigger the check of that individual candidate, not the whole group.
Yep
> `max_candidates_per_execution_core: u32`, `max_cores_per_backer: u32` // maximum number of backing groups the validator can be part of
We should be thinking in regions, not cores. e.g. we should support including multiple candidates from the same process in a single bundle. `max_candidates` is one variable, but ideally we're accommodating many candidates of different sizes, so we need something more detailed to describe resource limitations.
I'm not sure why a backer would need to be assigned to many cores at once. At least in light of RFC-3, we will already have many processes multiplexed on the same core, so a single backing group can work on all those regions at once anyway.
Maybe fine for a PoC, but I expect this'd have to change a lot for production.
> The target numbers assume the worst-case scenario where all cores in a group use 100% of their maximum execution time, which means we need to bump PVF workers to 5 - 50% of total node CPU bandwidth.
Shouldn't it be the case that the sum of all candidates on a single core use e.g. 2 seconds of execution time or whatever the limit per core is? I suppose we could push higher when taking parallelism into account. However, at that point we are starting to break compatibility with RFC-1/3 as we assume that 1 parachain is even capable of occupying 100% of a core. Of course, with block authoring/import parallelism on the Cumulus side we can probably achieve this.
However I'd still push for the approach where a single "core" can just handle larger amounts of throughput per-block, but we expect that many chains are simultaneously multiplexed across them. e.g. every core can handle 8s of execution per 6s, but any individual state transition put on that core is at most 2s of execution. This'd let us lean into the amortization without introducing any new concepts, except for the resource allocation schedule of the core.
My assumption here was that we do not want to change backing or implement parts of RFC-3. But yeah, the PoC should still in principle be as close as possible to production. Something like this should be workable:
- The block author computes the `ExecutionGroups` based on execution time and PoV sizes in `CandidateCommitments` and includes them in the relay chain block.
- The runtime recomputes the `ExecutionGroups` (slower), compares them against the included `ExecutionGroups`, and panics if the grouping fails any of the criteria (duplicates, resource limits breached, not all candidates included, not efficient grouping, etc.)

> Shouldn't it be the case that the sum of all candidates on a single core use e.g. 2 seconds of execution time or whatever the limit per core is? I suppose we could push higher when taking parallelism into account. However, at that point we are starting to break compatibility with RFC-1/3 as we assume that 1 parachain is even capable of occupying 100% of a core. Of course, with block authoring/import parallelism on the Cumulus side we can probably achieve this.
Why should a parachain using 100% of the core break RFC-1/3? That would basically be what we have now, and it means it will be the only thing in an `ExecutionGroup`.
> However I'd still push for the approach where a single "core" can just handle larger amounts of throughput per-block, but we expect that many chains are simultaneously multiplexed across them. e.g. every core can handle 8s of execution per 6s, but any individual state transition put on that core is at most 2s of execution. This'd let us lean into the amortization without introducing any new concepts, except for the resource allocation schedule of the core.
Yes, that is doable, but I would like to keep it simpler. We can just consider 1 PVF worker per core per relay chain block -> so we're limited by 6s of execution. But these `ExecutionGroups` also stack with the v2 assignments and coalescing of approvals, so we can get some more parallelism and optimization going on at that level, and in practice we do end up getting more than 6s of execution per relay chain block. We just have to be careful to not overcommit on the PVF workers. But even if it happens occasionally on some nodes, it is easy to prioritize executions at the PVF worker level in this order: disputes, approvals, backing - if a node has too much PVF work to do, it will simply back fewer candidates.
Can you explain the struct `ExecutionGroup`? I still don't understand it.
I am just thinking that a `Core` can have a per-block workload which is comprised of state transitions from many processes. This writeup seems to assume that a core only has one process on it per block, so we need this `ExecutionGroup` struct to put them together.
Is it just `type ExecutionGroup = Vec<CommittedCandidateReceipt>`, where they are all grouped together on the same core?
> Why should a parachain using 100% of the core break RFC-1/3?
It doesn't, but it sounds like you're talking about cores (or groups?) which can process more than 2s of execution time. Once we pass a certain threshold it's pretty hard for a parachain to keep up, which I think is OK.
Yes, it's `Vec<CommittedCandidateReceipt>`, and yes, it's just a bunch of candidates whose execution sums up to at most 2s.
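Under that definition, the block author's grouping step could be sketched as a greedy bin-packing over execution time (illustrative only; `CommittedCandidateReceipt` here is a stand-in with just the one field needed, and the 2s cap is the figure from this thread):

```rust
// Greedily bundle backed candidates into ExecutionGroups whose summed
// execution time stays within a per-group cap.
use std::time::Duration;

#[derive(Clone)]
struct CommittedCandidateReceipt {
    execution_time: Duration,
}

type ExecutionGroup = Vec<CommittedCandidateReceipt>;

fn group_candidates(candidates: &[CommittedCandidateReceipt], cap: Duration) -> Vec<ExecutionGroup> {
    let mut groups: Vec<ExecutionGroup> = Vec::new();
    let mut current: ExecutionGroup = Vec::new();
    let mut used = Duration::ZERO;
    for c in candidates {
        // Start a new group once adding this candidate would exceed the cap.
        if !current.is_empty() && used + c.execution_time > cap {
            groups.push(std::mem::take(&mut current));
            used = Duration::ZERO;
        }
        used += c.execution_time;
        current.push(c.clone());
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}

fn main() {
    let ms = |m: u64| CommittedCandidateReceipt { execution_time: Duration::from_millis(m) };
    let candidates = [ms(500), ms(700), ms(900), ms(1900)];
    let groups = group_candidates(&candidates, Duration::from_secs(2));
    // 500ms + 700ms fit together; 900ms and 1900ms each need their own group.
    assert_eq!(groups.len(), 3);
    assert_eq!(groups[0].len(), 2);
}
```

A real implementation would also have to respect PoV size and the per-candidate minimum steps, but the shape is the same.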
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/1
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/3
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/8
The idea evolved to push the bundling in the other direction: to builders/collators. Candidate Descriptor changes are already being worked on in the course of #1829.
Although less dynamic, that's less relay chain complexity, probably the right choice.
Backers are no longer responsible for cores, but for core groups. A core group consists of e.g. 4 cores. The backers will then try to back 4 cores at once and bundle them up into a single bundle they vote on. There will be a single backing vote per backer per bundle. If a PoV turns out invalid the backer will simply not put it into the bundle.
Cores will then only matter for scheduling and for backers in the sense that they are bundling PoVs up. Likely backers will communicate about individual PoVs with each other to build up those bundles efficiently.
From then on we have statements per bundle, bitfields where bits represent those bundles, and approvals/assignments that operate on bundles. Basically with a bundle size of 2, we should be as good with 6s block times as we were with 12s; with a bundle size of 4 we are even twice as good and should be able to support twice as many parachains as we can support now. This optimization does optimize the whole stack:
The nice thing is, we can scale bundles with machines becoming more powerful - we are no longer bottlenecking on network and constant overhead.
Roadmap:
Further Scaling Extensions
We can have big and little cores. A big core actually represents a physical core, meaning PoVs of two big cores will be executed in parallel. But we can also have smaller cores, with half the execution time for example. Then groups would have cores of different sizes. E.g. one core group could have 4 big/full cores, another could have 8 little cores (each offering half the execution time). Parachains can order big/little/tiny cores as they see fit.
Path for a parachain: A parachain with little blockspace demand can take a small core, if demand grows it can scale up to a big core, if that still is not sufficient it can use boost, taking two cores for parallel execution of two PoVs at once.
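To illustrate the big/little idea, a group of big cores and a group of little cores can provision the same total execution bandwidth (the 2s/1s sizes here are assumptions, not settled numbers):

```rust
// Sketch of mixed-size core groups: little cores offer half the execution
// time of a big (physical) core.
use std::time::Duration;

#[derive(Clone, Copy)]
enum CoreSize {
    Little, // half the execution time
    Big,    // a full physical core
}

impl CoreSize {
    fn execution_time(self) -> Duration {
        match self {
            CoreSize::Big => Duration::from_secs(2),
            CoreSize::Little => Duration::from_secs(1),
        }
    }
}

// Total execution bandwidth provisioned by a core group.
fn group_budget(cores: &[CoreSize]) -> Duration {
    cores.iter().map(|c| c.execution_time()).sum()
}

fn main() {
    // 4 big cores and 8 little cores provision the same total bandwidth,
    // matching the example group compositions above.
    let big_group = [CoreSize::Big; 4];
    let little_group = [CoreSize::Little; 8];
    assert_eq!(group_budget(&big_group), group_budget(&little_group));
}
```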