polkadot-fellows / RFCs

Proposals for change to standards administered by the Fellowship.
https://polkadot-fellows.github.io/RFCs/
Creative Commons Zero v1.0 Universal

Coretime Scheduling Regions #3

Closed rphmeier closed 11 months ago

rphmeier commented 1 year ago

This introduces a new scheduling primitive for the relay-chain known as Coretime Scheduling. It is intended to be a companion to RFC-1 but can function with any higher-level logic for allocating cores.

This is intended to be a general solution designed with some of the following features in mind:

There may be other use-cases we don't know of yet, but I'm fairly confident that this solution is general enough to accommodate pretty much anything we throw at it.

gavofyork commented 1 year ago

Would it be reasonable to describe this as taking a Parachain-oriented, rather than Core-oriented, approach? I.e. in that it doesn't describe what each core should be doing, but rather which parachains should be validated and at what frequency divisor.

Is this a necessary condition of the current implementation?

rphmeier commented 1 year ago

Would it be reasonable to describe this as taking a Parachain-oriented, rather than Core-oriented, approach?

The main reason for regions being anchored to specific cores is so validators assigned to a core have an indication of which parachain candidates they should or shouldn't validate. If regions didn't hold any core information, we'd either have to generate the mapping between regions and cores in the runtime (requiring low runtime scheduling overhead is Goal (1)) or validators could potentially work on any and all candidates, and wouldn't have much idea of what their group-mates were doing.

In the latter case, we'd also need all validators to coordinate off-chain about which parachains which groups intend to work on. I'm not aware of any low-overhead solutions for doing that.

I wouldn't rule out the possibility of removing core indices from regions, but it'd require fairly significant rewrites of the networking protocols. Maybe something to iterate towards, but we are constrained by fairly deep implementation assumptions.

but rather which parachains should be validated and at what frequency divisor.

Actually this approach does exactly that - it just goes a little further and constrains which core a parachain can be validated on. But note that parachains can have multiple regions corresponding to many different cores, and their effective rate is the sum of all their regions' rates. And cores can have an arbitrary number of regions assigned to them, as long as the block production rate never exceeds 1 across all of them.
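To make the rate arithmetic above concrete, here is a minimal sketch. The struct and function names are hypothetical; the fixed-denominator representation is assumed from the pseudocode later in this thread. A para's effective rate is the sum of its regions' rates across cores, while a core is validly scheduled only if its regions' rates sum to at most 1.

```rust
// Rates are fractions over a fixed denominator (value assumed from the
// `PartOf57600` type used later in this thread).
const RATE_DENOMINATOR: u32 = 57_600;

#[derive(Clone, Copy)]
struct Region {
    core: u32,
    para: u32,
    // Block production rate = rate_numerator / RATE_DENOMINATOR.
    rate_numerator: u32,
}

// A para's effective rate: the sum of its regions' rates, across all cores.
fn effective_rate(regions: &[Region], para: u32) -> u32 {
    regions.iter().filter(|r| r.para == para).map(|r| r.rate_numerator).sum()
}

// A core is validly scheduled if its regions' rates sum to at most 1.
fn core_within_cap(regions: &[Region], core: u32) -> bool {
    regions
        .iter()
        .filter(|r| r.core == core)
        .map(|r| r.rate_numerator)
        .sum::<u32>()
        <= RATE_DENOMINATOR
}

fn main() {
    // Para 1000 holds a half-rate region on each of cores 0 and 1, and
    // shares core 0 with para 1001.
    let regions = [
        Region { core: 0, para: 1000, rate_numerator: RATE_DENOMINATOR / 2 },
        Region { core: 1, para: 1000, rate_numerator: RATE_DENOMINATOR / 2 },
        Region { core: 0, para: 1001, rate_numerator: RATE_DENOMINATOR / 2 },
    ];
    // Effective rate of para 1000: one block per relay-chain block on average.
    assert_eq!(effective_rate(&regions, 1000), RATE_DENOMINATOR);
    // Core 0's two half-rate regions exactly fill the core.
    assert!(core_within_cap(&regions, 0));
}
```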

On the actual validator implementation side, I believe the code will be much more parachain-focused than core-focused. Cores only come into question for the relay-chain block author, when selecting parachain blocks on regions that are ready, just to make sure that it's not posting more than the allowed amount of parachain blocks per core. The rest happens asynchronously.

gavofyork commented 1 year ago

Could you explain why there's a need to introduce an anonymous identifier (RegionId) and correspondingly indexing scheduling information in terms of this identifier rather than simply indexing things in terms of a low-level period and the core index? This necessitates the maintenance of a reverse lookup and in general anonymous IDs are a bit horrible.

If PoVs were offered based on a CoreId rather than RegionId wouldn't this provide adequate information to ensure Parachains were not over-scheduled?

If we distill things a little (assume maximum == 0 and avoid the indirection of RegionSchema) we end up with RegionId mapping to a Para assignee, the core on which it will run, the blocks it should be taking on that core (start, duration and stride) and a variable to track its usage so far count. Here, RegionId is not in any way meaningful - it's an anonymous key, instances of which must be tracked in some way to be able to place one in a PoV. This anonymous identifier is exposed all the way from the high-level brokerage system chain down to the PoV of the collator. Yet for any given (start, duration, core) triplet, I would expect there to be only one ParaId, which would seem to indicate that a self-describing key including the core index and a period of time would do just as well.

rphmeier commented 1 year ago

Could you explain why there's a need to introduce an anonymous identifier (RegionId)

If regions are ever made mutable, i.e. are possible to split by duration or decompose by frequency, then having stable IDs could be important on the node-side. It's less necessary in an immutable paradigm, but I'd bet on needing some limited amount of mutability in the future.

The only property we really need from Region IDs is uniqueness (within a relay-chain). I don't think the format of it changes the implementation much, although self-describing does help reduce storage proof overheads on paras. If region IDs are passed around on the network, nodes still need to check with the runtime APIs that the region exists and which para it is assigned to.

This necessitates the maintenance of a reverse lookup and in general anonymous IDs are a bit horrible.

Could you elaborate? The first reverse lookup is ParaId, RegionId => () which we need to maintain for collators to quickly index and prove the regions belonging to their parachain. That'd be the same regardless of the region ID format. With an anonymous region ID we do have to look up the full region in the state proof as well, but I think we'd need to do that anyway, since the ID wouldn't be fully self-describing.

The second reverse lookup is RegionId => Region which we need on validators to check the state of regions over time and on network events. All the information in the Region structure is needed on the node-side, so what RegionId is doesn't matter much here as long as it's unique.
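The two lookups described above can be sketched as a pair of maps over the same relation; all names here are hypothetical, and plain BTreeMaps stand in for runtime storage. The point is that the (ParaId, RegionId) => () map lets a collator enumerate its own regions with a prefix scan, independent of what format RegionId takes.

```rust
use std::collections::BTreeMap;

type ParaId = u32;
type RegionId = u128; // anonymous, unique within the relay chain

// Illustrative subset of the Region structure discussed above.
#[derive(Clone, Debug)]
struct Region {
    core: u32,
    assignee: ParaId,
    start: u32,
    end: u32,
    rate_numerator: u32,
}

#[derive(Default)]
struct Scheduler {
    // RegionId => Region: used by validators to track region state over time.
    regions: BTreeMap<RegionId, Region>,
    // (ParaId, RegionId) => (): lets collators quickly index and prove
    // the regions belonging to their parachain.
    by_para: BTreeMap<(ParaId, RegionId), ()>,
}

impl Scheduler {
    fn insert(&mut self, id: RegionId, region: Region) {
        self.by_para.insert((region.assignee, id), ());
        self.regions.insert(id, region);
    }

    // All regions assigned to a para, found by a prefix range scan
    // rather than by iterating every core.
    fn regions_of(&self, para: ParaId) -> Vec<RegionId> {
        self.by_para
            .range((para, 0)..=(para, RegionId::MAX))
            .map(|((_, id), _)| *id)
            .collect()
    }
}

fn main() {
    let mut s = Scheduler::default();
    s.insert(1, Region { core: 0, assignee: 1000, start: 0, end: 100, rate_numerator: 28_800 });
    s.insert(2, Region { core: 1, assignee: 1000, start: 0, end: 100, rate_numerator: 28_800 });
    s.insert(3, Region { core: 0, assignee: 1001, start: 0, end: 100, rate_numerator: 28_800 });
    // A collator for para 1000 finds both of its regions, across cores.
    assert_eq!(s.regions_of(1000), vec![1, 2]);
    // Validators resolve any region's full data by its ID.
    assert_eq!(s.regions.get(&3).map(|r| r.core), Some(0));
}
```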

Yet for any given (start, duration, core) triplet, I would expect there to be only one ParaId

Regions' data or schemas aren't guaranteed to be unique at all. In fact, one ParaId could own multiple identical regions. Core-sharing between two chains, for instance, is easily implemented as two regions, with the same (start, duration, core) triplet and a rate_numerator of RATE_DENOMINATOR/2 (i.e. block production rate of 1/2). In this proposal we don't attempt to fix the specific relay-chain blocks that parachains get to post blocks within, just the average rate and maximum over a period of time.

gavofyork commented 1 year ago

If regions are ever made mutable, i.e. are possible to split by duration or decompose by frequency, then having stable IDs could be important on the node-side. It's less necessary in an immutable paradigm, but I'd bet on needing some limited amount of mutability in the future.

Surely "mutability" in this context would be better termed "reversion of allocation" and I don't see why we would ever want to support it. You can trade and move the bulk Coretime you buy but once you commit to an allocation of that Coretime, I don't see why would want to add substantial complexity onto not only the Relay-chain but also the Broker chain to be able to revert that commitment.

gavofyork commented 1 year ago

Regarding "reverse lookup" I was referring to there being two distinct maps providing perspectives into the same underlying relation. In this case RegionId is mapped to a struct including an assignee, but there is also a map mapping assignee to RegionId (for iteration).

gavofyork commented 1 year ago

Regions' data or schemas aren't guaranteed to be unique at all. In fact, one ParaId could own multiple identical regions. Core-sharing between two chains, for instance, is easily implemented as two regions, with the same (start, duration, core) triplet and a rate_numerator of RATE_DENOMINATOR/2 (i.e. block production rate of 1/2). In this proposal we don't attempt to fix the specific relay-chain blocks that parachains get to post blocks within, just the average rate and maximum over a period of time.

What I said does not exclude this.

For example, if between times T and T', both cores 1 and 2 were to be shared equally between two paras 1000 and 1001, this could equally be expressed by two schemas:

(Core(1), Timeslice(T)) -> vec[1000, 1001],
(Core(2), Timeslice(T)) -> vec[1000, 1001],
(Core(1), Timeslice(T')) -> vec[],
(Core(2), Timeslice(T')) -> vec[],

and

AnonymousKey1 -> (Core(1), Para(1000), T, T', 1/2)
AnonymousKey2 -> (Core(1), Para(1001), T, T', 1/2)
AnonymousKey3 -> (Core(2), Para(1000), T, T', 1/2)
AnonymousKey4 -> (Core(2), Para(1001), T, T', 1/2)

1000 -> AnonymousKey1 -> ()
1000 -> AnonymousKey3 -> ()
1001 -> AnonymousKey2 -> ()
1001 -> AnonymousKey4 -> ()

In the top schema, the keys are distinct even though the region usage is the same for two different cores. You don't need to introduce the AnonymousKeys (in the second schema). Obviously without AnonymousKey the whole thing is rather a lot simpler.
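For illustration, the first (self-describing) schema can be held as an ordered map keyed by (CoreId, Timeslice), with lookups finding the latest entry at or before the query point. All names here are hypothetical; this is only a sketch of the schema as written above, not of any actual runtime storage.

```rust
use std::collections::BTreeMap;

type CoreId = u32;
type Timeslice = u32;
type ParaId = u32;

fn main() {
    let mut schedule: BTreeMap<(CoreId, Timeslice), Vec<ParaId>> = BTreeMap::new();
    let (t, t_prime) = (100u32, 200u32);

    // Between T and T', cores 1 and 2 are shared equally by paras 1000 and 1001.
    schedule.insert((1, t), vec![1000, 1001]);
    schedule.insert((2, t), vec![1000, 1001]);
    // At T', the assignments end.
    schedule.insert((1, t_prime), vec![]);
    schedule.insert((2, t_prime), vec![]);

    // The entry in effect for core 1 at timeslice 150 is the one with the
    // greatest timeslice not after the query point; no anonymous key needed.
    let in_effect = schedule
        .range((1, 0)..=(1, 150))
        .next_back()
        .map(|(_, paras)| paras.clone());
    assert_eq!(in_effect, Some(vec![1000, 1001]));
}
```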

gavofyork commented 1 year ago

Low-latency instantaneous Coretime scheduling isn't especially well expressible in the timeslicing schema, since the timeslices are assumed to be of the order of 100 blocks. However, I don't think the Broker chain could handle any low-latency scheduling, regardless of the schema since at high usage it would be too much information to be sensibly aggregating and passing through UMP. Instantaneous Coretime is a fundamentally different beast to Bulk Coretime - it cannot be turned into an NFT, split or composed. Because of this, it's surely best to handle directly on the Relay-chain.

rphmeier commented 1 year ago

As an aside, breaking things down into timeslices isn't something I see as a requirement on the relay-chain. They are useful for higher-level logic as a simplification and packaging of coretime, but the relay-chain is more flexible when able to handle things at block-level granularity.

For the bulk of this proposal, though, breaking things down by timeslice is not a significant factor, and I'd be open to changing things on that basis; note that timeslices are only unique within periods of 28 * DAYS.

For example, if between times T and T', both cores 1 and 2 were to be shared equally between two paras 1000 and 1001, this could equally be expressed by two schemas: (Core(1), Timeslice(T)) -> vec[1000, 1001], (Core(2), Timeslice(T)) -> vec[1000, 1001],

This mapping does not scale very well as-is, in the sense that it would require one of:

a) relay-chain runtime logic for creating/cleaning up these storage entries for upcoming and expired timeslices - this is counter to goal (1) of this RFC, which is to push scheduling overhead onto nodes and keep it out of the runtime
b) tens of thousands of simultaneous storage items on the relay chain to be created eagerly
c) constant blitting of upcoming timeslice information by the broker chain onto the relay chain

The regions solution is also designed to accommodate finer-grained scheduling. A simple Vec<ParaId> only accommodates equal splits of coretime, whereas unequal splits are likely to be useful in the future. A simple use-case would be a parachain deciding to sell off 1/4 of its coretime for the next day, due to low demand, or likewise having a pallet on a parachain which is willing to reimburse buyers for all regions of arbitrary size, so long as blocks remain sufficiently full. That's not where we are today, but making decisions to enable such possibilities is quite important in my view.

The primary reason to have a mapping between ParaIds and assigned regions (with sufficiently unique IDs) is to enable elastic scaling. I believe that any proposal which does not allow a simple lookup from ParaId to all near-future scheduling information across all cores is a non-starter in the implementation of elastic scaling and would need to be reworked at that point, which I would very much like to avoid. What the ID is for that purpose is fairly immaterial, but uniqueness is quite important. A mapping from (CoreId, Timeslice) -> ParaId would require iterating all cores for the timeslice just for nodes to determine this information. Spending relay-chain execution time in computing this mapping is also wasteful.

We should have a single scheduling mechanism on the relay chain that handles all allocations of coretime, regardless of which mechanisms or chains create and manage those allocations - for the tens of thousands of on-demand claims along with longer-term scheduled paras. If not, we would need to introduce a separate scheduling mechanism for on-demand, and node and runtime implementers would need to maintain two parallel implementations of scheduling tracking. Given the already substantial complexity of Polkadot, I believe this is best avoided by unifying both and this will pay dividends in implementation time and maintenance overhead.

Note that this proposal's format need only be used in the implementation of allocate on a broker-chain, which would transform and blit a region, and could even be abstracted away from the broker entirely via XCM. i.e. if the broker is enshrined, the relay chain can accept blitting information in whatever format the broker wants and transform it under the hood.

There are other goals like making up for missed opportunities in a timely way that aren't met by the above mapping but that's all described in the RFC text.

eskimor commented 1 year ago

A simple Vec only accommodates equal splits of coretime

This does not seem to be true, e.g. [A, A, A, B] would give A 3x the time it gives to B.

rphmeier commented 1 year ago

You can do that, but it doesn't scale very well beyond very simple fractions. It also doesn't accommodate situations where you've split a region into e.g. 10 parts and only 1/10 is currently allocated while the other 9/10 are being traded and haven't yet been allocated.
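The split scenario above can be sketched as follows; the struct and helper names are hypothetical. A rate held as a numerator over a fixed denominator divides into arbitrarily fine pieces, each independently tradable, which a Vec<ParaId> rotation cannot express.

```rust
// Assumed from the `PartOf57600` type used in the pseudocode in this thread.
const RATE_DENOMINATOR: u32 = 57_600;

#[derive(Clone, Debug)]
struct Region {
    core: u32,
    start: u32,
    end: u32,
    rate_numerator: u32,      // rate = rate_numerator / RATE_DENOMINATOR
    assignee: Option<u32>,    // None while the piece is unallocated / tradable
}

// Split a region's rate into `parts` equal pieces over the same span.
// (For simplicity this sketch requires an even split.)
fn split_rate(region: &Region, parts: u32) -> Vec<Region> {
    assert!(region.rate_numerator % parts == 0);
    let piece = region.rate_numerator / parts;
    (0..parts)
        .map(|_| Region { rate_numerator: piece, assignee: None, ..region.clone() })
        .collect()
}

fn main() {
    let whole = Region {
        core: 0,
        start: 0,
        end: 28 * 14_400, // an arbitrary illustrative span
        rate_numerator: RATE_DENOMINATOR,
        assignee: None,
    };
    // Split into 10 parts; allocate only 1/10 to para 1000. The other 9/10
    // remain unallocated and can be traded before ever being assigned.
    let mut parts = split_rate(&whole, 10);
    parts[0].assignee = Some(1000);
    let allocated: u32 = parts
        .iter()
        .filter(|r| r.assignee.is_some())
        .map(|r| r.rate_numerator)
        .sum();
    assert_eq!(allocated, RATE_DENOMINATOR / 10);
}
```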

rphmeier commented 12 months ago

For a given relay chain parent you have x cores for para y. We just accept x candidates, which core we use for which should not matter. Processes on a computer also don't care on which cores they are run. For forks and such, I don't see how regions directly solve this problem. If we wanted to be explicit for any reason, we could equally well put the core index in the receipt.

I hadn't made it completely clear in the RFC text. Of course this is the end result we want.

Hidden complexity; here be dragons: https://github.com/polkadot-fellows/RFCs/pull/3#discussion_r1284489038

rphmeier commented 12 months ago

If we wanted the scheduler to use regions even for on-demand, what would this look like?

First off, there's no distinction between an on-demand core and a non-on-demand core.

There are only workloads per core, defined in RFC-5.

oversimplified pseudocode, but here's what I'm imagining, for instance:

enum CoreAssignment {
  Task(ParaId),
  InstantaneousPool,
}

struct OnDemandTracker {
  last_reset: BlockNumber,
  consumed: RationalOver57600,
  rate: PartOf57600,
}

impl OnDemandTracker {
  fn has_space(&self, needed: RationalOver57600, now: BlockNumber) -> bool {
     ((now.saturating_sub(self.last_reset)) * self.rate) - self.consumed >= needed
  }

  fn consume(&mut self, used: RationalOver57600) {
    self.consumed += used;
  }
}

storage OnDemand: StorageMap<CoreId, OnDemandTracker>;

// code implementing RFC-5
fn assign_core(core_id: CoreId, assignments: Vec<(CoreAssignment, PartOf57600)>) {
  // create regions directly for all `CoreAssignment::Task`.

  let instantaneous_rate: PartOf57600 = ...; // from `CoreAssignment::InstantaneousPool`
  if instantaneous_rate == 0 {
    OnDemand::remove(core_id);
  } else {
    OnDemand::insert(core_id, OnDemandTracker { last_reset: now, consumed: 0, rate: instantaneous_rate });
  }
}

fn schedule_on_demand(para_id: ParaId) {
  const WINDOW_SIZE = 5;

  let needed_resources = 57600; // assume entire core is consumed for one relay-chain block in total.

  // find first core with enough available resources
  // for illustration only.
  // in practice, iteration is not efficient. a linked list of cores with enough resources could be used, 
  // with one iteration per block to bring cores back into the linked list.
  for (core_id, resources) in OnDemand::iter_mut() {
    if !resources.has_space(needed_resources, now) { continue }
    resources.consume(needed_resources);

    let start = now;
    let end = start + WINDOW_SIZE;

    // create a region that allows the parachain to make one block in the next 5.
    // this code could easily be adapted to allow for multiple on-demand blocks to be purchased at once.
    create_region(Region {
      core: core_id,
      schema: RegionSchema { start, end, maximum: Some(needed_resources), rate: needed_resources },
      consumption: 0,
      assignee: para_id,
    });
    break;
  }
}

rphmeier commented 11 months ago

GitHub is broken so I can't respond in-line, but.

Still, using different definitions of regions seems off. That might be fine, if one is better suited for markets and one is better for scheduling, but so far, not even the ids would match. Maybe a reason to not yet vote on RFC-1?

They serve different purposes. One is for markets, the other is for off-chain scheduling which is quickly verifiable on-chain. RFC-1 wasn't designed with collator/validator interaction in mind so there are necessarily different goals and designs for those.

How would they learn about the offline backing group and how would they coordinate on who is going to work on what? All solvable, I guess, but also over complicating

That was the point of the Socratic dialogue. That hidden complexity lies in the assumption that "collators and validators are smart". I'm not saying this is easy, but backers detecting that other groups aren't working is strictly easier. Though this is all probably moot given https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629?u=rphmeier as we don't truly need to introduce more backing groups in order to schedule more cores.

Burst cores are also an artifact worked around in the above proposal. It's true that we need to limit load, but asking chains to order extra on-demand cores just to use coretime they already paid for and to avoid the "roadblocks" of occupied cores feels like such a non-starter that we should not even consider it. If the system is overloaded and coretime was oversold, they should ideally end up getting refunds, not paying more. Would you use an airline which overbooked seats, refused you the right to board, and then offered to let you on if you paid extra money? This is much less of an issue when we remove core affinity, so I will attempt to rework this proposal without any core affinity whatsoever.

They are not limited in size and have no ordering. Not sure how this is supposed to work?

Why is ordering important? Only because of friction between parachains scheduled on the same core, and even so, it is a very incomplete solution. This proposal removes that friction by adding burst cores, so ordering is unimportant. If core affinity is removed altogether, this matters even less: there is still some friction, but its effect on particular parachains averages out system-wide.

I think high load is actually the most likely reason for slow availability even longer term, because backers are incentivized to make the candidate available. If they fail to do so, they lose out on rewards while already having invested all the work.

Practically speaking, with our approvals protocol the amount of work which any particular validator must do is only equal in expectation. Sometimes validators may be selected to do much more work, and sometimes much less. The distribution of workload in approvals for validators is approximately Gaussian, as each validator's load is a sum of many independent assignment draws. Therefore, validators already typically run below "capacity", as they may end up drawing a heavy workload from the far right tail of the distribution and this must not cause the network to fall over. The amount of system-wide burst cores certainly moves this distribution over somewhat, but the case that is actually worth worrying about is

Burst cores definitionally cannot be utilized more than 50% of the time (as they are only used to cancel out missed opportunities). It's not clear that introducing a small number of burst cores actually leads to anything other than occasional slight finality delays on a few blocks, where a significant number of their tranche-0 checkers have all drawn 3-sigma workloads and then must be covered by tranche-1 validators. This increases network load slightly, and certainly linearly in the number of cores, but with a very small constant factor due to the probabilities involved.

rphmeier commented 11 months ago

I am closing this RFC in anticipation of a simpler & more powerful replacement.