ironcore-dev / ironcore

Cloud Native Infrastructure as a Service
https://ironcore-dev.github.io/ironcore
Apache License 2.0

Multi-Level Scheduling Proposal #1128

Open lukasfrank opened 1 month ago

lukasfrank commented 1 month ago

Proposed Changes

balpert89 commented 1 month ago

+1 for the proposal. This addresses a lot of current pain points in the stack:

Going into more detail: the currently proposed approach is decentralized, whereas the centralized one (in the alternatives section) follows the solution used in the network stack. A clustered solution has some disadvantages, such as the lack of guaranteed availability. For networking this is acceptable because if a critical infrastructure component is down, networking is affected anyway. Scheduling should not have such a big impact on compute: the decentralized approach would isolate the impact at the pool level. Another aspect of a centralized solution is that you have to deal with a lot of "boilerplate" challenges such as eventual consistency, which leaves room for race conditions. The Reservations solution solves this pretty elegantly because you can introduce a time period during which the scheduler waits before making its decision, and it then only considers the pools it finds in the status slice. Therefore, my vote goes to the decentralized approach.

On the topic of the "scheduling decision": the reservation system is meant to determine who can provide the requested resources. Another controller can then use this to decide which of those pools to actually use for the Machine resource. This enables behavior similar to Node <-> Pod scheduling in vanilla Kubernetes. Another point to consider is that you can enable "system" reservations to account for resources that are exclusively reserved for system applications.

Some aspects that are unclear to me:

lukasfrank commented 1 month ago
  • who decides the rating on a given status entry, is this similar to a priority? How does this influence the scheduling decision?

Only the pool provider can calculate the rating (since it is the component that checks whether the reservation can be fulfilled), and it is a metric of "how well the Reservation fits onto the related pool". It should be understood as a hint for the scheduler when making its decision.
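As a rough illustration of how a scheduler could use the rating as a hint, here is a small Go sketch. The `PoolStatus` type and `pickPool` function are hypothetical, and "higher rating is better" is an assumption; the proposal does not fix a scale.

```go
package main

import "fmt"

// PoolStatus is a hypothetical status entry. Rating is the provider's own
// estimate of how well the reservation fits its pool (assumed: higher is better).
type PoolStatus struct {
	Pool     string
	Accepted bool
	Rating   int
}

// pickPool returns the accepted pool with the highest rating.
// ok is false when no pool accepted the reservation.
func pickPool(statuses []PoolStatus) (pool string, ok bool) {
	best := -1
	for i, s := range statuses {
		if !s.Accepted {
			continue // declined pools are never candidates, whatever their rating
		}
		if best == -1 || s.Rating > statuses[best].Rating {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return statuses[best].Pool, true
}

func main() {
	statuses := []PoolStatus{
		{Pool: "pool-a", Accepted: true, Rating: 40},
		{Pool: "pool-b", Accepted: false, Rating: 90},
		{Pool: "pool-c", Accepted: true, Rating: 75},
	}
	fmt.Println(pickPool(statuses))
}
```

Since the rating is only a hint, a real scheduler could combine it with other policies (affinity, spreading) instead of taking the maximum directly.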

  • how will arbitrary resources be announced, such as the already mentioned EPC Memory? Or e.g. dedicated graphics cards?

In the distributed approach: There is no need anymore for announcing resources. The resource "owner" (the pool provider) is in charge of taking or rejecting the reservation and needs to keep track of all the resources. If arbitrary resources aren't available on a specific host, the reservation will be declined.
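A minimal sketch of the provider side, in Go, assuming a hypothetical `PoolProvider` that tracks its free resources itself. The type, the `Reserve` method, and the resource names are illustrative only. A request for a resource the pool does not have is declined, so nothing needs to be announced up front.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity, e.g. "cpu": 8.
// The names used below are illustrative.
type Resources map[string]int64

// PoolProvider is a hypothetical provider that keeps track of its own
// free resources. It is not safe for concurrent use; a real provider
// would need locking or a single reconcile loop.
type PoolProvider struct {
	free Resources
}

// Reserve accepts the request only if every requested resource is known
// to the pool and sufficiently available. On acceptance the quantities
// are deducted; on decline nothing changes.
func (p *PoolProvider) Reserve(req Resources) bool {
	for name, qty := range req {
		if avail, ok := p.free[name]; !ok || avail < qty {
			return false
		}
	}
	for name, qty := range req {
		p.free[name] -= qty
	}
	return true
}

func main() {
	p := &PoolProvider{free: Resources{"cpu": 16, "memory-gib": 64}}

	// Fits: accepted and deducted.
	fmt.Println(p.Reserve(Resources{"cpu": 4, "memory-gib": 8}))
	// Asks for a resource this pool does not offer: declined.
	fmt.Println(p.Reserve(Resources{"cpu": 4, "epc-memory-mib": 128}))
}
```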

@balpert89 Does that make sense to you?

balpert89 commented 1 month ago

The rating part is clear for me now, thanks for addressing.

In the distributed approach: There is no need anymore for announcing resources. The resource "owner" (the pool provider) is in charge of taking or rejecting the reservation and needs to keep track of all the resources. If arbitrary resources aren't available on a specific host, the reservation will be declined.

Does that mean we will deprecate the allocatable / available (https://github.com/ironcore-dev/ironcore/blob/main/api/compute/v1alpha1/machinepool_types.go#L30-L33) fields as they are not required anymore?

lukasfrank commented 1 month ago

Correct, there would be no need for these fields anymore. If they are used to aggregate the resources of the entire infrastructure, we can offer metrics and aggregate them the Kubernetes way.