kubernetes-sigs / kube-scheduler-wasm-extension

All the things to make the scheduler extendable with wasm.
Apache License 2.0

support the cycle state in guest #16

Closed sanposhiho closed 6 months ago

sanposhiho commented 12 months ago

What would you like to be added:

Support CycleState in guest.

Why is this needed:

The cycle state is one of the important structs passed to each extension point; it is initialized at the beginning of each scheduling cycle. We use it as a way to pass data from one extension point to another. One common use case is to pre-calculate something in PreFilter or PreScore and use the result in Filter or Score. (PreFilter/PreScore is called once per scheduling cycle, but Filter/Score is called for every potential Node. That's why we want to pre-calculate instead of calculating every time Filter/Score is called.) https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#extension-points
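For context, a minimal sketch of that pattern as a plain (non-wasm) scheduler framework plugin; the plugin name, state key, and pre-computed value are made up, and the exact method signatures vary slightly between Kubernetes versions:

```go
package example

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// preFilterData is the pre-computed result stored in CycleState.
type preFilterData struct {
	expensiveResult int
}

// Clone implements framework.StateData. The data is read-only after
// PreFilter, so returning the same pointer is sufficient here.
func (d *preFilterData) Clone() framework.StateData { return d }

const stateKey framework.StateKey = "example.com/pre-filter-data"

type examplePlugin struct{}

func (p *examplePlugin) Name() string { return "example" }

// PreFilter runs once per scheduling cycle: do the expensive work here
// and stash the result in CycleState.
func (p *examplePlugin) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	state.Write(stateKey, &preFilterData{expensiveResult: 42})
	return nil, nil
}

func (p *examplePlugin) PreFilterExtensions() framework.PreFilterExtensions { return nil }

// Filter runs once per candidate Node: read the pre-computed result
// instead of recomputing it for every Node.
func (p *examplePlugin) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	data, err := state.Read(stateKey)
	if err != nil {
		return framework.AsStatus(fmt.Errorf("reading cycle state: %w", err))
	}
	d, ok := data.(*preFilterData)
	if !ok {
		return framework.AsStatus(fmt.Errorf("unexpected state type %T", data))
	}
	_ = d.expensiveResult // use the pre-computed value in the filter decision
	return nil
}
```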

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

codefromthecrypt commented 11 months ago

In http-wasm, we had to make the guest scoped to a request/response cycle, even if it is reused after that. I think, with some thought, we can pin a guest to the cycle state if that helps make things more efficient or less complex.

sanposhiho commented 11 months ago

> I think, with some thought, we can pin a guest to the cycle state if that helps make things more efficient or less complex.

I have a similar thought. https://github.com/kubernetes-sigs/kube-scheduler-wasm-extension/pull/6#discussion_r1207996051

Preemption is the only place where the same plugin is called for the same scheduling cycle in parallel (speaking from memory, need to double-check). So, in other places, we should basically be able to use the same instance for the same scheduling cycle.

codefromthecrypt commented 11 months ago

ok so if I understand, there are two concerns:

So, I think that to solve this, we need to:

Is this correct?

sanposhiho commented 11 months ago

Yes, correct. That is mostly the same as what I'm thinking.

> pin an instance to the scheduling cycle (take a module instance from the pool on PreFilter, release it back on Permit)

But Permit may not actually be called, for example when all Nodes are filtered out in the Filter phase. So, we can simplify it: take an instance from the pool on PreFilter, and if PreFilter is later called with a different Pod, assume the previous scheduling cycle has finished and release the instance back.

sanposhiho commented 11 months ago

/assign

sanposhiho commented 11 months ago

Also, pinning an instance to the scheduling cycle allows us to implement caches for several things on the guest. The k8s resources shouldn't change during the scheduling cycle, so we can cache them.
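A rough sketch of what guest-side caching could look like once an instance is pinned to a cycle; all names here are hypothetical, not the actual guest SDK API:

```go
package guest

// Because the host pins one guest instance to one scheduling cycle,
// package-level state lives for exactly that cycle, so cluster data
// fetched from the host can be cached and reused across extension
// points. Reset it when a new cycle begins.
var (
	currentPodUID string           // Pod the current cycle is scheduling
	nodeCache     map[string]int64 // e.g. node name -> allocatable CPU (milli)
)

// resetIfNewCycle clears cached data when the guest starts working on a
// different Pod, i.e. a new scheduling cycle has begun.
func resetIfNewCycle(podUID string) {
	if podUID != currentPodUID {
		currentPodUID = podUID
		nodeCache = map[string]int64{}
	}
}

// allocatableCPU returns a cached value, falling back to a (hypothetical)
// host call the first time a node is seen in this cycle.
func allocatableCPU(nodeName string) int64 {
	if v, ok := nodeCache[nodeName]; ok {
		return v
	}
	v := fetchAllocatableCPUFromHost(nodeName)
	nodeCache[nodeName] = v
	return v
}

// fetchAllocatableCPUFromHost stands in for whatever host API exposes
// node information to the guest.
func fetchAllocatableCPUFromHost(nodeName string) int64 { return 0 }
```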

codefromthecrypt commented 11 months ago

> But Permit may not actually be called, for example when all Nodes are filtered out in the Filter phase. So, we can simplify it: take an instance from the pool on PreFilter, and if PreFilter is later called with a different Pod, assume the previous scheduling cycle has finished and release the instance back.

I don't fully understand how this will work, because something pinned to a scheduling cycle won't be in the pool anymore. It sounds like we are back to one-at-a-time use, not pooling.

Do you mean we have a shared instance with state about the Pod it is working on, and when that doesn't match the pod arg, we know to invalidate data? Or do you mean to still have a pool, but also a queue of in-flight instances, which we double-check later if orphaned due to not reaching the Permit stage?

sanposhiho commented 11 months ago

I forgot to consider the binding cycle in my previous comment. The cycle state is passed from the scheduling cycle to the binding cycle, so we need to use the same instance in the binding cycle as well.

You may have already understood this by looking at #21, but I'll explain the implementation details a bit here so that someone can look back at this later.

#21 tracks which Pod is in which cycle and records which instance we give to which Pod.

Let's see through some scenarios:

Pod is rejected in the scheduling cycle

For example, all Nodes are filtered out in the Filter extension point.

  1. PodA scheduling is started.
  2. PreFilter is executed. The wasm plugin initializes the guest or uses an existing one from the pool, and assigns it to PodA. Assign means that this instance won't be used by anyone else until it's unassigned from PodA.
  3. Filter is executed. The wasm plugin uses the instance assigned to PodA.
  4. (PodA is rejected by some plugins. And PodB's scheduling gets started next.)
  5. PreFilter is executed for PodB. The wasm plugin unassigns the instance from PodA and puts it back into the pool. Then it does the same as (2) for PodB.

Pod is successfully scheduled and goes to the binding cycle.

Note: the binding cycle is executed in parallel.

  1. PodA scheduling is started.
  2. PreFilter is executed. The wasm plugin initializes the guest or uses an existing one from the pool, and assigns it to PodA. Assign means that this instance won't be used by anyone else until it's unassigned from PodA.
  3. Filter is executed. The wasm plugin uses the instance assigned to PodA.
  4. Permit is executed. The wasm plugin records that PodA will go to the binding cycle.

  4a. The binding cycle is started for PodA. In all extension points in the binding cycle, the wasm plugin uses the instance assigned to PodA. (Actually, we haven't supported any extension points in the binding cycle yet.)
  5a. The binding cycle is finished and PostBind is executed. The wasm plugin unassigns the instance from PodA and puts it back into the pool.
  4b. (PodB's scheduling gets started.)
  5b. PreFilter is executed for PodB. The wasm plugin does not unassign the instance from PodA, because it knows that PodA has moved to the binding cycle and the instance is still in use. It does the same as (2) for PodB.

(After (3), 4a -> 5a and 4b -> 5b are executed in parallel.)
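Roughly, the bookkeeping described above could look like the following sketch; the names are illustrative, not the actual #21 code:

```go
package wasmplugin

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// guestInstance stands in for a compiled-and-instantiated wasm guest.
type guestInstance struct{}

func newGuestInstance() *guestInstance { return &guestInstance{} }

// assignment records which instance a Pod owns and whether that Pod has
// entered the binding cycle (set in Permit, cleared in PostBind).
type assignment struct {
	guest     *guestInstance
	inBinding bool
}

type guestPool struct {
	mu          sync.Mutex
	free        []*guestInstance
	assignments map[types.UID]*assignment
}

func newGuestPool() *guestPool {
	return &guestPool{assignments: map[types.UID]*assignment{}}
}

// assignForPreFilter implements steps (2)/(5)/(5b): instances pinned to
// Pods that never reached the binding cycle are recycled when a new Pod
// shows up, while instances still in the binding cycle are left alone.
func (p *guestPool) assignForPreFilter(podUID types.UID) *guestInstance {
	p.mu.Lock()
	defer p.mu.Unlock()

	for uid, a := range p.assignments {
		if uid != podUID && !a.inBinding {
			delete(p.assignments, uid)
			p.free = append(p.free, a.guest)
		}
	}
	if a, ok := p.assignments[podUID]; ok {
		return a.guest // already assigned to this Pod
	}
	var g *guestInstance
	if n := len(p.free); n > 0 {
		g, p.free = p.free[n-1], p.free[:n-1]
	} else {
		g = newGuestInstance()
	}
	p.assignments[podUID] = &assignment{guest: g}
	return g
}

// markBinding implements step (4): remember that the Pod proceeds to the
// binding cycle, so its instance must not be recycled yet.
func (p *guestPool) markBinding(podUID types.UID) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if a, ok := p.assignments[podUID]; ok {
		a.inBinding = true
	}
}

// releaseAfterPostBind implements step (5a): the binding cycle finished,
// so unassign the instance and put it back into the pool.
func (p *guestPool) releaseAfterPostBind(podUID types.UID) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if a, ok := p.assignments[podUID]; ok {
		delete(p.assignments, podUID)
		p.free = append(p.free, a.guest)
	}
}
```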

Possible performance degradation?

Not much, except for preemption. As I said, in preemption, Filter, AddPod, and RemovePod may be executed for the same Pod in parallel. So, we need to take a lock during Filter, AddPod, and RemovePod so that the same instance won't be executed from different goroutines at the same time.
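As a sketch of that locking (again with illustrative names, not the actual #21 code), every call into the pinned instance goes through one mutex:

```go
package wasmplugin

import "sync"

// pinnedGuest wraps the wasm instance assigned to one Pod. Preemption may
// call Filter, AddPod, and RemovePod for that Pod from several goroutines
// at once, so every call into the instance takes the same lock.
type pinnedGuest struct {
	mu sync.Mutex
	// ... wasm module, memory, exported guest functions ...
}

// callGuest serializes access to the single wasm instance; the actual
// guest export being invoked is abstracted away here.
func (g *pinnedGuest) callGuest(export string) int32 {
	g.mu.Lock()
	defer g.mu.Unlock()
	// ... invoke the guest's exported function and return its status code ...
	return 0
}

// The host-side Filter/AddPod/RemovePod handlers all funnel through
// callGuest, so outside of preemption the lock is uncontended and adds
// almost no overhead.
func (g *pinnedGuest) filter() int32    { return g.callGuest("filter") }
func (g *pinnedGuest) addPod() int32    { return g.callGuest("addpod") }
func (g *pinnedGuest) removePod() int32 { return g.callGuest("removepod") }
```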

codefromthecrypt commented 11 months ago

thanks, looks like we have some test cases to write! thanks for saying what they should be. I can help with these once the code is in shape.

kerthcet commented 11 months ago

I think we can support preemption later, it's not that urgent, wdyt? @sanposhiho This is similar to our other plugin implementations like VolumeRestrictions.ReadWriteOncePod, where we supported preemption when graduating to GA.

A basic question about the host-guest model: what happens if we have multiple plugins, and each plugin implements different extension points?

cc @codefromthecrypt @sanposhiho

sanposhiho commented 11 months ago

I'm unclear about the current status of preemption, but the current #21 already supports preemption by taking a lock during Filter, AddPod, and RemovePod. So, that should be working with preemption now.

> if we have multiple plugins, and each plugin implements different extension points

You mean multiple wasm plugins, right? We don't support them yet (we should support them later, though). Basically, currently only one plugin, which hosts the guest instance, is used in the scheduler, and that plugin can load only one .wasm.

kerthcet commented 11 months ago

> I'm unclear about the current status of preemption, but the current https://github.com/kubernetes-sigs/kube-scheduler-wasm-extension/pull/21 already supports preemption by taking a lock during Filter, AddPod, and RemovePod. So, that should be working with preemption now.

Fine, I'll take a look.

kerthcet commented 11 months ago

> You mean multiple wasm plugins, right? We don't support them yet.

Good to start. If we have a rough design of how it would look, that would be great, just to avoid going back and forth.

sanposhiho commented 11 months ago

Yep, I created an issue to track that: https://github.com/kubernetes-sigs/kube-scheduler-wasm-extension/issues/23

sanposhiho commented 6 months ago

/close

The cycle state is already supported.