kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0
4.23k stars 645 forks source link

KEP-1397: descheduler integration with evacuation API as an alternative to eviction API #1354

Open ingvagabund opened 4 months ago

ingvagabund commented 4 months ago

Is your feature request related to a problem? Please describe.

The descheduler eviction policy is built on top of the eviction API. The API currently does not support eviction requests that are not completed right away. Instead, any eviction needs to either succeed or be rejected in response. Nevertheless, there are cases where an eviction request is expected to only initiate eviction. While getting confirmation or rejection of the eviction initiation (or its promise).

Describe the solution you'd like

Utilize evacuation API as an alternative to eviction API. As an interim solution (until the evacuation API is available) allow to interpret pods with a special annotation as a request for eviction initiation instead of expecting an immediate eviction.

Resolves: https://github.com/kubernetes-sigs/descheduler/issues/1397

k8s-ci-robot commented 4 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from ingvagabund. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/kubernetes-sigs/descheduler/blob/master/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
songtao98 commented 1 month ago

Hi, @ingvagabund . I have some questions about how does this KEP works in Descheduling Framework.

  1. Do you expect to use this new eviction handling mechanism to replace existing PodEvictor (who call Eviciton API directly), so that all users evict their pods through it?
  2. Do you consider to take one more step to make it as a extension point, e.g. Evict, so that users can decide which way to evict their pods, and existing PodEvictor or your extension by this KEP can both be implemented as a EvictPlugin(different with existing EvictorPlugin)?
ingvagabund commented 1 month ago

@songtao98 thank you for asking the right questions.

  1. Do you expect to use this new eviction handling mechanism to replace existing PodEvictor (who call Eviciton API directly), so that all users evict their pods through it?

Eventually, that's the ultimate goal. However, the first implementations will provide both eviction API and evacuation API. So we have some soft time for early adopters to share their experience and corner case for which evacuation API does not work/scale well. Followed by discussions with evacuation API designers about possible resolutions before the new API goes GA.

  1. Do you consider to take one more step to make it as a extension point, e.g. Evict, so that users can decide which way to evict their pods, and existing PodEvictor or your extension by this KEP can both be implemented as a EvictPlugin(different with existing EvictorPlugin)?

Not right now. Ultimately, the new evacuation API should be a special case of the eviction API. The details are yet to be discussed.

songtao98 commented 1 month ago

thank you for asking the right questions.

@ingvagabund Thank you for your reply! For question 1, the strategy sounds great. For question 2, I'm not sure if someone has ever proposed a custom Evictor mechanism(to customize the real evictor like PodEvictor, not currently EvictorPlugin), but we think this capability will eventually be meaningful and already implemented one. We are also looking forward to more discussion about details, and if the community think it's the right way, we will be more than happy to contribute : )

ingvagabund commented 1 month ago

We think this capability will eventually be meaningful and already implemented one. We are also looking forward to more discussion about details, and if the community think it's the right way, we will be more than happy to contribute : )

Few sig calls back we discussed the possibility of writing mini-keps. Turning the low level eviction mechanism into a plugin is a good example for it :) If there are good use cases where a user can benefit from a custom eviction mechanism. Would you be willing to write a proposal for this? The current proposal focuses mainly on integration with the new evacuation API (and annotation as an interim solution). Pluginazing the evictor itself is related, yet it deserves its own attention.

songtao98 commented 1 month ago

Few sig calls back we discussed the possibility of writing mini-keps. Turning the low level eviction mechanism into a plugin is a good example for it :) If there are good use cases where a user can benefit from a custom eviction mechanism. Would you be willing to write a proposal for this? The current proposal focuses mainly on integration with the new evacuation API (and annotation as an interim solution). Pluginazing the evictor itself is related, yet it deserves its own attention.

For sure! We are certainly willing to draft a proposal for customizing the eviction mechanism. However, it might take some time as I need to organize some details and perform the necessary abstractions. BTW, regarding the mini-keps you mentioned, are there any specific format requirements? If not, I will follow the current KEP template.

ingvagabund commented 1 month ago

BTW, regarding the mini-keps you mentioned, are there any specific format requirements? If not, I will follow the current KEP template.

We don't have any dedicated template for the mini-keps atm. I have utilized the upstream k8s template (which might put the bar too high). Nevertheless, I find it quite useful. Even filling in the parts that seem to be too far away from the descheduler project. Yet, they help me to ask myself questions that are important to answer. Let's use the upstream template for now until we find a better one.

gcezaralmeida commented 1 month ago

Hey, this is something that I'm interested in also. The idea of Descheduler evicting Kubevirt VM pods to Live Migrate then will help a lot the community.

ingvagabund commented 1 month ago

Hey @gcezaralmeida, I'd be interested in hearing more about your use cases. Happy to extend the proposal with other use cases if you have any in mind.

knelasevero commented 3 weeks ago

Do we consider the alpha implementation here to be an workaround until the until the evacuation API is fully integrated (in the future, in beta)?

I understand trying to deliver value earlier and then already start to get feedback, but have you checked with Feature Requesters and stakeholders if that is really what they want? Maybe it makes sense to start with the evacuation API integration from the start.

knelasevero commented 3 weeks ago

In Filip's proposal (I need to read it in more detail, but I skimmed it already), there is this notion for Evacuator Prioritization and multiple evacuators (and a lot of other details not considered here). To me it sounds like when going from alpha implementation (here) to beta, and trying to assimilate the other complexities from the real evacuator API too much would change to call the graduation here a graduation from alpha to beta? Maybe it would be another alpha for a while before becoming beta? (or, again, as above comment, it would be worth to consider all those things from the start...)

knelasevero commented 3 weeks ago

This looks great and is very detailed. I need to read both proposals a bit more deeply to give more insights, but wanted to give my initial thoughts so the KEP can go forward

atiratree commented 2 weeks ago

Do we consider the alpha implementation here to be an workaround until the until the evacuation API is fully integrated (in the future, in beta)?

I understand trying to deliver value earlier and then already start to get feedback, but have you checked with Feature Requesters and stakeholders if that is really what they want? Maybe it makes sense to start with the evacuation API integration from the start.

FIY, we are still discussing the Evacuation API feature and it was deferred from 1.31

ingvagabund commented 2 weeks ago

Do we consider the alpha implementation here to be an workaround until the until the evacuation API is fully integrated (in the future, in beta)?

Yes

I understand trying to deliver value earlier and then already start to get feedback, but have you checked with Feature Requesters and stakeholders if that is really what they want? Maybe it makes sense to start with the evacuation API integration from the start.

I did. This is an important feature for KubeVirt. There's a strong interest to have this feature merged sooner than later. Waiting until the Evacuation API is implemented/alpha would be the best course of actions. Yet, it's not a blocker for implementing the first alpha of this feature.

ingvagabund commented 2 weeks ago

In Filip's proposal (I need to read it in more detail, but I skimmed it already), there is this notion for Evacuator Prioritization and multiple evacuators (and a lot of other details not considered here). To me it sounds like when going from alpha implementation (here) to beta, and trying to assimilate the other complexities from the real evacuator API too much would change to call the graduation here a graduation from alpha to beta? Maybe it would be another alpha for a while before becoming beta? (or, again, as above comment, it would be worth to consider all those things from the start...)

Currently, the promotion to beta requires implementation of the evacuation API. Thus, we need to stay aligned with it and can not promote to beta until the evacuation API is beta.

knelasevero commented 2 weeks ago

Do we consider the alpha implementation here to be an workaround until the until the evacuation API is fully integrated (in the future, in beta)?

Yes

I understand trying to deliver value earlier and then already start to get feedback, but have you checked with Feature Requesters and stakeholders if that is really what they want? Maybe it makes sense to start with the evacuation API integration from the start.

I did. This is an important feature for KubeVirt. There's a strong interest to have this feature merged sooner than later. Waiting until the Evacuation API is implemented/alpha would be the best course of actions. Yet, it's not a blocker for implementing the first alpha of this feature.

I know it is not a blocker. I see it more from the frame of:

Is this delivering enough value when implemented as a workaround to justify not waiting for the actual thing that you want to use? Is the feedback and initial users you will get now be worth implementing this, even if the thing will be quite different when the Evacuation API is here (in beta or when it is safe to use it or w/e)?

ingvagabund commented 2 weeks ago

Is this delivering enough value when implemented as a workaround to justify not waiting for the actual thing that you want to use? Is the feedback and initial users you will get now be worth implementing this, even if the thing will be quite different when the Evacuation API is here (in beta or when it is safe to use it or w/e)?

Disclaimer: this is my personal comment and it does not have to reflect the current reality.

We don't have any participating initial users besides the discussions I have had with some KubeVirt representatives. So currently there's only one user. The feedback is currently limited and influenced by the evacuation API proposal. I don't have any solid data to support anything I have proposed in the current proposal. The design was crafted in a best-effort manner in my best conscience. Taking into account many what-if scenarios. We still don't have many participating contributors in the descheduler community to extend the discussions/reviews and hand craft a better design. Practically, any contribution from my side is done either in my free time of when I prioritize my work in my current company.

I am aligned with your perspective and acknowledge you asked the right and important questions. I see two options. To wait (hard to say for how long) and collect more feedback with an assumption the design will improve. Or, pin the current design, produce v1alpha1 implementation, collect the feedback and see how well/bad the design proves the be.

knelasevero commented 1 week ago

We still don't have many participating contributors in the descheduler community to extend the discussions/reviews and hand craft a better design

You focus on the design reviews not being comprehensive for lack of contributors, but that is actually an argument to wait, no? Since if you consider the repo to have less support than it needs it would be harder to maintain a workaround than it would be to maintain what you envision to be the final implementation (that would also start at alpha, but would not have big changes of core internals, while evolving initially).

To wait (hard to say for how long) and collect more feedback with an assumption the design will improve. Or, pin the current design, produce v1alpha1 implementation, collect the feedback and see how well/bad the design proves the be.

I also see these two options, and I see pros and cons coming from both of them. I think the workaround design is great, I wouldn't consider it to be bad. I think it was detailed and a lot has been considered. Just raised this discussion to be the devil's advocate and check other paths. This is an important feature for KubeVirt -> but how urgent, I think that's the question. Also do they know of the plan to migrate later to the evacuation API approach? Would they explicitly prefer not to wait, or did they just say they need it asap?

with an assumption the design will improve

But you also assume that, or the goal of migrating to the intended final implementation using the actual evacuation API (and ditching the previous one) would not be written down here in you EP.

fossedihelm commented 2 days ago

Reading the great discussion here! To me, the proposal looks good, and thank you for this effort!

This is an important feature for KubeVirt -> but how urgent, I think that's the question. Also, do they know of the plan to migrate later to the evacuation API approach? Would they explicitly prefer not to wait, or did they just say they need it asap?

Answering this question: Yes, this is an essential feature for KubeVirt. Let's start from the beginning: live migration is a key feature of KubeVirt; it allows the workload not to be compromised. For this reason, it is chosen as the default behavior during an eviction, drain, etc... KubeVirt is also shipped by default with Openshift, and currently, Descheduler and KubeVirt behavior are not compatible with each other. Without this workaround, a cluster admin would have to choose between having an active KubeVirt core feature and the Descheduler, and this would put them in a difficult situation that would be best avoided. As expressed above this is a workaround that will be overridden with the evacuation API and KubeVirt will move to the evacuation API accordingly. IOW, yes, KubeVirt side it would definitely be better to have this ASAP. Thank you!

knelasevero commented 1 day ago

Reading the great discussion here! To me, the proposal looks good, and thank you for this effort!

This is an important feature for KubeVirt -> but how urgent, I think that's the question. Also, do they know of the plan to migrate later to the evacuation API approach? Would they explicitly prefer not to wait, or did they just say they need it asap?

Answering this question: Yes, this is an essential feature for KubeVirt. Let's start from the beginning: live migration is a key feature of KubeVirt; it allows the workload not to be compromised. For this reason, it is chosen as the default behavior during an eviction, drain, etc... KubeVirt is also shipped by default with Openshift, and currently, Descheduler and KubeVirt behavior are not compatible with each other. Without this workaround, a cluster admin would have to choose between having an active KubeVirt core feature and the Descheduler, and this would put them in a difficult situation that would be best avoided. As expressed above this is a workaround that will be overridden with the evacuation API and KubeVirt will move to the evacuation API accordingly. IOW, yes, KubeVirt side it would definitely be better to have this ASAP. Thank you!

Thanks! Having your direct input here is important. Thanks for chiming in!