Autoscaler needs to be decisive about which Pods to kill when scaling down applications

nimakaviani commented 5 years ago

In what area(s)?

/area autoscale

Describe the feature

Following the conversations in https://github.com/knative/serving/issues/3759, we have noticed that knative app scale-down is done by changing the replica count on the application deployment. This delegates to k8s to choose the pods to kill which may result in suboptimal scale-down strategies.

To give an example, with instances A and B of the same Knative application, at the time of scaling down, if instance A is serving multiple requests and instance B is idle, I would expect the autoscaler to try and terminate instance B, converge as quickly as possible, and save on resource usage. However, with k8s having no knowledge of how busy the pods are and what requests they are serving, changing the replica count will result in k8s randomly selecting the pod to terminate.

nimakaviani commented 5 years ago

/cc @duglin

cppforlife commented 5 years ago

this is how k8s decides which pods to delete: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go#L684-L694

duglin commented 5 years ago

a link to a similar discussion in K8s-land: kubernetes/kubernetes#45509

duglin commented 5 years ago

I added this to the agenda for today's api wg call. Wasn't sure if it's an api thing or a scaling thing - I decide on api since it might require a specialized Kn-Deployment type of resource to be created if we can't get K8s's Deployment resource to be modified.

greghaynes commented 5 years ago

How would this work with containerconcurrency > 1? e.g. I have containerconcurrency of 10 and 3 pods, my concurrency has just decreased to 20 and therefore we would like to scale down. It is very unlikely that all 20 of these requests are on the same two pods, and more than likely theres a somewhat even distribution among the 3 pods. Even if we had the ability to pick which pod to terminate our decision isn't that much better than randomly killing.

I think this is actually first a networking problem - we'd need to stop forwarding requests to a pod of our choosing when we'd like to scale down, wait for it to completely drain, then terminate it. We also would likely want smarts to re-add a pod to our networking pool if we decide to scale back up before that pod has been fully drained.

vagababov commented 5 years ago

The simple answer is that it should not matter.

First: pods just crash, nodes crash, k8s decides to migrate the pod to another node due to scaling event, etc.

Second: in the Q-P we do Server.Shutdown, which means all the pending requests in will be processed, before terminating and we will stop accepting new requests in the meantime.

The more complex facets are: requests that take a very long time to complete might not complete, but this should be subsumed by the First.

nimakaviani commented 5 years ago

@vagababov it is desirable for the system to converge faster. ReplicaSets already prioritize scale-down in the following order. so yes, even though at the end it should not matter, it saves time (and presumably money) to converge faster.

not-ready < ready, unscheduled < scheduled, and pending < running

@greghaynes regarding your point below:

I think this is actually first a networking problem - we'd need to stop forwarding requests to a pod of our choosing when we'd like to scale down, wait for it to completely drain, then terminate it. We also would likely want smarts to re-add a pod to our networking pool if we decide to scale back up before that pod has been fully drained.

the intention is to be smarter about finding a pod of our choosing to go through the draining process. Perhaps this can be achieved by looking at len(reqChan) plus a function on the last request's start time and the specified timeoutSecond for a container.

vagababov commented 5 years ago

by looking at len(reqChan) plus a function on the last request's start time and the specified timeoutSecond for a container.

This also doesn't work. Since you presume all requests take the same amount of time. Which they usually don't. :-)

greghaynes commented 5 years ago

@vagababov Although youre right we cannot guarantee this will never happen, its not hard to make up a case where, if requests are much longer than our scale frequency, we can completely starve a cluster from ever being able to complete a request if pods are scaled up and down repeatedly. I think its reasonable to try and minimize this failure mode where we have control over it.

@nimakaviani Do you mean this is only to cover deciding which pod we would potentially drain, not to actually perform the drain and then kill? I think i'd rename this issue "Scale down applications without killing long running requests" to cover this whole thing. My main issue with talking about killing pods specifically is IMO thats the trivial part - the hard part is coordinating the pod draining sequence.

vagababov commented 5 years ago

@greghaynes but why won't be able to complete? Since our scaling metric is concurrent requests -- they are all still very much in progress, hence we'd scale down only if some of them completed (which is a contradiction to your statement)? Or am I missing here something?

Additionally, as I said above the server shutdown and between k8s killing the pod (default is 30s) all the requests still have time to complete. Of course, if they are taking more than 30s, it's a possibility, but I still don't see how we can achieve 0 requests processed ever case.

greghaynes commented 5 years ago

right, its never 0. The math on this is breaking my brain but if you assume you can scale up and down quickly relative to request lifetime there can be a significant number of requests left per pod as you scale down that get killed. In my 3 node with concurrency 20 case it could easily be 1/3 of the connections get killed if each node has even load (and it could be worse if we pick unluckily).

nimakaviani commented 5 years ago

This also doesn't work. Since you presume all requests take the same amount of time. Which they usually don't. :-)

@vagababov It will be a best effort regardless. But it will address some obvious choices, e.g. between a pod that is not doing any work vs. a pod that is actively processing requests. The extra 30sec can cascade to a big number if we just neglect the chance for optimization.

@greghaynes the story it is a combination of deciding on what to drain and, making sure what is drained will get killed. Point here is that by relying on ReplicaSets we have no way of enforcing what pod will be killed after the decision on draining is made. Once we have more control on what to kill, the solution to not "Scale down applications without killing long running requests" can be built on top of it.

duglin commented 5 years ago

I view this as a 2 step problem: 1 - have support for choosing which pod to kill - so this is probably either a change to Deployments or create a new resource that gives us this ability 2 - have the code/smarts to leverage this. And I think this includes all of the concerns/solutions described above. So within this one there might also be two parts: a) the smarts to drain/taint pods that we want to eventually kill, and b) have the ability to tell the Deployment which to kill.

1 might be the easier one to solve simply because there's less magic/smarts involved. But once we have that in place we can them work perfecting 2).

To @vagababov point about things can always die... true but to me those are different cases than the system purposely messing up by playing Russian roulette with my instances :-)

vagababov commented 5 years ago

Except it's not russian roulette -- one will always die :)

On Tue, Jul 9, 2019 at 6:24 PM Doug Davis notifications@github.com wrote:

I view this as a 2 step problem: 1 - have support for choosing which pod to kill - so this is probably either a change to Deployments or create a new resource that gives us this ability 2 - have the code/smarts to leverage this. And I think this includes all of the concerns/solutions described above. So within this one there might also be two parts: a) the smarts to drain/taint pods that we want to eventually kill, and b) have the ability to tell the Deployment which to kill.

1 might be the easier one to solve simply because there's less magic/smarts involved. But once we have that in place we can them work perfecting 2).

To @vagababov https://github.com/vagababov point about things can always die... true but to me those are different cases than the system purposely messing up by playing Russian roulette with my instances :-)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/knative/serving/issues/3886?email_source=notifications&email_token=AAF2WX3SNJ2KG5E2VC7FVW3P6U25PA5CNFSM4HIG36GKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSABXQ#issuecomment-509870302, or mute the thread https://github.com/notifications/unsubscribe-auth/AAF2WX24J32NPGOS36RWLK3P6U25PANCNFSM4HIG36GA .

markusthoemmes commented 5 years ago

I agree that it is desirable in certain cases to kill a specific pod rather than a random one.

A given in our current system is that all requests are finished before we actually shut down a pod. If that's not the case, we have a bug and should address that separately.

As @duglin mentioned, the issue is much bigger than just being smart about choosing one. We also don't have the means at hand to do that today. Essentially, we'd need to implement our own ReplicaSet resource, where we can kill pods of our choosing without the reconciler scaling the set up again immediately.

Choosing the right pod is not quite simple either. The state of connections per container is in the extreme case distributed to all pods. Keeping track of all of that state won't be feasible. Asking them all about their current state sounds fishy at scale.

I feel like we should convince ourselves that there is big enough of a problem with the current random mechanism to justify adding that level of complexity.

greghaynes commented 5 years ago

It seems this was discussed heavily at last weeks autoscaling WG meeting - can someone translate that in to outcomes on this issue?

duglin commented 5 years ago

The state of connections per container is in the extreme case distributed to all pods.

Perhaps but it seems to me that as we go forward, especially when we consider things like async jobs where there may not be an external connection for the life of the job, we need a reliable way to know what the state of the system is. Otherwise our choices w.r.t. scaling up and down are just guesses at best and our users will expect a more deterministic algorithm to be used. If not their results will appear almost random at times.

Here's another use case for why we need this state and why killing the wrong instance would cause undesirable results:

let's say we scaled up to 2 instances for some reason
one instance is running a very long running job
the other instance idle - no new requests for a while
if we pick the busy instance to kill, or drain, then when minScale=1 we'll continue to have 2 instances since I assume the "1" instance would apply to the idle instance when it should be applied to the busy one. That's twice as many instances alive than we should have.
also, all subsequent requests that come in will be sent to the idle instance and keep it alive long than it should (when minScale=0). Or if it ever does scale down to zero it'll cause a new one to be created when the instance processing the long-running job should have taken it on (assuming containerConcurrency > 1). Again twice as many instances as we should have.

Keeping track of all of that state won't be feasible

Not sure I agree. We may have to get the control plane more involved in the traffic flow than it is today (or than we'd like), but with all of the complex algorithms that are being worked on to try to be really smart about our scaling, if we don't have accurate input to those algorithms then we won't get smart (expected) results. And that will result in unhappy consumers.

cleverguy25 commented 5 years ago

We are facing this issue as well, for async background jobs. I wonder why we have not considered a Busy probe just like a readiness probe, where not busy pods are prioritized for termination?

markusthoemmes commented 5 years ago

@cleverguy25 can you elaborate on how you'd envision that busyprobe to look like?

vagababov commented 4 years ago

@nimakaviani , can you please fill out the template: https://docs.google.com/document/d/1s6IIU98bi5FlRNmmBaLAn1rgoleK_ovcL746L7NHq0c/edit and put it into the Scaling/Feature Tracks directory on the shared drive?

vagababov commented 4 years ago

Since this is a meta issue, I am going to farm some issues out of it and keep this as the parent one and will add this checklist here, so we can check items as we go. This will help us to track the feature in the project: https://github.com/knative/serving/projects/31

[X] Proposal to WG
[X] Project Approved
[X] Design Doc Reviewed/Approved #6133
[ ] Implementation #6134
[ ] UX Changes (no need for internal projects that aren't exposed to the end user)
[ ] Docs changes #6135

vagababov commented 4 years ago

This hopefully should land in 0.14

markusthoemmes commented 4 years ago

It has landed neither in 0.15 nor in 0.16. @taragu do you think you'll land this in 0.16?

salanki commented 4 years ago

Very curious about this feature. Seems as the work has stalled. Cheering in to hope it get's finished!

vagababov commented 4 years ago

All the work for this has been reverted.

markusthoemmes commented 4 years ago

What do we want to do with this now? It has been moved to the ice-box and all code has been reverted.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

ddelange commented 2 years ago

For future readers: there is a relevant proposal on k8s side ref https://github.com/kubernetes/enhancements/pull/3190 most likely aiming for k8s 1.26

ddelange commented 2 years ago

Alternatively, currently available in beta since k8s 1.22, could knative temporarily add the controller.kubernetes.io/pod-deletion-cost annotation to pods that are currently handling requests? maybe also cc @mattmoor

ddelange commented 2 years ago

Hi :wave:

Opened follow-up issue #13075

Ideas are welcome!

knative / serving

Autoscaler needs to be decisive about which Pods to kill when scaling down applications #3886

In what area(s)?

Describe the feature