Closed nimakaviani closed 3 years ago
/cc @duglin
this is how k8s decides which pods to delete: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go#L684-L694
a link to a similar discussion in K8s-land: kubernetes/kubernetes#45509
I added this to the agenda for today's api wg call. Wasn't sure if it's an api thing or a scaling thing - I decided on api since it might require a specialized Kn-Deployment type of resource to be created if we can't get K8s's Deployment resource to be modified.
How would this work with containerConcurrency > 1? E.g. I have a containerConcurrency of 10 and 3 pods, and my concurrency has just decreased to 20, so we would like to scale down. It is very unlikely that all 20 of these requests are on the same two pods; more likely there's a somewhat even distribution among the 3 pods. Even if we had the ability to pick which pod to terminate, our decision isn't that much better than randomly killing one.
I think this is actually first a networking problem - we'd need to stop forwarding requests to a pod of our choosing when we'd like to scale down, wait for it to completely drain, then terminate it. We also would likely want smarts to re-add a pod to our networking pool if we decide to scale back up before that pod has been fully drained.
The simple answer is that it should not matter.
First: pods just crash, nodes crash, k8s decides to migrate a pod to another node due to a scaling event, etc.
Second: in the queue-proxy we do Server.Shutdown, which means all pending requests will be processed before terminating, and we stop accepting new requests in the meantime.
The more complex facet: requests that take a very long time to complete might not complete, but that case is subsumed by the first point.
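For reference, a minimal sketch of that standard Go net/http drain pattern (illustrative only, not the actual queue-proxy code):

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		// Returns http.ErrServerClosed once Shutdown has been called.
		_ = srv.ListenAndServe()
	}()

	// Wait for the SIGTERM that k8s sends before killing the pod.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Stop accepting new connections and wait for in-flight requests,
	// bounded here by the default 30s termination grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```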
@vagababov it is desirable for the system to converge faster. ReplicaSets already prioritize scale-down in the following order: not-ready < ready, unscheduled < scheduled, and pending < running. So yes, even though in the end it should not matter, it saves time (and presumably money) to converge faster.
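For reference, a simplified and purely illustrative paraphrase of that ranking (the real logic is in the getPodsToDelete code linked above):

```go
package scaledown

import corev1 "k8s.io/api/core/v1"

// deletionRank returns a lower value for pods that should be deleted first,
// mirroring the ordering quoted above: not-ready < ready,
// unscheduled < scheduled, pending < running. Illustrative only.
func deletionRank(p *corev1.Pod) int {
	score := 0
	if p.Spec.NodeName != "" { // scheduled beats unscheduled
		score++
	}
	if p.Status.Phase == corev1.PodRunning { // running beats pending
		score++
	}
	for _, c := range p.Status.Conditions { // ready beats not-ready
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			score++
			break
		}
	}
	return score
}
```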
@greghaynes regarding your point below:
I think this is actually first a networking problem - we'd need to stop forwarding requests to a pod of our choosing when we'd like to scale down, wait for it to completely drain, then terminate it. We also would likely want smarts to re-add a pod to our networking pool if we decide to scale back up before that pod has been fully drained.
the intention is to be smarter about finding a pod of our choosing to go through the draining process. Perhaps this can be achieved by looking at len(reqChan) plus a function of the last request's start time and the specified timeoutSeconds for a container.
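As a rough sketch of the kind of heuristic I mean (hypothetical names and weighting, not actual queue-proxy code):

```go
package drain

import "time"

// drainScore is a hypothetical scoring function: lower means a better
// candidate to drain and terminate. It combines the number of in-flight
// or queued requests with how long the most recent request has left
// before hitting the container's timeout.
func drainScore(queued int, lastStart time.Time, timeout time.Duration) float64 {
	if queued == 0 || timeout <= 0 {
		return 0 // idle pod (or no timeout to reason about): cheapest to drain
	}
	remaining := timeout - time.Since(lastStart)
	if remaining < 0 {
		remaining = 0
	}
	// More queued work and more remaining time both make the pod a
	// worse candidate for draining.
	return float64(queued) + remaining.Seconds()/timeout.Seconds()
}
```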
by looking at len(reqChan) plus a function of the last request's start time and the specified timeoutSeconds for a container.
This also doesn't work. Since you presume all requests take the same amount of time. Which they usually don't. :-)
@vagababov Although you're right that we cannot guarantee this will never happen, it's not hard to make up a case where, if requests are much longer than our scale frequency, we can completely starve a cluster from ever being able to complete a request as pods are scaled up and down repeatedly. I think it's reasonable to try to minimize this failure mode where we have control over it.
@nimakaviani Do you mean this is only to cover deciding which pod we would potentially drain, not to actually perform the drain and then kill? I'd rename this issue "Scale down applications without killing long running requests" to cover this whole thing. My main issue with talking about killing pods specifically is that IMO that's the trivial part - the hard part is coordinating the pod draining sequence.
@greghaynes but why won't they be able to complete? Since our scaling metric is concurrent requests -- they are all still very much in progress, so we'd scale down only if some of them completed (which contradicts your statement)? Or am I missing something here?
Additionally, as I said above, between the server shutdown and k8s killing the pod (the default grace period is 30s) all the requests still have time to complete. Of course, if they take more than 30s it's a possibility, but I still don't see how we end up in a case where no requests are ever processed.
Right, it's never 0. The math on this is breaking my brain, but if you assume you can scale up and down quickly relative to request lifetime, there can be a significant number of requests left per pod as you scale down, and those get killed. In my 3-node, concurrency-20 case it could easily be 1/3 of the connections that get killed if each node has even load (and it could be worse if we pick unluckily).
This also doesn't work. Since you presume all requests take the same amount of time. Which they usually don't. :-)
@vagababov It will be a best effort regardless. But it will address some obvious choices, e.g. between a pod that is not doing any work vs. a pod that is actively processing requests. The extra 30sec can cascade to a big number if we just neglect the chance for optimization.
@greghaynes the story is a combination of deciding what to drain and making sure that what is drained will get killed. The point here is that by relying on ReplicaSets we have no way of enforcing which pod will be killed after the decision on draining is made. Once we have more control over what to kill, the solution to "Scale down applications without killing long running requests" can be built on top of it.
I view this as a 2-step problem:
1. Have support for choosing which pod to kill - so this is probably either a change to Deployments or a new resource that gives us this ability.
2. Have the code/smarts to leverage this. I think this includes all of the concerns/solutions described above, so within this there might also be two parts: a) the smarts to drain/taint pods that we want to eventually kill, and b) the ability to tell the Deployment which pods to kill.

1 is probably the easier one to solve simply because there's less magic/smarts involved. But once we have that in place we can then work on perfecting 2.
To @vagababov's point that things can always die... true, but to me those are different cases from the system purposely messing up by playing Russian roulette with my instances :-)
Except it's not russian roulette -- one will always die :)
I agree that it is desirable in certain cases to kill a specific pod rather than a random one.
A given in our current system is that all requests are finished before we actually shut down a pod. If that's not the case, we have a bug and should address that separately.
As @duglin mentioned, the issue is much bigger than just being smart about choosing one. We also don't have the means at hand to do that today. Essentially, we'd need to implement our own ReplicaSet resource, where we can kill pods of our choosing without the reconciler scaling the set up again immediately.
Choosing the right pod is not quite simple either. The state of connections per container is in the extreme case distributed to all pods. Keeping track of all of that state won't be feasible. Asking them all about their current state sounds fishy at scale.
I feel like we should convince ourselves that there is a big enough problem with the current random mechanism to justify adding that level of complexity.
It seems this was discussed heavily at last week's autoscaling WG meeting - can someone translate that into outcomes on this issue?
The state of connections per container is in the extreme case distributed to all pods.
Perhaps, but it seems to me that as we go forward, especially when we consider things like async jobs where there may not be an external connection for the life of the job, we need a reliable way to know what the state of the system is. Otherwise our choices w.r.t. scaling up and down are just guesses at best, and our users will expect a more deterministic algorithm to be used. If not, their results will appear almost random at times.
Here's another use case for why we need this state and why killing the wrong instance would cause undesirable results:
Keeping track of all of that state won't be feasible
Not sure I agree. We may have to get the control plane more involved in the traffic flow than it is today (or than we'd like), but with all of the complex algorithms that are being worked on to try to be really smart about our scaling, if we don't have accurate input to those algorithms then we won't get smart (expected) results. And that will result in unhappy consumers.
We are facing this issue as well, for async background jobs. I wonder why we have not considered a "busy" probe, similar to a readiness probe, where non-busy pods would be prioritized for termination.
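Something along these lines, perhaps - a purely hypothetical sketch of what the sidecar side could expose (the /busy endpoint, its port, and its semantics are made up; nothing like this exists in Knative or k8s today):

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// inFlight would be incremented/decremented by the request (or job) wrapper.
var inFlight int64

// busyHandler answers 200 when the pod is idle and 503 when it is busy,
// so a hypothetical "busy probe" could deprioritize busy pods at scale-down.
func busyHandler(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt64(&inFlight) > 0 {
		http.Error(w, "busy", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/busy", busyHandler)
	_ = http.ListenAndServe(":8099", nil)
}
```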
@cleverguy25 can you elaborate on how you envision that busy probe working?
@nimakaviani, can you please fill out the template (https://docs.google.com/document/d/1s6IIU98bi5FlRNmmBaLAn1rgoleK_ovcL746L7NHq0c/edit) and put it into the Scaling/Feature Tracks directory on the shared drive?
Since this is a meta issue, I am going to farm some issues out of it and keep this as the parent one and will add this checklist here, so we can check items as we go. This will help us to track the feature in the project: https://github.com/knative/serving/projects/31
This hopefully should land in 0.14
It has landed neither in 0.15 nor in 0.16. @taragu do you think you'll land this in 0.16?
Very curious about this feature. It seems the work has stalled. Cheering you on in the hope it gets finished!
All the work for this has been reverted.
What do we want to do with this now? It has been moved to the ice-box and all code has been reverted.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
For future readers: there is a relevant proposal on the k8s side, ref https://github.com/kubernetes/enhancements/pull/3190, most likely aiming for k8s 1.26.
Alternatively, could Knative temporarily add the controller.kubernetes.io/pod-deletion-cost annotation (available in beta since k8s 1.22) to pods that are currently handling requests? Maybe also cc @mattmoor.
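For illustration, a minimal client-go sketch of setting that annotation (the annotation itself is real; the surrounding function and where Knative would call it are just assumptions):

```go
package deletioncost

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markBusy raises the deletion cost of a pod that is handling requests, so
// the ReplicaSet controller prefers to delete cheaper (idler) pods first.
func markBusy(ctx context.Context, client kubernetes.Interface, ns, pod string, inFlight int) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`,
		inFlight))
	_, err := client.CoreV1().Pods(ns).Patch(ctx, pod,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```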
Hi :wave:
Opened follow-up issue #13075
Ideas are welcome!
In what area(s)?
/area autoscale
Describe the feature
Following the conversations in https://github.com/knative/serving/issues/3759, we have noticed that Knative app scale-down is done by changing the replica count on the application deployment. This delegates to k8s the choice of which pods to kill, which may result in suboptimal scale-down strategies. To give an example, with instances A and B of the same Knative application, at the time of scaling down, if instance A is serving multiple requests and instance B is idle, I would expect the autoscaler to try to terminate instance B, converge as quickly as possible, and save on resource usage. However, with k8s having no knowledge of how busy the pods are and what requests they are serving, changing the replica count will result in k8s randomly selecting the pod to terminate.
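To illustrate the mechanism: scale-down today amounts to something like the following against the Deployment's scale subresource, with no way to say which pods should go (a simplified client-go sketch, not the actual autoscaler code):

```go
package scaler

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleTo sets the desired replica count on a Deployment. Note that there is
// no way to express *which* pods should be removed; the ReplicaSet controller
// picks them based on its own readiness/phase-based ranking.
func scaleTo(ctx context.Context, client kubernetes.Interface, ns, deployment string, replicas int32) error {
	scale, err := client.AppsV1().Deployments(ns).GetScale(ctx, deployment, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = replicas
	_, err = client.AppsV1().Deployments(ns).UpdateScale(ctx, deployment, scale, metav1.UpdateOptions{})
	return err
}
```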