kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Scale down a deployment by removing specific pods #45509

Closed roberthbailey closed 3 years ago

roberthbailey commented 7 years ago

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): deployment scale down pod


Is this a BUG REPORT or FEATURE REQUEST? (choose one): FEATURE REQUEST

It would be convenient to be able to scale down a deployment by N replicas by choosing which pods to remove from the deployment.

Ideally, this would have a similar interface as for removing selected instances from a GCE managed instance group (https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/delete-instances).
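For reference, that interface looks roughly like this (group, instance, and zone names are placeholders):

```bash
# Remove two named instances and shrink the managed instance group's target size accordingly.
gcloud compute instance-groups managed delete-instances my-mig \
    --instances=worker-1,worker-2 --zone=us-central1-a
```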

Today, you can approximate this by issuing API calls in rapid succession that might sometimes result in that effect (remove the relevant label from the pod in question, scale the deployment and its underlying replicaSet down, then delete the pod), but it's hacky and relies on winning a race.
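A rough sketch of that racy sequence (deployment, pod, and label names are placeholders):

```bash
# 1. Remove the label the ReplicaSet selector matches on, so the pod stops counting
#    toward the deployment.
kubectl label pod my-deployment-abc123 app-
# 2. Scale the Deployment (and its underlying ReplicaSet) down by one, hopefully
#    before the controller creates a replacement for the orphaned pod.
kubectl scale deployment my-deployment --replicas=2
# 3. Delete the now-unmanaged pod.
kubectl delete pod my-deployment-abc123
```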

@kubernetes/sig-apps-feature-requests

davidopp commented 7 years ago

Is this a duplicate of #4301?

gfernandessc commented 7 years ago

Yes, it seems like it could be achieved after #4301 is done. However, it still looks like a hack, given your summary of the likely implementation of #4301:

For that we are proposing to use something called evictionCost which is localized per-controller and is only used by the controller. We are not planning to implement that feature soon.

The simplest thing we need is a command that performs "delete this pod and scale the replication controller down by one". Hopefully that could be implemented sooner than #4301.

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

roberthbailey commented 6 years ago

Ping @kubernetes/sig-apps-feature-requests -- this is a specific request from a user. Can you confirm that you've seen the request and whether it falls into your existing backlog? (I understand that it was probably below the radar during the v1 migration last year)

enisoc commented 6 years ago

This isn't in our existing backlog as far as I know. Can you elaborate on the use case for choosing which Pods to delete during scale down? Is it a human doing a one-time repair operation? Or some higher level automation that knows more about which Pods ought to be preferred for deletion?

In other words, is this something that could be automated if we enhance or allow customization of the deletion precedence rules? Or is the need really to delete an arbitrary set of Pods that no automation could be expected to predict?

gfernandessc commented 6 years ago

Anthony-- the original use case that prompted this feature request was the following:

We have a single node pool for data nodes, and a separate deployment for each type of index. The deployments share the node pool, but each pod takes up a whole node.

In certain scenarios we want to reduce one deployment by 1 (or N) and increase a different deployment by 1 (maintaining the same total number of nodes). To avoid disrupting serving, we would manually select the pods serving the fewest indices (or manually migrate the indices off that node), and remove those specific pods.

IIUC the current deletion precedence should allow for a quick succession of "delete pod" -> "scale deployment down" to end up with the right state, as long as there's enough 'grace period' or the new pods take long enough to get into the Ready state.

That might actually be sufficient for our specific use case, since pods stay in the Terminating status for a while.

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten /remove-lifecycle stale

fejta-bot commented 6 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

roberthbailey commented 6 years ago

Re-opening this issue as I don't feel the use case was fully explored or a reasonable workaround found.

roberthbailey commented 6 years ago

To be clear, I'm ok if we close this feature request consciously because we've chosen not to implement it. But I think it's a reasonable request and should at least be considered.

jan-g commented 6 years ago

/remove-lifecycle rotten

jan-g commented 6 years ago

This would be very handy to have; I'm using mini(-ish)kube as a testing environment and the ability to permanently kill off specific pods (without replacement) is precisely what I'm after for staging some system tests.

soltysh commented 6 years ago

Personally, I don't see this as reasonable for deployments. By their nature, deployments and rs/rc are not bound to any specific pods; all that matters is their number. If you care about specific pods, you should work with a StatefulSet, imo.

antoinne85 commented 6 years ago

I wound up on this issue because I was considering how my company might implement our current worker pattern within Kubernetes (with some optimizations).

In our case, we have workers that scale up and down as the workload increases/decreases. At some point, though, they run out of work (e.g. a queue is completely emptied), and you'd want to start spinning down the workers (pods, in this case) that no longer have anything to do. A plain scale operation doesn't seem to fit, because you don't control which pod gets whacked (it could be one of the workers still in progress).

It seems that you could accomplish a similar end goal (for us, anyway) with the use of a Job, but the operational semantics are a little different with those.

For example, at least in our case, if you use a Deployment, as work finishes up the thing that needs to be automated is: "how many of these do I need running at once?", and that number will eventually move to 0.

However, if you use a Job, the thing that needs to be automated is: "Do I currently have one of these Job types in progress? If so, and it hasn't yet been marked as completed and is waiting for all of the pods to finish, how many do I need running at once? If so, and it has already been marked as completed, how many more should I fire up, and is it okay for this pre-existing Job to keep running (possibly for a long time) even though some of the pods under it have completed? If I don't already have one of these Job types in progress, go fire one up."

All of that is just a long way of saying that because Jobs have some built-in control behaviors, there's more to consider when looking to scale them up/down.

Now, all of this is just my perspective based on my own design goals, but I would think they're not terribly far off from where some other folks might like to land.

chavacava commented 6 years ago

In my company we have the exact same case as @antoinne85: workers that pick tasks from a queue. When the queue is (momentarily) empty we need to downscale the workers, but only those that are idle.

For example, for scaling down from 5 to 3, we need/dream of something like:

kubectl scale --replicas=3 deployment/hello-server --pods=hello-server-66cb56b679-dkvfj,hello-server-66cb56b679-xc22k

tigh-latte commented 6 years ago

@antoinne85 I'm in the exact same boat. My pods have jobs that could last anywhere from 3 minutes to 20. It's kind of sad when, 10 minutes into a 20-minute job, the pod gets killed because my 3-minute task in another pod finished.

jeremywadsack commented 6 years ago

@antoinne85, @chavacava, @Tigh-Gherr we use exactly that pattern. We use a Kubernetes Job to run a worker that scales up (starts the Job) when items are added to the queue. When the queue is empty the worker terminates (successfully) and the Job completes. I think a Job is exactly what you are looking for, rather than selecting pods within a deployment to terminate.

See https://github.com/keylimetoolbox/resque-kubernetes for our implementation for Resque and Resque-backed ActiveJob.
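A minimal sketch of that pattern (name, image, and worker command are placeholders): a work-queue style Job whose pods exit successfully once the queue is drained, which completes the Job.

```bash
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-worker                             # placeholder name
spec:
  parallelism: 3                                 # run up to 3 workers at once
  # .spec.completions is left unset: once one pod exits successfully (queue empty),
  # no new pods are created and the Job completes when the rest finish.
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/queue-worker:latest   # placeholder image
        # Placeholder command: the worker processes items and exits 0 when the
        # queue is empty.
        command: ["/bin/sh", "-c", "run-worker --drain-queue"]
EOF
```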


I am interested in this because sometimes a given pod in a deployment becomes unhealthy and I'd like to manually remove the pod and scale down the deployment. It happens very rarely and is mostly an issue of something else we need to address (the unhealthy pod).

antoinne85 commented 6 years ago

@jeremywadsack The job pattern can work, but it introduces more things to keep track of for the end developer.

For example, suppose my queue gets filled with 10 items, so I start a Job and scale it.

At some point the queue is empty and at least one of the pods exits successfully. But there are still others out there wrapping up their work.

So if more messages come in between the time the queue empties and the time all the Jobs exit, we're left trying to sort out questions about how many instances of the original job are still running and how many we should start.

Of course, if you want to allow unbounded compute usage you can just start up more Jobs and they'll finish whenever they finish, but if you're trying to constrain a given workload's resource consumption that's a lot of bookkeeping. Plus, perhaps most importantly, Deployments work with the HPA, making it easy to set up your scaling goals.

jeremywadsack commented 6 years ago

@antoinne85 Good points.

At the moment our implementation is naive and adds a new Job for each worker. So when the queue is empty a pod terminates and the Job completes. On the other end, the enqueue hook spins up a new Job only if the total number of jobs is less than the set maximum.

I would like to use Job scaling (rather than a Job per worker), but I haven't dug into that. I would expect to be able to implement it with a fixed completion count (.spec.completions) on the termination side. The scaling side would be a bit more complex, but I think we would just look at the number of running (not completed) pods in the Job and increase the total .spec.completions to add another worker.

I looked at the HPA when we went down this road, and the custom metrics integration seemed (to me) like much more effort than having pods terminate when the queue was empty. How do you report queue lengths for the HPA? Does that require a custom metrics server?

antoinne85 commented 6 years ago

@jeremywadsack We haven't actually implemented the HPA yet, but we know we're going to need it for other aspects of our infrastructure as we continue to migrate applications, so we're perhaps a little more open to the additional setup/overhead for that kind of thing.

Though, as I said, we haven't implemented it yet, we expect to gather metrics through Prometheus, expose them through the Custom Metrics API, and scale with the HPA from those. For us, this is to normalize and streamline the way our teams collect metrics and configure scaling.
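The wiring we have in mind would look roughly like this (metric name, targets, and the presence of a Prometheus adapter serving the external metrics API are all assumptions):

```bash
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2beta2      # autoscaling/v2 on newer clusters
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa                   # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker                     # placeholder deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready   # placeholder metric exposed by the adapter
      target:
        type: AverageValue
        averageValue: "10"           # aim for ~10 queued messages per pod
EOF
```

Note that this still doesn't control which pod is removed on scale-down, which is the gap this issue is about.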

This seems very much like the classic case of "no right answer" where everyone has to make the tradeoffs that work for their team. Obviously, everyone is going to do whatever it takes to get a good setup for their organization. For us, if that means we have to do some bookkeeping on Jobs to get where we want to be, then that's what we'll do. It would be nice, however, if this kind of thing were a planned application strategy in Kubernetes. Maybe it doesn't get implemented with deployments--maybe Jobs get some fancy new feature--the final implementation could go a dozen different directions, but some K8S-level support for getting finer control over which instances of a scalable resource get blown away or a way to "recycle" an exited Job or some other option (most likely much better than my offerings) would be welcomed by my organization.

atomaras commented 6 years ago

We have the exact same scenario that @chavacava described. This pattern can't be achieved properly with Jobs.

ant31 commented 6 years ago

@antoinne85 Not only the code but also the task of every pod in a ReplicaSet should be the same. If they are even slightly different, they should not be grouped under the same Deployment/RS. If your workers/pods are processing multiple kinds of tasks, then they are not identical; StatefulSets or Jobs may fit better. Also, I would consider creating a deployment per worker, which gives fine-grained control over each individual pod when selecting which one to remove.

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

canhnt commented 6 years ago

/remove-lifecycle stale

chaolou commented 5 years ago

We also need this feature in order to keep connections uninterrupted when scaling down. Currently a StatefulSet only supports deleting pods in order, not specific pods chosen by the user.

BenStull commented 5 years ago

Not being able to specify which pod will be killed when downscaling a StatefulSet means that it's impossible to target a specific pod's PV for upgrade. This is important for being able to scale up variable-performance PaaS managed disks in cloud services.

s2maki commented 5 years ago

I am having the same issue, and have resorted to a mechanism of "if every pod on a node is declaring itself idle, cordon the node and terminate it". Then, when the ready count of the deployment is less than the desired count, drop the desired count.

It's working, but it lends itself to race conditions and doesn't play well with other apps in the same cluster.

I keep thinking that maybe there's another way to go about this rather than "remove one without replacing it" that others are suggesting.

When a pod becomes busy, it could tell k8s that, if there's a scale-down event, it would prefer to stick around. It's not a hard rule, merely a suggestion. Each pod could set that flag on itself when it starts a lengthy operation, and clear it when it's done. When a deployment's scale is reduced, k8s would first pick pods not marked as busy to terminate, and only kill a busy pod if there are no idle ones left.
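For what it's worth, a preference of roughly this shape eventually shipped as the controller.kubernetes.io/pod-deletion-cost annotation (beta since Kubernetes 1.22): the ReplicaSet controller prefers to delete pods with a lower cost. A sketch, with a placeholder pod name:

```bash
# A worker marks itself as expensive to delete when it starts a long task...
kubectl annotate pod worker-abc123 \
  controller.kubernetes.io/pod-deletion-cost=100 --overwrite
# ...and removes the hint (back to the default cost of 0) once it goes idle.
kubectl annotate pod worker-abc123 controller.kubernetes.io/pod-deletion-cost-
```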

jwalters-gpsw commented 5 years ago

Our organization is also interested in this, since we have a situation similar to the one described here. In-progress worker tasks can be very long, and there is also significant variability in duration between the queue-reading workers.

Two things. First, where in the k8s code is the logic that selects which pod(s) to remove when scaling down?

Also, perhaps there could be a probe, similar to the liveness and readiness probes, that would indicate evictability. It seems that could be done in a backward-compatible way.

dseapy commented 5 years ago

Would it be possible to set the grace period really high and, when the work queue is empty, just scale the deployment to 0? Idle pods would exit when they receive the TERM signal, while busy pods continue working until they finish.
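A sketch of a worker entrypoint under that scheme (the queue commands are placeholders, and the pod spec would need a very large terminationGracePeriodSeconds):

```bash
#!/bin/sh
# Worker entrypoint sketch: keep pulling work until Kubernetes sends SIGTERM
# (e.g. when the Deployment is scaled to 0). A busy pod finishes its current
# item before exiting; an idle pod exits within a few seconds.
terminating=0
trap 'terminating=1' TERM

while [ "$terminating" -eq 0 ]; do
  item=$(fetch_next_queue_item)     # placeholder: prints nothing if the queue is empty
  if [ -n "$item" ]; then
    process_item "$item"            # placeholder: may run for many minutes
  else
    sleep 5                         # idle: poll the queue again shortly
  fi
done
```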

s2maki commented 5 years ago

I'm struggling to understand how this isn't a more general problem with deployments. Even in the example given in the documentation for deployments (describing the rollout of an nginx upgrade), it seems like disruption of active work is a problem.

In the provided example, any HTTP connections open at the time a pod is deleted are going to be broken. The effect is far worse for sites where connections need to stay open for a period of time (streaming, downloads, websockets, etc.), but it isn't zero for normal short-lived connections.

Is this resolved by the grace period? I mean, does nginx go into "graceful shutdown" and close its listener upon receipt of the TERM signal, so that only connections that survive for 30 more seconds get disrupted? Or are all active connections dropped, and web clients have to re-establish/retry a connection? That's not in the HTTP spec (except for the case where no response data has been received at all) and exposes race conditions (what if the request was a POST and the connection gets closed? re-POSTing automatically could be dangerous).
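(As an aside, a common mitigation for the nginx case is a preStop hook that starts a graceful drain before the TERM arrives; a sketch, assuming the stock nginx image and a grace period longer than the drain:)

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-graceful                    # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels: {app: nginx-graceful}
  template:
    metadata:
      labels: {app: nginx-graceful}
    spec:
      terminationGracePeriodSeconds: 60   # must outlast the drain below
      containers:
      - name: nginx
        image: nginx:1.25
        lifecycle:
          preStop:
            exec:
              # Start a graceful drain (stop accepting new connections, let
              # in-flight requests finish) and hold off the kubelet's SIGTERM
              # for up to 30s while that happens.
              command: ["/bin/sh", "-c", "nginx -s quit; sleep 30"]
EOF
```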

Of course this is just one example, albeit the given one. But it seems to me that just about every single use case for a deployment has to face the issue of abandoning work in progress. The only real difference I can see between the nginx upgrade example and most of the use cases described in this thread is time. Maybe 30 seconds of grace is enough for most of the use cases envisioned by the designers of the deployments controller?

But just increasing the grace period isn't enough, since the controller logic wants to shut down a pod of its choosing, which may happen to be a busy one while other pods are idle and would be happy to shut down immediately. If the chosen pod is busy and many others aren't, the controller is going to get stuck waiting for the busy one and keep resources tied up unnecessarily.

It seems to me that a pod needs to be able to reject a termination request and have the deployment controller go looking for a different pod to shut down. Maybe the pod could issue a "cancel" response to a shutdown request, maybe it could tell the API that it expects to be busy at the start of the operation, maybe there could be a heartbeat API to reset/extend the grace period. I don't know.

Or maybe, since deployments are intended to deal with short-lived tasks, there needs to be a new controller type (or an extension of one of the existing ones) that considers the needs of queue-processing tasks where some of the operations could be long-running. Queue processing with long operations is quite a common pattern (even web serving is just a form of queue processing), and there doesn't appear to be any k8s controller designed to handle it.

marcomancuso commented 5 years ago

Anyone having success scaling up/down using custom controllers? Would it be feasible?

bsklaroff commented 5 years ago

My workaround is to have idle pods delete themselves specifically with kubectl delete pod {hostname} --timeout=1s, and then immediately scale down the deployment by one replica. This will kill only the idle pod and none of the running ones.

Unfortunately, deployment upgrades still kill my pods with long-running jobs. To combat this, I set the terminationGracePeriod to 100 days, and I catch the SIGTERM signal in all pod processes (e.g. using the signal library in Python). The SIGTERM handler just sets a global flag that makes the pod kill itself, instead of looking for a new job off the queue, once it's done with its current job. To set terminationGracePeriod on a deployed deployment: kubectl patch deployment {deployment_name} -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":8640000}}}}'

Finally, I'm running on Google Kubernetes Engine with their cluster autoscaler enabled, so I have to prevent the autoscaler from killing the nodes my pods are running on. You can mark a node as not eligible for scale-down with kubectl annotate node {node_name} cluster-autoscaler.kubernetes.io/scale-down-disabled=true --overwrite=true. Then set scale-down-disabled=false when there are no pods left running on the node.
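Pulled together, the workaround looks roughly like this (deployment, node, and replica numbers are placeholders; the first command runs inside the idle pod, whose hostname is the pod name):

```bash
# The idle pod deletes itself, then the replica count is dropped immediately so
# the deployment ends up one pod smaller instead of getting a replacement.
kubectl delete pod "$(hostname)" --timeout=1s
kubectl scale deployment worker --replicas=4        # previous count minus one

# Survive rollouts: a ~100-day grace period, with the worker trapping SIGTERM
# and finishing its current job before exiting.
kubectl patch deployment worker -p \
  '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":8640000}}}}'

# On GKE, keep the cluster autoscaler away from nodes that still have work;
# set this back to false once the node is empty.
kubectl annotate node my-node \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true --overwrite=true
```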

This is all a hairy kludge and I wish kubernetes had more built-in support for this use case. That said, my system is working for now. Best of luck to anyone struggling with similar problems.

lovejoy commented 5 years ago

I'd like to add a PR to support this feature: you annotate the pod that you want deleted when scaling down a deployment.

lovejoy commented 5 years ago

See PR #75763.

atomaras commented 5 years ago

@lovejoy An annotation is not flexible enough. There are much simpler ways that don't require the pod to call out to k8s. Lifecycle hooks would be better, something like preStop, canStop, and postStop.

atomaras commented 5 years ago

@lovejoy Also, this needs to happen at runtime. We do not statically know which pod is working on a long-running queue message, and that changes every second.

duglin commented 5 years ago

Just brainstorming ideas for how a user could ask for this... what if we allowed them to use a new field called spec.terminatePods, which contains a list of pod names to delete? Upon deleting each pod, the controller would decrement spec.replicas and remove that pod from the terminatePods list. This is similar to finalizers. I think this would remove the race condition people are worried about when trying to do a pod.Terminate() followed by a spec.replicas-- very quickly.
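Purely to illustrate the brainstorm (spec.terminatePods does not exist, so an apiserver would reject this today), the request might look like:

```bash
# Hypothetical: name the pods to remove; the controller would decrement
# spec.replicas itself as it deletes each one.
kubectl patch deployment worker --type=merge -p \
  '{"spec":{"terminatePods":["worker-66cb56b679-dkvfj","worker-66cb56b679-xc22k"]}}'
```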

Lots of edges to think about (e.g. what if that pod is already gone, do we still decrement?).

lee0c commented 5 years ago

I'm working with a similar scenario to @chavacava and @antoinne85, with the added complication that a job in the queue might be processed by a worker in several seconds or a couple of hours. This makes using Jobs a little more complex, as one worker could handle ten messages or just one, and many messages could come in in the time it takes a worker to finish one of the longer-running processes.

daudn commented 5 years ago

It would be really handy to be able to scale down only the pods that have no processes running on them.

I am implementing a system where tasks are added to a queue and picked up by workers (pods in a specific node pool). My tasks run from 5 to 40 minutes. Everything is smooth until I reach the last few tasks in my queue. Pseudo code for the events:

Kubernetes, HELP!

dror-weiss commented 5 years ago

@daudn , we're struggling with the same issue and as this issue is over 2 years old I'm having doubts it will be addressed soon.

Have you managed to figure out a solution?

daudn commented 5 years ago

@dror-weiss The only solution is to use CloudRun. There is a problem with that as well: CloudRun on GKE has a timeout, which it isn't supposed to have. Once the timeout for CloudRun on GKE is removed, I think that will be the most optimal solution. My work is on hold until Google fixes the issue and removes the timeout when running CloudRun on GKE.

chavacava commented 5 years ago

We are experimenting with a workaround based on Argo; things look good so far.

daudn commented 5 years ago

@chavacava I had a look into workflows and wasn't too impressed, since it's rather static: basically piping the output of the first 'process' as input for the next process.

Do you think it's worth me giving it a shot? Can you give me a little summary of how you are using Argo and for what purpose? If implemented in Argo, does Argo have full control of the nodes? Since my jobs are compute-intensive and long-running, I need my nodes to be scaled down immediately after the running processes complete.

chavacava commented 5 years ago

@daudn Argo, like any framework, has its limitations, but unlike CloudRun it can be deployed in our on-premises clusters, and that is a very important point for us. As you mention, in Argo you can pipe data (results) from one process to another, but you can also create artefacts that are saved in central storage (typically an object store); you then need some glue code to retrieve the right artefacts for each process. We have done some POCs with single-process workflows (we have a process that takes ~24h to complete) and with map-reduce workflows of shorter processes. We are still testing, but the big picture seems cleaner than the previous (fragile) hacks on K8s we have tested.

PS: on GCP, Argo nodes scale down about 10 minutes after going idle.

jeremywadsack commented 5 years ago

@daudn, while I'm looking at Argo, if you think it's overkill for your situation, it sounds like you might be able to accomplish what you need with Jobs and autoscaling. We use that now with compute-intensive jobs. By setting resource requests (or affinities) we can ensure that a job pod is scheduled into the right node group. With autoscaling, the nodes get added when the jobs are scheduled and removed when they complete.


daudn commented 5 years ago

@chavacava So even with Argo the scaling issue remains. The problem is, we do about 300 runs a year, and each run has 48 parallel processes on separate CPUs. That's 300 x 48 x 10 = 144,000 compute hours/year (simply waiting to scale down). The machines are 6-core with 24 GB. So 144,000 x $0.214 = $30,816 per year wasted waiting for nodes to scale down.

@jeremywadsack Jobs run on top of Kubernetes, so the underlying issue is still the whole problem: GKE doesn't allow nodes to be scaled down instantly.

xphh commented 5 years ago

Finally found this issue! This also bothers me a lot. So what is the solution? I found a similar approach that uses Jobs to terminate the pods that need to go. However, it feels weird to handle workers as Jobs. I hope for a built-in way.

daudn commented 5 years ago

@xphh if you don't have long-running tasks, CloudRun is the most efficient solution. However, CloudRun has a timeout of 15 minutes, so it is useless for my use case. Hope it helps!

xphh commented 5 years ago

@daudn Thanks for your advice, but we do have long-running tasks (mainly video streams). After discussing several possible approaches with my colleagues, we decided to use the readiness probe. As far as we know, Kubernetes scales down 'NotReady' pods with a higher priority, so we can send a signal to the program in the pod we want gone, telling it to fail its readiness probe. We know it's hacky and depends on the Kubernetes controller's internal implementation, but it's a really easy way to achieve our goal.
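A sketch of that trick (name, image, file path, and probe timings are assumptions): the probe checks a sentinel file, and the worker deletes the file when it's told to go away, so the pod turns NotReady and becomes the preferred victim on the next scale-down.

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-worker                            # placeholder name
spec:
  replicas: 5
  selector:
    matchLabels: {app: stream-worker}
  template:
    metadata:
      labels: {app: stream-worker}
    spec:
      containers:
      - name: worker
        image: example.com/stream-worker:latest  # placeholder image
        # The entrypoint creates /tmp/ready on startup; on a "please go away"
        # signal it deletes the file, the probe starts failing, and the pod
        # goes NotReady, making it a preferred candidate at scale-down.
        readinessProbe:
          exec:
            command: ["cat", "/tmp/ready"]
          periodSeconds: 5
          failureThreshold: 1
EOF
```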

daudn commented 5 years ago

@xphh I tried loads of hacky solutions. Unfortunately, none of them were suitable for a production-ready application. Have you implemented your solution, or are you still trying to? If you have implemented it and it works fine, do let me know; I can give the readiness probe another attempt!