kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Ability to Scale Karpenter Provisioned Nodes To 0 On Demand Or By Schedule During Off Hours #1177

Open ronberna opened 5 months ago

ronberna commented 5 months ago

Description

What problem are you trying to solve? We've recently begun the migration from ASGs (AutoScaling Groups) and CAS (Cluster Autoscaler) to Karpenter. With ASGs, as part of our cost-saving measures, our EKS clusters are scaled down during off hours and weekends in lower environments, and then scaled back up during office hours. This was performed by running a Lambda at a scheduled time to set the min/max/desired settings of the ASG to 0. The current values of the min/max/desired settings before the update to 0 are captured and stored in SSM. For the scale up, the Lambda reads this SSM parameter to set the ASG min/max/desired values. With Karpenter, this is not possible.

As a workaround, we have a Lambda that patches the CPU limit of the NodePool and sets it to 0 so that no new Karpenter nodes will be provisioned. The Lambda then takes care of deleting the previously provisioned Karpenter nodes. We have a mix of workloads running in the cluster, some using HPA and some not, so scaling down all of the Deployments to remove the Karpenter-provisioned nodes will not work. It has also been suggested to delete the NodePool and reapply it via a CronJob. This option will also not work, since some of our clusters are in a controlled environment.
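
For reference, a minimal sketch of the two kubectl operations such a workaround performs, assuming a NodePool named "default" (the name and label value are illustrative):

# Prevent new capacity by dropping the NodePool CPU limit to 0
kubectl patch nodepool default --type merge -p '{"spec":{"limits":{"cpu":"0"}}}'
# Then delete the nodes that NodePool already provisioned; Karpenter's node
# finalizer is expected to drain each node and terminate the backing instance
kubectl delete nodes -l karpenter.sh/nodepool=default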

The ask here is to introduce a feature in Karpenter that handles scaling all Karpenter-provisioned nodes down and back up on demand, either via a flag or by reacting to an update of the CPU limit: Karpenter would stop provisioning new nodes and also clean up previously provisioned ones, without requiring additional CronJobs, Lambdas, or NodePool deletions.

How important is this feature to you? This feature is important because it will help with AWS cost savings by not having EC2 instances running during off hours and by not having to add additional components (Lambdas, CronJobs, etc.) to aid with scaling Karpenter-provisioned instances.

jonathan-innis commented 4 months ago

Karpenter would stop provisioning new nodes and also clean up previously provisioned ones, without requiring additional CronJobs, Lambdas, or NodePool deletions

We've had some conversation about this among the maintainers. IMO, this feature basically comes down to: should we consolidate based on limits? If you apply a more restrictive limit to your NodePool, does that mean you are implying that the NodePool should deprovision nodes until it gets back into compliance with its limits?

IMO: This strikes me as an intuitive desired state mechanism -- you have set a new desired state on your NodePool -- implying that you no longer support a given capacity. Now comes the more difficult question: Should Karpenter force application pods off of your nodes unsafely if you have enforced stricter limits on your NodePool and those pods have nowhere else to schedule? This breaks current assumptions that we have around the safety of disruption -- that is, if we disrupt a node (unless it is due to spot interruption), we do so on the assumption that we can reschedule the existing pods on the node onto some other capacity (either existing or new). This feature would have us force delete pods regardless of whether they can schedule or not -- which starts to look a bit scary.

This option will also not work since some of our clusters are in a controlled environment

I know you mentioned that you can't delete the NodePool to spin down nodes, but I'm curious what you mean by "controlled environment". Wouldn't updating the limits also cause similar changes to your cluster, which I assume would also be subject to this "controlled environment"?

ronberna commented 4 months ago

If you apply a more restrictive limit to your NodePool, does that mean you are implying that the NodePool should deprovision nodes until it gets back into compliance with its limits?

Yes, I believe this is what is being implied. If the CPU limit is set to 0, that would mean that we want to deprovision existing nodes, similar to setting the min/max/desired values to 0 for an ASG. Even something similar to an ASG Scheduled Action would work, where I could create a configuration inside the NodePool to deprovision existing nodes and not spin up any additional ones.

A flaw that we've uncovered with our current approach of using a Lambda to patch the CPU limit to 0 and then delete existing Karpenter-provisioned nodes is that if a node was provisioned right before the CPU limit was set and is now in the "NotReady" state, it will not get cleaned up, as it is not yet recognized as an active node, and it will remain running. We're having to come up with a solution that reruns the Lambda multiple times to make sure such nodes get cleaned up. We not only have to delete the finalizer from the node before deleting it from the cluster, but we also have to terminate the node in AWS, since a kubectl delete node removes it from the cluster but does not terminate it in AWS. As long as the node is still in AWS, Karpenter will not provision a new node.
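
For illustration, a rough sketch of the extra cleanup described above for a straggler node, assuming the AWS CLI is available; the node name is hypothetical, and whether stripping the finalizer is appropriate depends on your setup, since the finalizer normally handles draining and instance termination:

NODE="ip-10-0-1-23.ec2.internal"   # hypothetical NotReady straggler

# Strip Karpenter's finalizer so the delete is not blocked, then delete the Node object
kubectl patch node "$NODE" --type merge -p '{"metadata":{"finalizers":null}}'
kubectl delete node "$NODE" --wait=false

# The EC2 instance keeps running, so look it up by private DNS name and terminate it
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=$NODE" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text)
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"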

Should Karpenter force application pods off of your nodes unsafely if you have enforced stricter limits on your NodePool and those pods have nowhere else to schedule?

Yes. This is the behavior that currently happens for ASGs. Our pods stay in a Pending state until the next workday, when the ASG min/max/desired settings are updated back to their work-hour values. With no nodes running during non-work hours, our savings are pretty significant.

I know you mentioned that you can't delete the NodePool to spin down nodes but I'm curious what you mean by "controlled environment".

By controlled environment we mean that certain changes to the environment require going through change control (testing the change, creating a change request, verifying test results, getting approvals to implement the request, implementing the change, verifying the change). Doing this daily is not feasible, IMO. Yes, technically patching the limit is subject to the "controlled environment", but based on our current process it's easier to patch the CPU limit with a scheduled Lambda function than to delete an entire Kubernetes resource and go through the steps mentioned above in order to kick off a pipeline to get the resource re-applied. That's why the ask here is to have this feature built into Karpenter. If designed properly, IMO, this would be a huge win.

cp1408 commented 3 months ago

You can use the YAML below to delete and re-create Karpenter nodes. The logic is to delete the NodePool on Friday and re-create it on Sunday. I have tested this in non-prod and it has been running without any issues for a while.

---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: karpenter-cron
  name: karpenter-cron
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: karpenter-cron
  name: karpenter-cron
  namespace: karpenter-cron
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: karpenter-cron
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "describe"]
    #
  - apiGroups: ["karpenter.sh"]                 
    resources: ["nodepools"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete", "describe"]
    #
  - apiGroups: ["batch"]                 
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch", "create", "describe"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: karpenter-cron
subjects:
- kind: ServiceAccount
  name: karpenter-cron
  namespace: karpenter-cron
roleRef:
  kind: ClusterRole
  name: karpenter-cron
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-cron-cm
  namespace: karpenter-cron
data:
  karpenter-nodepool.yaml: |
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default
    spec:
      disruption:
        budgets:
        - nodes: 10%
        consolidationPolicy: WhenUnderutilized
        expireAfter: 720h
      limits:
        cpu: 1000
      template:
        spec:
          nodeClassRef:
            name: default
          requirements:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: karpenter.k8s.aws/instance-category
            operator: In
            values:
            - t
            - r
            - m
            - c
          - key: karpenter.k8s.aws/instance-generation
            operator: Gt
            values:
            - "2"
          - key: karpenter.sh/capacity-type
            operator: In
            values:
            - on-demand
          - key: karpenter.k8s.aws/instance-cpu
            operator: In
            values:
            - "4"
            - "8"
            - "16"
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: karpenter-nodepool-delete-cron
  namespace: karpenter-cron
spec:
  schedule: "55 17 * * FRI"
  startingDeadlineSeconds: 20
  successfulJobsHistoryLimit: 1
  suspend: false
  jobTemplate:
    spec:
      completions: 1
      ttlSecondsAfterFinished: 10
      parallelism: 1
      template:
        spec:
          containers:
          - name: karpenter-scale
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              echo "List all the karpneter nodes"
              kubectl get nodes -l karpenter.sh/nodepool
              echo "List nodepool"
              kubectl get nodepool
              echo "Deleting NodePool"
              kubectl delete nodepool default
              sleep 5s
              echo "List all the karpneter nodes"
              kubectl get nodepool -A
              kubectl get nodes -l karpenter.sh/nodepool
              echo "script executed"
              echo "completed"
          restartPolicy: OnFailure
          serviceAccountName: karpenter-cron
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: karpenter-nodepool-create-cron
  namespace: karpenter-cron
spec:
  schedule: "55 17 * * SUN"
  startingDeadlineSeconds: 20
  successfulJobsHistoryLimit: 1
  suspend: false
  jobTemplate:
    spec:
      completions: 1
      ttlSecondsAfterFinished: 10
      parallelism: 1
      template:
        spec:
          containers:
          - name: karpenter-scale
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              echo "creating nodepool"
              kubectl apply -f /home/karpenter-nodepool.yaml
              echo "nodepool created"
              kubectl get nodepool -o yaml
              sleep 5s
            volumeMounts:
            - name: karpenter-nodepool
              mountPath: /home
          restartPolicy: OnFailure
          serviceAccountName: karpenter-cron
          volumes:
            - name: karpenter-nodepool
              configMap:
                name: karpenter-cron-cm

ronberna commented 3 months ago

Unfortunately, deleting and reapplying NodePool resources is not an option for us. What would be ideal, IMO, would be to have something like the disruption budget schedule that we could set to basically scale down all instances provisioned by a given NodePool.
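
For context, NodePool disruption budgets in recent Karpenter versions do accept a schedule, but a budget of nodes: "0" only blocks voluntary disruption during that window; it does not actively drain existing nodes, which is the gap being asked for here. A minimal sketch of the existing syntax in a NodePool spec:

spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    budgets:
    - nodes: "0"               # allow no voluntary disruptions...
      schedule: "0 8 * * 1-5"  # ...starting at 08:00 on weekdays...
      duration: 10h            # ...for 10 hours, i.e. during business hours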

felipewnp commented 3 months ago

I've stumbled upon this issue after doing the same thing @cp1408 suggested.

My cronjob does:

I think the ideal scenario is something like this:

  1. Set the Karpenter NodePool limits.cpu to 0
  2. There would be a flag like driftConsolidation: soft / hard (to address what @jonathan-innis said:)

Should Karpenter force application pods off of your nodes unsafely if you have enforced stricter limits on your NodePool and those pods have nowhere else to schedule?

  3. Karpenter starts a soft / hard drift consolidation

qoehliang commented 1 month ago

Has anyone stumbled across a decent solution for partially shutting down Karpenter-provisioned nodes when Karpenter and its NodePools are defined in a GitOps tool like Argo CD with self-healing and automated sync enabled? If I delete a NodePool, Argo CD will re-sync/re-create the NodePool object as it is defined in a GitHub repository.

One scenario we have considered is terminating Argo CD prior to deleting the NodePool or patching the NodePool's limits.cpu to 0. This, however, doesn't come without flaws, as some of our consumers require granular/partial shutdown of their NodePools. For example, I want to shut down NodePool A, which is used by team A, but keep NodePool B up, which is used by team B, who need their nodes up. By terminating Argo CD, we effectively halt all ability for team B to perform GitOps-related changes on the cluster. But if we keep Argo CD up, then it will reconcile NodePool A, which is not desired.

Another option would be to automate commits to our upstream GitHub repositories to comment out the NodePool specification, but we were hoping to avoid this, as it would flood our GitHub repository with daily shutdown and startup commits.

Finally, we considered scaling down Deployments/StatefulSets/Jobs to allow Karpenter to automatically shut down NodePools, but again, the majority of the workloads are deployed via Argo CD, which will reconcile the replica state (as most of our consumers define a hard-coded replica count in their GitHub repository). We would be left with the same problem as above, where we would either have to terminate Argo CD so it doesn't re-sync the workloads, or force all of our users to stop defining replica counts in their workloads and rely on things like HPAs.

The most intuitive option seems to be directly committing changes to the GitHub repository that Argo CD watches, but I was wondering if anyone has faced similar issues and has any suggestions for alternative approaches to enable granular shutdown of Karpenter-provisioned nodes.

Roberdvs commented 1 month ago

Has anyone stumbled across a decent solution for partially shutting down Karpenter-provisioned nodes when Karpenter and its NodePools are defined in a GitOps tool like Argo CD with self-healing and automated sync enabled?

Check ArgoCD's sync windows.

We're currently using them to avoid the GitOps reconciliation when scaling down the Deployments off-hours like you mention, but you could also use them to prevent NodePool recreation if you handle those via GitOps.
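
A minimal sketch of such a deny window on an Argo CD AppProject, assuming the NodePools are delivered by an Application named karpenter-nodepools (the project and application names are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  # Block automated sync (and thus NodePool recreation) from Friday evening through the weekend
  syncWindows:
  - kind: deny
    schedule: "0 18 * * 5"    # opens at 18:00 every Friday
    duration: 62h             # and lasts until early Monday
    applications:
    - karpenter-nodepools     # illustrative Application name
    manualSync: true          # still allow manual syncs during the window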

Pilotindream commented 3 weeks ago

Hello @ronberna, could you share an example of the Lambda that you mention here: "As a workaround, we have a Lambda that patches the CPU limit of the NodePool and sets it to 0 so that no new Karpenter nodes will be provisioned. The Lambda then takes care of deleting the previously provisioned Karpenter nodes."

Am I right that it first sets the CPU limit on the NodePool to 0, then deletes the nodes, and then terminates the EC2 instances in AWS?

felipewnp commented 3 weeks ago

@Pilotindream I can't speak for @ronberna, but I do this as well, and yes.

This is the right order.

I can bring you the shell script tomorrow.

In my case, I run it inside my Kubernetes cluster as a CronJob, since I have a pair of nodes that are not managed by Karpenter.

Pilotindream commented 3 weeks ago

@felipewnp, thanks for your reply. It would be nice if you could share an example of the script; I will wait for your reply. Also, how do you deal with the finalizer on the nodes that you are deleting, since a simple kubectl delete node does not work until I manually delete the finalizer? Thanks a lot again!

barryib commented 3 weeks ago

@olsib wrote a great blog post on how to scale down to zero (for now) on staging environments https://aircall.io/blog/tech-team-stories/scale-karpenter-zero-optimize-costs/.

felipewnp commented 3 weeks ago

@Pilotindream the link provided by @barryib is the right one, you can go from there!

wa20221001 commented 1 week ago

@olsib wrote a great blog post on how to scale down to zero (for now) on staging environments https://aircall.io/blog/tech-team-stories/scale-karpenter-zero-optimize-costs/.

Thanks so much for this, great read! I wonder how you would deal with making sure the CPU limit is in sync with git (especially if it is updated, say, every few days)? We did a quick test and saw the CPU limit is never synced back to git (as expected). Is using namespace resource quotas enough in your use case?

felipewnp commented 1 week ago


Thanks so much for this, great read! I wonder how you would deal with making sure the CPU limit is in sync with git (especially if it is updated, say, every few days)? We did a quick test and saw the CPU limit is never synced back to git (as expected). Is using namespace resource quotas enough in your use case?

If you use GitOps, then in the script where you change the Karpenter NodePool CPU limit, you could commit the change to your git repo as well.
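
For example, a rough sketch of mirroring the patched limit back into the repo from the same scheduled script, assuming the NodePool manifest lives at an illustrative path and yq v4 is available:

# After patching the live NodePool CPU limit to 0, commit the same change to the GitOps repo
git clone git@github.com:example-org/gitops.git && cd gitops
yq -i '.spec.limits.cpu = "0"' clusters/dev/karpenter-nodepool.yaml   # illustrative repo and path
git commit -am "Scale Karpenter NodePool to 0 for off-hours"
git push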

olsib commented 1 week ago

@wa20221001 if you use ArgoCD you can use ignoreDifferences as described in the blog post. Here is a snippet for you to review.
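
A minimal sketch of what such an ignoreDifferences entry can look like on the Application that manages the NodePools (the Application name, repo URL, and path are illustrative, not necessarily what the blog post uses):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: karpenter-nodepools   # illustrative Application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/gitops.git   # illustrative repo
    path: clusters/dev
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
  # Ignore out-of-band changes to the CPU limit so a scheduled job can patch it
  # without Argo CD's self-heal reverting it
  ignoreDifferences:
  - group: karpenter.sh
    kind: NodePool
    jsonPointers:
    - /spec/limits/cpu
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
    - RespectIgnoreDifferences=true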