practo / k8s-worker-pod-autoscaler

Kubernetes autoscaler for the workers. Resource is called WPA. Queues Supported: SQS, Beanstalkd.
https://medium.com/practo-engineering/launching-worker-pod-autoscaler-3f6079728e8b
Apache License 2.0

Does WPA kill pods if the queue length decreases? #144

Open robin-ency opened 2 years ago

robin-ency commented 2 years ago

We have deployed WPA on AWS EKS and are using it with some success. We typically ran with a maximum of 10 pods (replicas) and all went well. Now we have increased this to 50 and we are noticing that pods are randomly quitting for no discernible reason. It's like there is a service that is shutting down pods midway.

Does WPA kill pods midway as the queue length decreases? We have long-running tasks that need around 1-2 hours to process, so we need the pods to finish completely and then quit. WPA should only be responsible for scaling up; when scaling down, it should not kill existing running pods. Is this what you mean by "disruption"? Does "disruption" mean you will:

A. Kill running pods, or
B. Not create additional pods and leave the running pods running?

We are running a python app on EKS, this is the YAML for the app:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sqs-consumer
spec:
  selector:
    matchLabels:
      app: sqs-consumer
  replicas: 1
  template:
    metadata:
      labels:
        app: sqs-consumer
    spec:
      hostNetwork: true
      containers:
      - name: sqs-consumer
        image: XXXXXXXXXXXXXXXXXXXXXXX.dkr.ecr.XXXXX.amazonaws.com/
        env:
          - name: ENVIRONMENT
            value: "production"
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: CONTAINER_ID
            valueFrom:
              fieldRef:
                fieldPath: metadata.uid
        resources:
          limits:
            memory: "2Gi"
          requests:
            memory: "2Gi"

This is the current WPA config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workerpodautoscaler
  namespace: woker-pod-autoscaler
  labels:
    app: workerpodautoscaler
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: workerpodautoscaler
  template:
    metadata:
      labels:
        app: workerpodautoscaler
    spec:
      serviceAccountName: workerpodautoscaler
      tolerations:
        - effect: NoExecute
          operator: Exists
        - effect: NoSchedule
          operator: Exists
      containers:
        - name: wpa
          env:
          image: practodev/workerpodautoscaler:{{ WPA_TAG }}
          imagePullPolicy: Always
          command:
            - /workerpodautoscaler
            - run
            - --resync-period=20
            - --wpa-threads=10
            - --aws-regions={{ WPA_AWS_REGIONS }}
            - --sqs-short-poll-interval=20
            - --sqs-long-poll-interval=20
            - --k8s-api-qps=5.0
            - --k8s-api-burst=10
            - --wpa-default-max-disruption=100%
            - --queue-services=sqs,beanstalkd
            - -v=2
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 20Mi

And the other WPA config:

apiVersion: k8s.practo.dev/v1
kind: WorkerPodAutoScaler
metadata:
  name: wpa
spec:
  minReplicas: 0
  maxReplicas: 0
  deploymentName: sqs-consumer
  queueURI: https://sqs.XXXXXXXXXXXXXXXXXXXXXXX.amazonaws.com/XXXXXXXXXXXXXXXXXXXXXXX/
  targetMessagesPerWorker: 2
  secondsToProcessOneJob: 7200
  maxDisruption: "0%"

Any help would be greatly appreciated to help us resolve this issue of pods randomly shutting down midway processing.

justjkk commented 2 years ago

Hi Robin,

The WPA operator periodically checks queue metrics and other configuration parameters to calculate the number of pods that should be running, and then adjusts the deployment's desired replica count. With maxDisruption set to "0%", it does not partially scale down; scale up is still allowed and is not affected by the maxDisruption value. With maxDisruption set to "0%", only a complete scale down (to the minReplicas value) is allowed, and only when all the pods are idle (inferred from the CloudWatch metrics). If a massive scale down is happening, check whether the job's visibility timeout in SQS is set such that the job stays invisible to other workers for as long as it is being processed. Please refer to the GetDesiredWorkers and convertDesiredReplicasWithRules functions, which contain the relevant logic.
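
For illustration, a long-running consumer can keep extending the message's visibility timeout while the job runs, so the message stays invisible to other workers. This is only a minimal boto3 sketch; the queue URL, timing values and do_work handler are placeholders, not part of WPA:

import threading
import boto3

sqs = boto3.client("sqs")  # region and credentials come from the environment
QUEUE_URL = "https://sqs.<region>.amazonaws.com/<account>/<queue>"  # placeholder

def keep_invisible(receipt_handle, stop_event, extend_secs=900, interval=300):
    # Periodically re-extend the visibility timeout so the message stays
    # hidden from other workers for as long as this worker is processing it.
    while not stop_event.wait(interval):
        sqs.change_message_visibility(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=extend_secs,
        )

def handle(message, do_work):
    stop = threading.Event()
    heartbeat = threading.Thread(
        target=keep_invisible, args=(message["ReceiptHandle"], stop), daemon=True
    )
    heartbeat.start()
    try:
        do_work(message["Body"])  # the long-running job (1-2 hours)
    finally:
        stop.set()  # stop extending once processing is done (or failed)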

The pod disruption issue you are facing could be unrelated to WPA: the Kubernetes cluster autoscaler or some other factor (like spot instance replacement, node-pressure eviction, etc.) could be causing the pod to get rescheduled, interrupting the execution. You can confirm this by checking the Prometheus metrics or the cluster autoscaler logs.

I noticed that both minReplicas and maxReplicas are set to 0, which effectively disables the deployment by setting 0 desired pods. Is this a typo, or are you seeing pods come up even with maxReplicas set to 0?

Additionally, you can increase the verbosity of the WPA controller logs by changing -v=2 to -v=4, which will print more information that can help you understand the reasoning of the WPA controller. You can also remove ,beanstalkd from the --queue-services flag if you are only using SQS and not Beanstalkd.

robin-ency commented 2 years ago

Thanks for the reply @justjkk. Yes, this is the settings template file, so it shows 0; maxReplicas is typically set to a value in the 10-100 range depending on the environment's requirements.

robin-ency commented 2 years ago

Hi @justjkk I hope this image makes my question clear. When WPA decides to scale down completely, as you said, does this mean all the running pods will be killed?

And if the answer is yes, then how can I prevent this? I want all pods to run till they exit (end of program).

(image attached)

alok87 commented 2 years ago

Does the consumer delete the job as soon as it receives it, or does it remove the job from the queue only after the message has been processed?

robin-ency commented 2 years ago

Does the consumer delete the job as soon as it receives it, or does it remove the job from the queue only after the message has been processed?

At present the consumer only removes the job from the queue after it has been processed. Which is the better way?

alok87 commented 2 years ago

Yes that is the right way. Delete the job from the queue only when the processing finishes.
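
As a rough sketch of that pattern (the queue URL and do_work handler below are placeholders, not your actual code), the consumer deletes the message only after the work has finished:

import boto3

sqs = boto3.client("sqs")  # region and credentials come from the environment
QUEUE_URL = "https://sqs.<region>.amazonaws.com/<account>/<queue>"  # placeholder

def consume(do_work):
    # Receive one message at a time; delete it only after processing succeeds,
    # so a crashed or evicted worker leaves the job visible again for retry.
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            do_work(msg["Body"])      # process first...
            sqs.delete_message(       # ...then delete only on success
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )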

With maxDisruption=0%, pods should only scale down when all the jobs in the queue have been processed, queueSize=0, and nothing is being processed at that moment by any worker.

Also, as justjkk said: 1) did you check the visibility timeout? 2) check the node on which the scaled-down pod was running; did it shut down at the same time? If yes, then it is not in the control of the WPA autoscaler, and you may want to use on-demand nodes for such workers instead of spot nodes.

Can you share the WPA controller logs after setting -v=4 verbosity?

robin-ency commented 2 years ago

I will check these things and get back, thank you.