Hi Robin,
The WPA operator periodically checks queue metrics and other configuration parameters to calculate the number of pods that should be running, and then adjusts the deployment's desired pods value. With maxDisruption set to "0%", it doesn't partially scale down; scale-up is still allowed and is not affected by the maxDisruption value. With maxDisruption set to "0%", only a complete scale-down (to the minReplicas value) is allowed, when all the pods are idle (inferred from the CloudWatch metrics). If a massive scale-down is happening, check whether the job's visibility timeout in SQS is set such that the job is not visible to other workers for as long as it is being processed. Please refer to the GetDesiredWorkers and convertDesiredReplicasWithRules functions, which contain the relevant logic.
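To make the visibility-timeout point concrete, here is a minimal, hypothetical boto3 sketch (not code from this repo; the queue URL and process_job() are placeholders) of a worker that keeps extending a message's visibility timeout while a long job runs, so the message stays hidden from other consumers for as long as it is being processed:

```python
# Hypothetical sketch, not WPA or repo code: a long-running worker that keeps a
# received SQS message invisible to other consumers while it is being processed,
# by periodically extending the visibility timeout ("heartbeating").
import threading
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/long-jobs"  # placeholder

def keep_invisible(receipt_handle, stop_event, every=240, extend_to=300):
    # Every `every` seconds, push the visibility timeout out by another
    # `extend_to` seconds so the message never reappears mid-processing.
    while not stop_event.wait(every):
        sqs.change_message_visibility(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=extend_to,
        )

def process_job(body):
    ...  # placeholder for the actual 1-2 hour task

resp = sqs.receive_message(
    QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
    WaitTimeSeconds=20, VisibilityTimeout=300,
)
for msg in resp.get("Messages", []):
    stop = threading.Event()
    heartbeat = threading.Thread(
        target=keep_invisible, args=(msg["ReceiptHandle"], stop), daemon=True,
    )
    heartbeat.start()
    try:
        process_job(msg["Body"])
    finally:
        stop.set()  # stop extending once the job is done (or has failed)
```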
The pod disruption issue that you are facing could be unrelated to WPA and could instead be due to the Kubernetes cluster autoscaler or some other factor (like spot instance replacement, node-pressure eviction, etc.) that causes the pod to get rescheduled, thereby interrupting the execution. You can confirm this by checking the Prometheus metrics or the cluster autoscaler logs.
I noticed that both minReplicas and maxReplicas values are set to 0, which effectively disables the deployment by setting 0 desired pods. Is this a typo, or are you seeing pods coming up even with maxReplicas set to 0?
Additionally, you can increase the verbosity of the WPA controller logs by changing -v=2 to -v=4, which will print more information that can help you understand the thought process of the WPA controller. You can also remove ,beanstalkd from the queue_services parameter if you are using only SQS and not Beanstalkd.
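If you prefer to apply that change programmatically rather than editing the manifest by hand, here is a rough sketch using the Kubernetes Python client. The deployment name, namespace, container name, and the exact argument spellings are assumptions (taken from this thread) and will differ per install, so treat it as illustrative only:

```python
# Illustrative only: patch the WPA controller deployment to raise log verbosity
# and drop beanstalkd from the queue services. The deployment/container names
# and exact flag spellings below are assumptions -- check your own manifest and
# keep any other existing args when replacing the list.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "wpa",                 # assumed container name
                        "args": [
                            "--queue-services=sqs",    # ",beanstalkd" removed (SQS only)
                            "-v=4",                    # raised from -v=2 for detailed logs
                            # ...plus whatever other args your deployment already uses
                        ],
                    }
                ]
            }
        }
    }
}

# A strategic-merge patch matches the container by name and replaces its args.
apps.patch_namespaced_deployment(
    name="workerpodautoscaler",   # assumed deployment name
    namespace="kube-system",      # assumed namespace
    body=patch,
)
```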
Thanks for the reply @justjkk. Yes, this is the settings template file, so it is 0, and maxReplicas is typically modified to the 10-100 range depending on the environment's requirements.
Hi @justjkk, I hope this image makes my question clear. When WPA decides to scale down completely, as you said, does this mean all the running pods will be killed?
And if the answer is yes, then how can I prevent this? I want all pods to run till they exit (end of program).
Does the consumer delete the job as soon as it receives it, or does it remove the job from the queue only after the message has been processed?
At present the consumer only removes the job from the queue after it is processed. Which is a better way?
Yes, that is the right way. Delete the job from the queue only when the processing finishes.
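For anyone following along, a minimal sketch of that pattern in boto3 (the queue URL and handle() are placeholders, not taken from the actual app):

```python
# Minimal sketch of the recommended pattern: delete the message only after the
# work has finished. QUEUE_URL and handle() are placeholders.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/long-jobs"  # placeholder

def handle(body):
    ...  # the actual long-running job

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        handle(msg["Body"])                      # process first...
        sqs.delete_message(QueueUrl=QUEUE_URL,   # ...delete only on success
                           ReceiptHandle=msg["ReceiptHandle"])
        # If handle() raises, the message is not deleted and becomes visible
        # again after the visibility timeout, so another worker can retry it.
```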
With maxDisruption=0%, pods should scale down only when all the jobs in the queue are processed, queueSize=0, and nothing is being processed at that moment by any worker.
Also, as @justjkk said: 1) did you check the visibility timeout? 2) check the node on which the scaled-down pod was running: did it shut down at the same time? If yes, then the disruption is not under the WPA autoscaler's control. You may want to use on-demand nodes for such workers instead of spot nodes.
Can you share the WPA logs for this queue after setting -v=4 verbosity?
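As a quick way to check the scale-down condition described above (queueSize=0 and nothing in flight), you can read the queue's approximate counts. A rough stand-alone sketch with a placeholder queue URL, not WPA code:

```python
# Rough stand-alone check (not WPA code): is the queue fully drained, i.e. no
# messages waiting and none currently being processed (in flight)?
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/long-jobs"  # placeholder

attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages",
                    "ApproximateNumberOfMessagesNotVisible"],
)["Attributes"]

visible = int(attrs["ApproximateNumberOfMessages"])              # waiting in the queue
in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])  # received, not yet deleted

print(f"visible={visible} in_flight={in_flight}")
# Per the comment above, a full scale-down with maxDisruption="0%" is expected
# only when both counts are (and stay) at 0.
```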
I will check these things and get back, thank you.
We have deployed WPA on AWS EKS and are using it with some success. We have typically tried running with a maximum of 10 pods (replicas) and all went well. Now we have increased this to 50, and we are noticing pods randomly quitting for no discernible reason. It's like there is a service that is shutting down a pod midway.
Does WPA kill pods midway as the queue length decreases? We have long-running tasks that need around 1-2 hours to process, so we need the pods to finish completely and then quit. WPA should only be responsible for scaling up; for scaling down, it should not kill existing running pods. Is this what you mean by "disruption"? Does "disruption" mean you will:
A. Kill running pods
B. Not create additional pods and leave the running pods on
We are running a Python app on EKS; this is the YAML for the app:
This is the current WPA config:
And the other WPA config:
Any help would be greatly appreciated in resolving this issue of pods randomly shutting down midway through processing.