Open · prasadkris opened this issue 2 years ago
Hi there, what you are seeing is the expected behavior when a node drains.
To reduce the draining timeout, you could override `terminationGracePeriodSeconds`.
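For example, here is a rough sketch of that override as a strategic-merge patch on the deployment's pod template (the 600-second figure and the patch approach are assumptions for illustration, not chart defaults):

```yaml
# Illustrative only: cap how long a draining livekit-server pod can sit in
# Terminating before Kubernetes force-kills it. Pick a value that matches
# how long you are willing to let existing rooms live on.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600  # assumed value, not the chart default
```

This could be applied with `kubectl patch` against the livekit-server deployment, or set through the chart's values if it exposes an equivalent key.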
I didn't know that Kubernetes counted terminating pods as part of your HPA. That seems strange.
Greetings,
We have a LiveKit setup (v1.2.0) running on GKE, deployed with the Helm charts using values identical to the ones furnished in server-gke.yaml, with the following HPA rule, and the setup works great in general 👍
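For reference, a sketch of the kind of HPA rule involved; this is illustrative only. The real rule follows server-gke.yaml, and the object names, API version, and `minReplicas` below are assumptions, while the 70% CPU target and the maximum of 3 replicas match what is described further down:

```yaml
# Illustrative HPA: scale the livekit-server deployment on CPU utilization.
# Only the 70% target and maxReplicas of 3 are taken from this issue;
# everything else (names, minReplicas) is assumed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: livekit-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: livekit-server
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```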
There were a couple of issues recently, and upon investigation we noticed that the pods removed by the autoscaler's scale-down operation (triggered when CPU < 70%) get stuck in a `Terminating` state for a long duration, even though they are removed from the service endpoints. I believe this is because LiveKit waits for all participants to leave their rooms before completing the scale-down, as indicated in the docs.

This works fine most of the time, but the `Running` pods can get exhausted if many participants try to join new rooms while other nodes are in a `Terminating` state, as seen in the example below; the autoscaler can't provision another pod in this case because it has already reached the maximum limit (3).

I could delete the `Terminating` pods using the `--force --grace-period 0` arguments if I am fine with losing the old rooms in the terminating pods, but then the livekit process continues to run on the node indefinitely and we have to clear it from the GKE node manually to make room for a fresh pod on that node.

I am aware that this is how the scale-down operation works with LiveKit and there isn't much we can do to prevent it, but I am wondering if you have any recommendations or suggestions on this! Thanks! 🙏