livekit / livekit

End-to-end stack for WebRTC. SFU media server and SDKs.
https://docs.livekit.io
Apache License 2.0

LiveKit pod stuck in Terminating state during scale-down event. #1032

Open prasadkris opened 2 years ago

prasadkris commented 2 years ago

Greetings,

We have a LiveKit setup (v1.2.0) running on GKE, deployed with the Helm charts using values identical to those furnished in server-gke.yaml, with the following HPA rule, and the setup works great in general 👍

k get hpa
NAME                     REFERENCE                           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
livekit-livekit-server   Deployment/livekit-livekit-server   68%/70%   1         3        1          28d
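
(For reference, the imperative equivalent of that rule would be something like the below; in our case it actually comes from the Helm values, so this is just a sketch:)

k autoscale deployment livekit-livekit-server --cpu-percent=70 --min=1 --max=3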

There were a couple of issues recently, and upon investigation we noticed that pods removed by the autoscaler's scale-down operation (triggered when CPU < 70%) get stuck in a Terminating state for a long time, even though they are removed from the service endpoints. I believe this is because LiveKit waits for all participants to leave their rooms before shutting down, as indicated in the docs.

k get pods
NAME                                      READY   STATUS        RESTARTS   AGE
livekit-livekit-server-66bc76cdd9-ft97l   1/1     Terminating   0          7h57m
livekit-livekit-server-66bc76cdd9-tnpjd   1/1     Running       0          25h

k describe svc livekit-livekit-server | grep -i endpoints 
Endpoints:                10.158.15.248:7880 
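
You can also confirm the drain window configured on a stuck pod; a sketch, using one of the pod names from the listing above:

k get pod livekit-livekit-server-66bc76cdd9-ft97l -o jsonpath='{.spec.terminationGracePeriodSeconds}'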

This works fine most of the time, but the Running pod can get exhausted if many participants try to join new rooms while the other pods are in a Terminating state, as in the example below. The autoscaler can't provision another pod in this case because it has already reached the maximum replica count (3):

livekit-livekit-server-66bc76cdd9-ft97l   1/1     Terminating   0          8h19m
livekit-livekit-server-66bc76cdd9-tnpjd   1/1     Running       0          25h
livekit-livekit-server-84cb9f7994-tc8mh   1/1     Terminating   0          93m

k get hpa
NAME                     REFERENCE                           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
livekit-livekit-server   Deployment/livekit-livekit-server   94%/70%   1         3        3          28d

I could delete the Terminating pods with the --force --grace-period 0 arguments if I am fine with losing the old rooms in those pods, but that leaves the livekit process running on the node indefinitely, and we have to clear it manually from the GKE node to make room for a fresh pod there; otherwise a new pod on the same node fails to bind its ports:

2022-09-23T00:44:57.245Z    INFO    livekit service/wire_gen.go:165 using multi-node routing via redis  {"sentinel": false, "addr": "redis-headless.redis.svc.cluster.local:6379"}
listen tcp :7881: bind: address already in use
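
For reference, the force delete plus the manual cleanup look roughly like this (the process name here is an assumption based on what we see on the node):

k delete pod livekit-livekit-server-66bc76cdd9-ft97l --force --grace-period=0
# then, on the affected GKE node:
sudo pkill -f livekit-server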

I am aware that this is how the scale-down operation works with livekit and there is not much we can do to prevent it, but I am wondering if there are any recommendations/suggestions from you guys on this! thanks! 🙏

davidzhao commented 2 years ago

Hi there, what you are seeing is the expected behavior when a node drains.

To reduce the draining timeout, you could override terminationGracePeriodSeconds.
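
e.g. something like this if you want to patch it directly rather than going through the chart values; 300 is just an illustration, pick whatever drain window you can tolerate:

kubectl patch deployment livekit-livekit-server -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":300}}}}'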

I didn't know that Kubernetes counted terminating pods as part of your HPA. That seems strange.