replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0
7.45k stars 512 forks

Container suddenly stopping without explicit reason #1652

Open · tontan2545 opened 1 month ago

tontan2545 commented 1 month ago

Hi, I've been running a particular model in Kubernetes using Cog. Whenever we have a high workload (4-5 predictions in the queue), the Cog model seems to stop without logging a reason. We initially thought this was a memory issue, but upon further investigation we found that we still had plenty of memory left, so that doesn't seem to be the cause. It would be great if you could offer any hypotheses on this issue; I'm looking forward to following up on them.

Here's an example of the log. Keep in mind that we have multiple replicas running and we are displaying logs from every pod.

Note: There are no cog.server.runner exception logs at all, just a plain shutdown from the Cog HTTP server.

(Screenshot of pod logs attached: "Screenshot 2567-05-09 at 01 09 13")
mattt commented 2 weeks ago

Hi @tontan2545. Sorry to hear that you're having this problem in production. Other than an OOM, the only other cause that comes to mind is the server handling an explicit POST /shutdown request or SIGTERM. Could this be the process being killed by an autoscaler?

tontan2545 commented 2 weeks ago

Hi @mattt, thanks for the reply. That's a sound guess, but I configured my NGINX so that only requests to the /predictions path are allowed through to the pod in k8s. As for SIGTERM, it would be great if you could guide me on how to tell whether the k8s pod receives one, since there's no log of that in kube-system or in the pod itself.
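One way to check for SIGTERM delivery from inside the container is to register a signal handler that logs before the process exits. This is a minimal sketch, not Cog-specific: whether a handler registered in your own code coexists with Cog's HTTP server (which may install its own SIGTERM handling) is an assumption you'd need to verify.

```python
import signal
import sys

# Flag so other code (or a test) can observe that the signal arrived.
received = {"sigterm": False}

def log_sigterm(signum, frame):
    received["sigterm"] = True
    # Write to stderr with flush so the line reaches `kubectl logs`
    # even if the pod is torn down moments later.
    print("Received SIGTERM; pod is being terminated", file=sys.stderr, flush=True)

# Install the handler in the main thread at startup (e.g. in setup()).
# Note: this may be overridden if the serving framework installs its own.
signal.signal(signal.SIGTERM, log_sigterm)
```

If a line like this shows up in the pod logs right before the shutdown, that points at an external actor (autoscaler, eviction, rolling update) sending SIGTERM rather than a crash inside the model.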

mattt commented 2 weeks ago

@tontan2545 This FAQ has some good background information about how autoscalers work in k8s generally.

You can monitor autoscaler events with a command like:

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=10
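Besides the autoscaler logs, the pod's own last termination state often names the cause directly. A sketch with a placeholder pod name (substitute your own); exit code 137 usually indicates SIGKILL (e.g. OOM), 143 indicates SIGTERM:

```shell
# Show the container's last termination reason and exit code
kubectl describe pod <pod-name> | grep -A 5 "Last State"

# List recent cluster events (evictions, scale-downs, OOM kills),
# oldest first
kubectl get events --sort-by=.metadata.creationTimestamp
```

A "Last State" of Terminated with reason OOMKilled would rule the memory hypothesis back in; an eviction or scale-down event would support the autoscaler hypothesis.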