cc @arnim
The number of pending pods on the GESIS server has decreased. In the last 24 hours, we had:
My impression is that the pending pods are waiting for an image to become available, because the image needs to be built on our server, pushed to Docker Hub, and downloaded again to our server.
The peak of pending pods coincides in time with the peak of build pods.
Do you think we need to add an additional limit to BinderHub for the number of pending spawns, to prevent too many pods being created or queued up?
My hypothesis is that BinderHub receives a launch request and allocates a new Kubernetes pod for the launch. Because the image required by the pod does not exist yet, the pod goes into the Pending state, and BinderHub adds a build request to the queue. During some periods, the number of new image builds exceeds the GESIS server's capacity and the pending pods start to accumulate.
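One way to check this hypothesis on the cluster would be to look at which pending pods are still waiting for their image. A minimal sketch, assuming the user pods run in a namespace named gesis (the real namespace may differ):
$ kubectl get pods -n gesis --field-selector=status.phase=Pending
# Pods whose image is still being pulled stay in phase Pending (shown as ContainerCreating).
$ kubectl describe pod <some-pending-pod> -n gesis | grep -A 10 Events
# The Events section shows whether the pod is unscheduled or waiting on an image pull.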
I'm still puzzled about why we have big peaks of pending pods.
I can understand a few (fewer than 10) pods pending because of the network, for example when the image is larger than usual. I don't understand almost 40 pods pending at the same time.
I checked
and I did not find any correlation.
@manics @sgibson91 do you have any clue about where I should look for a correlation? Thanks!
Is a particular spike of pods related to the same repository?
Assuming your Prometheus labels are the same as on Curvenote, try this Prometheus query for a time window around the spike:
sum(label_replace(kube_pod_status_phase{phase="Pending",pod=~"jupyter-.*"}, "repo", "$1", "pod", "jupyter-(.+)-[^-]+")) by (repo)
E.g. https://prometheus.binder.curvenote.dev/graph?g0.expr=sum(label_replace(kube_pod_status_phase%7Bphase%3D%22Pending%22%2Cpod%3D~%22jupyter-.*%22%7D%2C%20%22repo%22%2C%20%22%241%22%2C%20%22pod%22%2C%20%22jupyter-(.%2B)-%5B%5E-%5D%2B%22))%20by%20(repo)&g0.tab=0&g0.display_mode=lines&g0.show_exemplars=0&g0.range_input=30m&g0.end_input=2024-08-01%2010%3A30%3A00&g0.moment_input=2024-08-01%2010%3A30%3A00
Thanks @manics.
Is a particular spike of pods related to the same repository?
Yes, there is.
A large number of the pending pods were for the same repository. My assumption is that someone is giving a lecture or course: 10+ learners access mybinder.org at the same time, and because the server does not have the Docker image cached, all 10+ users stay in the Pending status until the Docker image is downloaded.
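For future spikes, a quick way to confirm this from the command line (a sketch, assuming the user pods run in the gesis namespace and each pod has a single container) is to count pending pods per image:
$ kubectl get pods -n gesis --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}' | sort | uniq -c | sort -rn
# One image dominating the output matches the "one course, many learners" scenario.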
I'm closing this as "won't fix".
cc @arnim
@arnim This is my current understanding of the problem:
The short video highlights one pod in the scenario described above: https://github.com/user-attachments/assets/131cda9f-c9c6-4195-b219-0d7a48e217ff
You're right... easily reproducible:
$ kubectl run xxx --image=quay.io/jupyter/datascience-notebook
pod/xxx created
$ kubectl get pod
NAME   READY   STATUS              RESTARTS   AGE
xxx    0/1     ContainerCreating   0          7s
$ kubectl delete pod xxx
# Takes ages.... 'kubectl get events --watch' shows the image is pulled before it's deleted
pod "xxx" deleted
If the image doesn't exist, then you should get Error: ImagePullBackOff and it should be possible to delete the pod straight away.
I found this related upstream issue: https://github.com/kubernetes/kubernetes/issues/121435
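In the meantime, a possible workaround for pods stuck like this (only a sketch, not something we have adopted in the deployment) is to force-delete them, which removes the pod object immediately even though the kubelet may still finish the image pull in the background:
$ kubectl delete pod xxx --grace-period=0 --force
# Removes the pod from the API server without waiting for graceful termination.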
I'm closing this in favour of https://github.com/jupyterhub/mybinder.org-deploy/issues/3056.
This started around 8:00 am CEST on 5 June 2024.
OVH
GESIS
kubectl get -n gesis pods --sort-by='{.metadata.creationTimestamp}' | grep Terminating
produces a list of pods stuck in Terminating. This is because the GESIS server is not able to download the Docker images fast enough.
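To keep an eye on how many pods are affected, a simple variant of the command above (same namespace assumption) just counts them:
$ kubectl get -n gesis pods | grep -c Terminating
# Number of pods currently stuck in Terminating in the gesis namespace.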
CurveNote
Not affected yet.