consideRatio opened this issue 1 year ago
It seems that a lot of pods are stuck pulling their image, without either erroring or succeeding. Even pods in a Terminating state aren't terminating, because they are still stuck pulling.
kubectl describe pod jupyter-binderhub-2dci-2dre-2dimal-2ddockerfile-2di190dym9
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m38s ovh2-user-scheduler Successfully assigned ovh2/jupyter-binderhub-2dci-2dre-2dimal-2ddockerfile-2di190dym9 to user-202211a-node-6f699a
Normal Pulled 2m38s kubelet Container image "jupyterhub/mybinder.org-tc-init:2020.12.4-0.dev.git.4289.h140cef52" already present on machine
Normal Created 2m38s kubelet Created container tc-init
Normal Started 2m37s kubelet Started container tc-init
Normal Pulling 2m37s kubelet Pulling image "2lmrrh8f.gra7.container-registry.ovh.net/mybinder-builds/r2d-g5b5b759binderhub-2dci-2drepos-2dcached-2dminimal-2ddockerfile-c90b2b:596b52f10efb0c9befc0c4ae850cc5175297d71c"
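For reference, a rough way to survey which user pods are currently stuck in a container "waiting" state (ImagePullBackOff, ContainerCreating, etc.). The `ovh2` namespace is taken from the events above; this is just a sketch, not how the report above was generated:

```shell
# Show every pod in the namespace together with any container "waiting"
# reason; <none> means nothing is waiting on that pod.
kubectl get pods -n ovh2 \
  -o custom-columns='NAME:.metadata.name,WAITING:.status.containerStatuses[*].state.waiting.reason'
```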
The OVH Harbor registry appears to be having stability issues again, which I think is the root cause. I've contacted OVH support about it.
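If it helps with the support ticket, here is a quick external sanity check of the registry, assuming the standard Harbor v2 health API is exposed on the registry host (I haven't confirmed that OVH exposes it):

```shell
# Ask Harbor for its overall health and per-component status.
curl -s https://2lmrrh8f.gra7.container-registry.ovh.net/api/v2.0/health | jq .
```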
I think we should consider moving the OVH deployment to an external registry, e.g. quay.io. Downside: the images would be public, so we would need to be more proactive about cleaning them up and better support deletion requests, because statements like "if you unpublish the ref, your files are inaccessible" are not true at all if the build cache is public.
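For discussion, a rough sketch of the BinderHub chart values such a move would touch (how it nests in the mybinder.org-deploy config aside). The quay.io organization and robot-account credentials below are placeholders, nothing like this exists yet:

```yaml
# Sketch: point the OVH deployment's image builds/pulls at quay.io
# instead of the OVH Harbor registry. All names/credentials are placeholders.
binderhub:
  registry:
    url: https://quay.io
    username: "mybinder+builder"        # quay.io robot account (placeholder)
    password: "<robot-account-token>"   # placeholder secret
  config:
    BinderHub:
      use_registry: true
      image_prefix: quay.io/mybinder-builds/r2d-
```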