betatim opened 5 years ago
should we add the OVH folks to a jupyterhub team so that we can ping them for questions like these? Or add to the mybinder.org-operators team?
mybinder.org ops team sounds like a good idea!
I had assumed some of them were watching the repo already.
cc @mael-le-gal and @jagwar
Indeed, the differences between the two dashboards are surprising ... I don't have an explanation for now.
Watching pods and their logs, it seems that we get a "pulling image" event for things like the tc_init
container. However, the pull is "super fast" and it seems very unlikely that the node didn't already have this image. Makes me wonder if there is a small difference in k8s versions or something like that where it always emits a "pulling" event even if the image is already present.
This afternoon the pull of the minrk/tc-init:0.0.4 container
took too much time and caused the jupyterhub pod to fail several times after too many launch retries.
Just a small investigation:

```shell
# Executing inside the container
kubectl exec --namespace='ovh' ovh-dind-6vtwq -c dind -it sh
# Manually pull the image
docker -H unix:///var/run/dind/docker.sock pull minrk/tc-init:0.0.4
```
After doing that the image is present on the host:

```shell
# List images
docker -H unix:///var/run/dind/docker.sock images
REPOSITORY          TAG
minrk/tc-init       0.0.4
```
After waiting some time and running the same command again, the image seems to have disappeared.
I saw that there are some pods named ovh-image-cleaner-***
. Could they be responsible for that deletion?
The dockerd inside the ovh-dind-...
pods is only used to build new docker images. So the majority of pulls of minrk/tc-init
should happen on the node's actual dockerd (the one k8s uses).
However, I think you found the problem. On GKE we run our nodes with two disks and store all the images for the DIND dockerd on a second disk. That's because k8s itself does some cleanup (of images created by k8s doing its thing) and we do some cleanup (of images we create in our DIND). When we shared a disk between DIND and k8s, the two garbage collectors would get in each other's way and run constantly, because neither could find anything to delete that would drop usage below the limit.
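The interference described above can be sketched as a toy simulation (hypothetical numbers and names, not the real cleaner or kubelet code): each garbage collector measures fullness of the whole partition but can only delete images owned by its own dockerd, so on a shared disk it may delete everything it owns and still be over the threshold.

```python
# Toy model of two image garbage collectors sharing one partition.
# All numbers are made up for illustration.

def disk_percent_used(mine: int, others: int, total: int) -> float:
    """df-style reading: fullness of the whole partition,
    regardless of which dockerd owns the bytes."""
    return 100 * (mine + others) / total

def run_gc(mine: int, others: int, total: int, threshold: float) -> int:
    """Delete my own images (1 GB each, say) until the partition
    drops below the threshold -- or I have nothing left to delete."""
    while disk_percent_used(mine, others, total) > threshold and mine > 0:
        mine -= 1  # delete one of my images
    return mine

# Shared 100 GB disk: my dockerd owns 10 GB, the other dockerd 85 GB,
# cleanup threshold 80%. I delete everything I own and the disk is
# STILL over the threshold, so I run again on the next cycle.
assert run_gc(mine=10, others=85, total=100, threshold=80) == 0
assert disk_percent_used(0, 85, 100) > 80

# Separate disks: the same GC sees only its own images and succeeds.
assert run_gc(mine=85, others=0, total=100, threshold=80) == 80
```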
If you kubectl describe a dind pod on GKE you'll see something like:
```
Volumes:
  dockerlib-dind:
    Type:          HostPath (bare host directory volume)
    Path:          /mnt/disks/ssd0/dind
    HostPathType:  DirectoryOrCreate
```
which is the extra disk we mount. If you describe one of the image cleaner pods, it mounts the same disk:

```
  dockerlib-dind:
    Type:          HostPath (bare host directory volume)
    Path:          /mnt/disks/ssd0/dind
    HostPathType:  DirectoryOrCreate
```
Both pods mount this to /var/lib/docker
. If I describe an ovh-dind-...
pod it uses /var/lib/dind
, which I think means that directory is on the same partition as the docker image storage of k8s. And I think both the image cleaner and the k8s image GC do something like df -h
to find out how full the disk is and then try to clean up to make space (which fails because they can't delete images controlled by the other dockerd).
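For reference, a minimal sketch of the df-style check described here (the real image cleaner and the k8s GC may compute this differently): the reading is per-partition, so if /var/lib/dind shares the root partition, it counts the k8s dockerd's images too.

```python
import shutil

def partition_percent_used(path: str) -> float:
    """df-style reading: usage of the partition containing `path`,
    no matter which process or dockerd owns the bytes on it."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

# On GKE, /mnt/disks/ssd0/dind is its own partition, so the number
# reflects only DIND's images; on OVH, /var/lib/dind would report the
# fullness of the shared root disk instead.
print(f"{partition_percent_used('/'):.1f}% of the root partition is used")
```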
If it is possible, I think the simplest thing to do is to swap the nodes for ones with a second disk that we can use for DIND.
@minrk spent a lot of time poking around image cleaning and races between the two dockerds so he might have some ideas as well.
Is there a reason for using a second disk, or could it be on the same disk but in a separate directory?
I don't know. Two possible reasons come to mind: performance (the SSD is faster than the boot disk) and simplicity in computing how full the disk is.
We should investigate why on OVH we have so many more docker pulls (according to the grafana dashboard):
https://grafana.mybinder.org/d/nDQPwi7mk/node-activity?refresh=1m&panelId=29&fullscreen&orgId=1&var-cluster=OVH%20Prometheus vs https://grafana.mybinder.org/d/nDQPwi7mk/node-activity?refresh=1m&panelId=29&fullscreen&orgId=1&var-cluster=prometheus