jupyterhub / mybinder.org-deploy

image cleaning history #773

Open minrk opened 5 years ago

minrk commented 5 years ago

Writing down some history of image cleaning. This should probably go somewhere in our collected docs, but I'm not sure where, so I'm writing it down here.

Background

Binder is constantly building and launching new images, which fills up each node's disk. With current traffic, a 1TB disk fills up in ~3-4 days. Kubernetes has its own ImageGC for cleaning up unused images when the disk gets close to full; the default threshold for this is 90%.

This is complicated by the fact that there are two Docker daemons running on each node: the host docker, which runs user pods etc., and the docker-in-docker (dind) daemon we use to build user images. That means there are two totally disconnected collections of docker images filling up each node's disk. If we never clean up our dind images, Kubernetes' internal ImageGC will clean up images on the host docker as best it can, but the dind images will keep growing and eventually we will hit disk-pressure events, causing mass evictions, etc.

Our image cleaner

To address this, we have an image-cleaner service that aims to mimic Kubernetes' own ImageGC, cleaning up images on the dind and/or host docker when the disk starts to get full. However, we've had recurring problems with it: it either fails to clean the right things or it causes downtime.
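
For context, what the cleaner does is conceptually the same as the kubelet's ImageGC: watch disk usage and, once it crosses a high-water mark, delete least-recently-used images until usage falls back below a low-water mark. Below is a rough sketch of that loop using the docker Python SDK; the thresholds, socket path, and "oldest image first" ordering are illustrative assumptions, not the actual image-cleaner implementation.

```python
import shutil
import time

import docker

# Illustrative values only; the real cleaner is configured per docker daemon
# (host or dind), each with its own socket and thresholds.
DOCKER_SOCKET = "unix:///var/run/docker.sock"
PATH_TO_CHECK = "/var/lib/docker"   # the filesystem whose usage we watch
HIGH, LOW = 0.80, 0.60              # start/stop cleaning thresholds


def disk_usage_fraction(path):
    """Fraction of the disk containing `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def remove_oldest_image(client):
    """Remove the oldest image; return False if there was nothing to remove."""
    images = sorted(client.images.list(), key=lambda img: img.attrs["Created"])
    if not images:
        return False
    client.images.remove(images[0].id, force=True)
    return True


client = docker.DockerClient(base_url=DOCKER_SOCKET)
while True:
    if disk_usage_fraction(PATH_TO_CHECK) > HIGH:
        # Delete images until we are back below the low-water mark,
        # or until there is nothing left to delete.
        while disk_usage_fraction(PATH_TO_CHECK) > LOW:
            if not remove_oldest_image(client):
                break
    time.sleep(60)
```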

Problems with our image cleaner

Past solutions

We've tried various approaches to the disk-full issue in the past, and none has quite worked the way we want.

Past solution 1: cull old nodes

Early on, we could hit an unrecoverable exhaustion of inodes. Our strategy was to cull all nodes older than a few days, avoiding image-cleaning events altogether, because inode exhaustion has more dire consequences than regular disk-full events. After updating to GKE 1.10, inode exhaustion went away, and we now run out of regular blocks instead. Long-running nodes have both cost and performance benefits, so we would like to avoid culling them if we can; we stopped doing this with the GKE 1.10 update.

Past solution 2: only clean dind

Our image cleaner running on the host docker was causing nodes to become unavailable, so we switched to cleaning only dind images. This, however, has the consequence that only dind images ever get cleaned up, because the Kubernetes ImageGC threshold never gets hit. It has also resulted in failing builds on nodes with a full disk, as the image cleaner deletes images too aggressively while the disk is filling up.

Past solution 3: only clean host (current)

In the current state of mybinder.org, our own image cleaning is completely disabled, because we haven't been able to trust it not to cause problems. The downside is that dind images must be cleaned out manually.

Manually cleaning dind in this way effectively restores the behavior of setting the dind volume to emptyDir, which is always a fresh, empty directory each time the dind pod starts. The biggest downside is that it isn't automated: without human operator intervention, there will be service interruptions when a node's disk fills up.

Proposed solutions

The core of the problem is independent image-cleaning behavior operating on a shared resource. Below are two proposed solutions: one makes the resources truly independent to avoid interference, and the other makes the cleaners truly coordinated.

Solution 1: independent image-cleaning, docker on separate disks

The coordination problem should go away if the dind images live on a different disk from the host images, so that cleaning dind images would have no effect on the cleaning threshold of host images, and vice versa. They would be wholly independent. This should have the most stable, simple behavior, with the one potentially significant deployment caveat that nodes must have two disks to accomplish it. The current proposal is to use GKE local-ssds for this.
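
For concreteness, a minimal sketch of what this would mean for the dind pod spec, expressed with the Python kubernetes client. The paths and names are assumptions on my part (GKE typically mounts the first local SSD at /mnt/disks/ssd0 on the node), not the actual mybinder.org config.

```python
from kubernetes import client

# Hypothetical sketch: dedicate a GKE local SSD to dind's image storage,
# so dind image usage never competes with the host docker's boot disk.
local_ssd = client.V1Volume(
    name="dind-storage",
    host_path=client.V1HostPathVolumeSource(path="/mnt/disks/ssd0/dind"),
)

dind_storage_mount = client.V1VolumeMount(
    name="dind-storage",
    mount_path="/var/lib/docker",  # where dind keeps its images
)
# These would be added to the existing dind daemonset's pod spec; the host
# docker keeps using the node's boot disk, so cleaning one set of images no
# longer affects the disk the other set lives on.
```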

Solution 1.b: stateful sets and persistent volumes

This achieves the same goal as above, but through a switch from daemonsets to stateful sets, which can use persistent volume claim templates to mount their own disks. It should be more portable than local disks, since it relies on Kubernetes rather than GKE, but it is a more complicated change to BinderHub, because stateful sets don't have the same 1:1 pod:node behavior as daemonsets.
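
A minimal sketch of the mechanism, again with the Python kubernetes client and hypothetical names and sizes (the real dind spec lives in the mybinder.org charts and would differ): each pod in the stateful set gets its own PersistentVolumeClaim for the docker storage directory, so dind images never share a disk with the host docker.

```python
from kubernetes import client

# Hypothetical names and sizes; the key piece is volume_claim_templates,
# which gives every pod its own PVC mounted at dind's storage directory.
dind_container = client.V1Container(
    name="dind",
    image="docker:dind",
    security_context=client.V1SecurityContext(privileged=True),
    volume_mounts=[
        client.V1VolumeMount(name="dind-storage", mount_path="/var/lib/docker"),
    ],
)

dind_statefulset = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="dind"),
    spec=client.V1StatefulSetSpec(
        service_name="dind",
        selector=client.V1LabelSelector(match_labels={"app": "dind"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "dind"}),
            spec=client.V1PodSpec(containers=[dind_container]),
        ),
        volume_claim_templates=[
            client.V1PersistentVolumeClaim(
                metadata=client.V1ObjectMeta(name="dind-storage"),
                spec=client.V1PersistentVolumeClaimSpec(
                    access_modes=["ReadWriteOnce"],
                    resources=client.V1ResourceRequirements(
                        requests={"storage": "500Gi"},
                    ),
                ),
            ),
        ],
    ),
)
```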

Solution 2: more coordinated image-cleaning

If we don't want to, or can't, use separate disks, we would need to write an image cleaner that truly cooperatively culls from both sources. That means the cleaners can't each use a pure disk-usage threshold; they have to set a shared flag so that when one cleaner starts culling, the other does as well. This is more complex, but it should result in cleaning images from both sources, no matter which one is full, avoiding the starvation we are seeing now.
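
One possible shape for that coordination, sketched with made-up paths and thresholds: each cleaner watches its own disk usage, but before it starts deleting it touches a flag file on a volume both cleaners mount, and it also starts deleting whenever it sees the flag raised by its peer.

```python
import os
import shutil
import time

# Hypothetical paths and thresholds; a real implementation would also need to
# handle races between the two cleaners and stale flags left by crashed runs.
FLAG_FILE = "/shared/cleaning-in-progress"   # on a volume mounted by both cleaners
PATH_TO_CHECK = "/var/lib/docker"            # the disk this cleaner watches
HIGH, LOW = 0.80, 0.60


def disk_usage_fraction(path):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def clean_until_below_low_water(low):
    """Placeholder for the deletion loop sketched earlier: remove
    least-recently-used images until disk usage drops below `low`."""


while True:
    if disk_usage_fraction(PATH_TO_CHECK) > HIGH or os.path.exists(FLAG_FILE):
        # Raise the flag so the peer cleaner joins in, even if its own images
        # didn't push the shared disk over the threshold.
        open(FLAG_FILE, "w").close()
        clean_until_below_low_water(LOW)
        # Lower the flag once we are done on our side.
        if os.path.exists(FLAG_FILE):
            os.remove(FLAG_FILE)
    time.sleep(60)
```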

betatim commented 5 years ago

What if repo2docker removed the image it built after pushing it to the registry?

In the extreme, this would mean no sharing of layers between builds, so we'd rebuild the early layers of a repo2docker image over and over again. I'm wondering how we could organise things so that we keep those layers but generally have each build clean up after itself.

Could we pre-populate the image store used by r2d with some images when a new node starts up, and then have each build remove the image it created? I think that because the early layers are referenced by the pre-populated images, they wouldn't be GC'ed.

Could we specify --cache-from=image-name:latest to have r2d fetch a previously built image from the BinderHub registry and use it as a cache? I'm not sure whether that image would survive a docker rmi image-name:specific-tag, or whether the build turns into a no-op if all the layers are already present.

Variation on the previous idea: we build a gigantic image that has "all the possible early layers" (how much variation is there? Install RStudio, miniconda, ..?) in it and always use that as the --cache-from image.

"early layers" == the layers that come before we copy in the contents of the repository.