jupyterhub / mybinder.org-deploy

image cleaning history #773

Open minrk opened 5 years ago

minrk commented 5 years ago

Writing down some history of image cleaning. This should probably go somewhere in our collected docs, but I'm not sure where, so I'm writing it down here.

Background

Binder is constantly building and launching new images, which fills up each node's disk. With current traffic, a 1TB disk fills up in ~3-4 days. Kubernetes has its own ImageGC for cleaning up unused images when the disk gets close to full; the default threshold for this is 90%.

This is complicated by the fact that there are two Docker daemons running on each node: the host docker, which runs user pods etc., and the docker-in-docker (dind) daemon we use to build user images. That means there are two totally disconnected collections of docker images filling up each node's disk. If we never clean up our dind images, Kubernetes' internal ImageGC will clean up images on the host docker as best it can, but the dind images will keep growing and eventually we will hit disk-pressure events, causing mass evictions, etc.

Our image cleaner

To address this, we have an image-cleaner service that aims to mimic Kubernetes' own ImageGC, cleaning up images on the dind and/or host docker when the disk starts to get full. However, we've had recurring problems with it: it either fails to clean the right things or it causes downtime.
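
For context, what the cleaner does is conceptually the same as the kubelet's ImageGC: watch disk usage and, once it crosses a high-water mark, delete least-recently-used images until usage falls back below a low-water mark. Below is a rough sketch of that loop using the docker Python SDK; the thresholds, socket path, and "oldest image first" ordering are illustrative assumptions, not the actual image-cleaner implementation.

```python
import shutil
import time

import docker

# Illustrative values only; the real cleaner is configured per docker daemon
# (host or dind), each with its own socket and thresholds.
DOCKER_SOCKET = "unix:///var/run/docker.sock"
PATH_TO_CHECK = "/var/lib/docker"   # the filesystem whose usage we watch
HIGH, LOW = 0.80, 0.60              # start/stop cleaning thresholds


def disk_usage_fraction(path):
    """Fraction of the disk containing `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def remove_oldest_image(client):
    """Remove the oldest image; return False if there was nothing to remove."""
    images = sorted(client.images.list(), key=lambda img: img.attrs["Created"])
    if not images:
        return False
    client.images.remove(images[0].id, force=True)
    return True


client = docker.DockerClient(base_url=DOCKER_SOCKET)
while True:
    if disk_usage_fraction(PATH_TO_CHECK) > HIGH:
        # Delete images until we are back below the low-water mark,
        # or until there is nothing left to delete.
        while disk_usage_fraction(PATH_TO_CHECK) > LOW:
            if not remove_oldest_image(client):
                break
    time.sleep(60)
```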

Problems with our image cleaner

Past solutions

We've tried various approaches to the disk-full issue in the past, and none has quite worked the way we want.

Past solution 1: cull old nodes

Early on, we could hit an unrecoverable exhaustion of inodes. Our strategy was to cull all nodes older than a few days, avoiding image-cleaning events altogether, because inode exhaustion has more dire consequences than regular disk-full events. After updating to GKE 1.10, inode exhaustion went away, and we now run out of regular blocks instead. Long-running nodes have both cost and performance benefits, so we would like to avoid culling them if we can; we stopped doing this with the GKE 1.10 update.

Past solution 2: only clean dind

Our image cleaner running on the host docker was causing nodes to become unavailable, so we switched to cleaning only dind images. This, however, has the consequence that only dind images ever get cleaned up, because the Kubernetes ImageGC threshold never gets hit. It has also resulted in failing builds on nodes with a full disk, as the image cleaner deletes images too aggressively while the disk is filling up.

Past solution 3: only clean host (current)

In the current state of mybinder.org, our own image cleaning is completely disabled, because we haven't been able to trust it not to cause problems. The downside is that dind images must be cleaned out manually.

Manually cleaning dind in this way effectively restores the behavior of setting the dind volume to emptyDir, which is always a fresh, empty directory each time the dind pod starts. The biggest downside is that it isn't automated: without human operator intervention, there will be service interruptions when a node's disk fills up.

Proposed solutions

The core of the problem is independent image-cleaning behavior operating on a shared resource. Below are two proposed solutions: one makes the resources truly independent to avoid interference, and the other makes the cleaners truly coordinated.

Solution 1: independent image-cleaning, docker on separate disks

The coordination problem should go away if the dind images live on a different disk from the host images, so that cleaning dind images would have no effect on the cleaning threshold of host images, and vice versa. They would be wholly independent. This should have the most stable, simple behavior, with the one potentially significant deployment caveat that nodes must have two disks to accomplish it. The current proposal is to use GKE local-ssds for this.
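
For concreteness, a minimal sketch of what this would mean for the dind pod spec, expressed with the Python kubernetes client. The paths and names are assumptions on my part (GKE typically mounts the first local SSD at /mnt/disks/ssd0 on the node), not the actual mybinder.org config.

```python
from kubernetes import client

# Hypothetical sketch: dedicate a GKE local SSD to dind's image storage,
# so dind image usage never competes with the host docker's boot disk.
local_ssd = client.V1Volume(
    name="dind-storage",
    host_path=client.V1HostPathVolumeSource(path="/mnt/disks/ssd0/dind"),
)

dind_storage_mount = client.V1VolumeMount(
    name="dind-storage",
    mount_path="/var/lib/docker",  # where dind keeps its images
)
# These would be added to the existing dind daemonset's pod spec; the host
# docker keeps using the node's boot disk, so cleaning one set of images no
# longer affects the disk the other set lives on.
```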

Solution 1.b: stateful sets and persistent volumes

This achieves the same goal as above, but through a switch from daemonsets to stateful sets, which can use persistent volume claim templates to mount their own disks. It should be more portable than local disks, since it relies on Kubernetes rather than GKE, but it is a more complicated change to BinderHub, because stateful sets don't have the same 1:1 pod:node behavior as daemonsets.
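
A minimal sketch of the mechanism, again with the Python kubernetes client and hypothetical names and sizes (the real dind spec lives in the mybinder.org charts and would differ): each pod in the stateful set gets its own PersistentVolumeClaim for the docker storage directory, so dind images never share a disk with the host docker.

```python
from kubernetes import client

# Hypothetical names and sizes; the key piece is volume_claim_templates,
# which gives every pod its own PVC mounted at dind's storage directory.
dind_container = client.V1Container(
    name="dind",
    image="docker:dind",
    security_context=client.V1SecurityContext(privileged=True),
    volume_mounts=[
        client.V1VolumeMount(name="dind-storage", mount_path="/var/lib/docker"),
    ],
)

dind_statefulset = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="dind"),
    spec=client.V1StatefulSetSpec(
        service_name="dind",
        selector=client.V1LabelSelector(match_labels={"app": "dind"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "dind"}),
            spec=client.V1PodSpec(containers=[dind_container]),
        ),
        volume_claim_templates=[
            client.V1PersistentVolumeClaim(
                metadata=client.V1ObjectMeta(name="dind-storage"),
                spec=client.V1PersistentVolumeClaimSpec(
                    access_modes=["ReadWriteOnce"],
                    resources=client.V1ResourceRequirements(
                        requests={"storage": "500Gi"},
                    ),
                ),
            ),
        ],
    ),
)
```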

Solution 2: more coordinated image-cleaning

If we don't want to, or can't, use separate disks, we would need to write an image cleaner that truly cooperatively culls from both sources. That means the cleaners can't each use a pure disk-usage threshold; they have to set a shared flag so that when one cleaner starts culling, the other does as well. This is more complex, but it should result in cleaning images from both sources, no matter which one is full, avoiding the starvation we are seeing now.
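
One possible shape for that coordination, sketched with made-up paths and thresholds: each cleaner watches its own disk usage, but before it starts deleting it touches a flag file on a volume both cleaners mount, and it also starts deleting whenever it sees the flag raised by its peer.

```python
import os
import shutil
import time

# Hypothetical paths and thresholds; a real implementation would also need to
# handle races between the two cleaners and stale flags left by crashed runs.
FLAG_FILE = "/shared/cleaning-in-progress"   # on a volume mounted by both cleaners
PATH_TO_CHECK = "/var/lib/docker"            # the disk this cleaner watches
HIGH, LOW = 0.80, 0.60


def disk_usage_fraction(path):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def clean_until_below_low_water(low):
    """Placeholder for the deletion loop sketched earlier: remove
    least-recently-used images until disk usage drops below `low`."""


while True:
    if disk_usage_fraction(PATH_TO_CHECK) > HIGH or os.path.exists(FLAG_FILE):
        # Raise the flag so the peer cleaner joins in, even if its own images
        # didn't push the shared disk over the threshold.
        open(FLAG_FILE, "w").close()
        clean_until_below_low_water(LOW)
        # Lower the flag once we are done on our side.
        if os.path.exists(FLAG_FILE):
            os.remove(FLAG_FILE)
    time.sleep(60)
```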

betatim commented 5 years ago

What if repo2docker removed the image it built after pushing it to the registry?

In the extreme, this would mean no sharing of layers between builds, so we'd rebuild the early layers of a repo2docker image over and over again. I'm wondering how we could organise things so that we keep those layers but generally have each build clean up after itself.

Could we pre-populate the image store used by r2d with some images when a new node starts up, and then have each build remove the image it created? I think that because the early layers are referenced by the pre-populated images, they wouldn't be GC'ed.

Could we specify --cache-from=image-name:latest to have r2d fetch a previously built image from the BinderHub registry and use it as a cache? I'm not sure whether that image would survive a docker rmi image-name:specific-tag, or whether the build turns into a no-op if all the layers are already present.

Variation on the previous idea: we build a gigantic image that has "all the possible early layers" (how much variation is there? Install RStudio, miniconda, ..?) in it and always use that as the --cache-from image.

"early layers" == the layers that come before we copy in the contents of the repository.