tilt-dev / tilt

Define your dev environment as code. For microservice apps on Kubernetes.
https://tilt.dev/
Apache License 2.0
7.55k stars, 298 forks

Registry cleaning #2102

Open gaetansnl opened 5 years ago

gaetansnl commented 5 years ago

Hello, first I wanted to thank you for your work on Tilt. I tried multiple tools before, but most of them don't support custom scripts or such advanced sync capabilities... I can now rebuild dependencies on each server without a full server rebuild 😄

I build images locally and push them to a registry on another server, and I noticed that they are never removed. It seems they are also kept locally. So I'm afraid of exhausting the server's storage capacity, even though most of the time new images reuse previous images. I just got "Kubelet has disk pressure".

Is there a suggested approach to remove some of them? Or should I write a script manually?

Thank you

nicks commented 5 years ago

Thanks for the kind words!

I don't know of a way to do this right now. Most heavy users of Tilt use live_update (https://docs.tilt.dev/live_update_tutorial.html). It copies files and runs build commands in-place, so you don't run out of disk space copying around lots of short-lived images.

I'm sure a garbage collector would help others. But I'm not totally sure what it might look like. Maybe something that deletes all image tags with the "tilt-" prefix, then runs the registry garbage collector?
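For illustration, the idea above could be sketched roughly like this. It is not something Tilt does today. It assumes a registry reachable at a hypothetical `localhost:5000`, that the registry has deletion enabled, and that `curl` and `jq` are installed. The function names are made up for this example; the endpoints are the standard registry HTTP API v2 ones.

```shell
# Sketch only (bash): REGISTRY and both function names are hypothetical.
REGISTRY="${REGISTRY:-localhost:5000}"

# Read tag names on stdin, print only those starting with "tilt-".
filter_tilt_tags() {
  grep '^tilt-' || true
}

# Delete every tilt-* tag of one repository through the registry HTTP API.
# Note: deleting a manifest by digest removes all tags that point at it.
delete_tilt_tags() {
  local repo="$1" tag digest
  curl -fsS "http://${REGISTRY}/v2/${repo}/tags/list" \
    | jq -r '.tags // [] | .[]' \
    | filter_tilt_tags \
    | while read -r tag; do
        # The manifest digest comes back in the Docker-Content-Digest header.
        digest=$(curl -fsSI \
          -H 'Accept: application/vnd.docker.distribution.manifest.v2+json, application/vnd.oci.image.manifest.v1+json' \
          "http://${REGISTRY}/v2/${repo}/manifests/${tag}" \
          | tr -d '\r' \
          | awk 'tolower($1) == "docker-content-digest:" { print $2 }')
        if [ -n "$digest" ]; then
          curl -fsS -X DELETE "http://${REGISTRY}/v2/${repo}/manifests/${digest}"
        fi
      done
}
```

Calling `delete_tilt_tags my-app` (the repository name is a placeholder) would only unlink the manifests. The disk space comes back only after the registry's own garbage collector has run inside the registry container.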

gaetansnl commented 5 years ago

I use live_update, but a new image is created when I restart Tilt, or when I change package.json, for example. I know Skaffold prunes images, but I'm not sure how it does the pruning, because I still have a lot of Skaffold images.

I was thinking it would be nice to have, at any time, only one tag per docker_build(), and to not remove it when Tilt is closed. So on the next start: 1) the cache is still there 2) old tags are removed. Removing by the "tilt-" prefix will work locally, but could it cause issues if multiple developers are using the registry?
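A rough sketch of the "one tag per docker_build()" idea for the local Docker side. It assumes the input is a list of `repository tag` pairs ordered newest first (which, as far as I know, is how `docker images` lists them). `stale_tags` is a hypothetical helper, not a Tilt feature; it prints everything except the newest tag of each repository.

```shell
# Hypothetical helper: stdin has "repository tag" lines, newest first.
# Prints "repository:tag" for every tag except the newest one per repository.
stale_tags() {
  awk 'seen[$1]++ { print $1 ":" $2 }'
}

# Example (local images only, tilt- tags only):
#   docker images --format '{{.Repository}} {{.Tag}}' \
#     | grep ' tilt-' | stale_tags | xargs -r docker rmi
```

This only touches the local image store, so it does nothing about a registry shared between several developers.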

dotellie commented 5 years ago

To add to this, for me at the very least, I found that using live_update wasn't worth it with Rust as the performance difference is negligible and it would require me to have a completely different build pipeline for development. Because of that, I always rebuild images on each change and have been hit with disk pressure quite a lot.

maiamcc commented 4 years ago

Hi all, sorry for the late follow-up, but do you happen to know what k8s cluster you were running on when you hit these errors? @dotellie @gaetansnl

dotellie commented 4 years ago

@maiamcc No worries! I assume you're asking because of minikube in my case? I'm running pretty much default settings on Linux with KVM, if that helps.

maiamcc commented 4 years ago

👋 hey folks! Tilt now has a built-in docker prune-er.

By default, the docker pruner runs once after startup (as soon as all of your pending builds are done) and once every hour thereafter, and prunes

This is out in the latest release, hopefully it'll help with @dotellie's issue -- tho @gaetansnl it seems like yours is potentially different and won't just be solved by docker images prune, is that correct?

gaetansnl commented 4 years ago

Hello. Thank you for this feature 🎉, I'll try it as soon as I can. It will at least partially solve the problem. If I understand correctly, images inside the registry are not removed, but it will at least avoid disk issues locally.

nicks commented 3 years ago

This becomes a bigger issue when you're using a local registry set up by ctlptl, because those images never get cleaned up :\

liskin commented 2 years ago

Not entirely sure if this is a recent development or not, but it seems that using a local registry is recommended/default for most setups (kind, k3d, minikube) — see examples in ctlptl README. And indeed, stuff in the local registry isn't being cleaned up: it isn't even possible to delete stuff from the local registry in its default configuration, and even if it was, a manual blob garbage collector invocation would be necessary afterwards. No wonder tilt doesn't attempt to clean up anything in the local registry.
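To check whether a registry container has deletion switched on, one option is to look at its environment. This sketch assumes deletion would be enabled through the `REGISTRY_STORAGE_DELETE_ENABLED` environment variable (it can also be set in the registry's config file, which this does not inspect). `delete_enabled` is a hypothetical helper, and `ctlptl-registry` is only an example container name.

```shell
# Hypothetical helper: reads KEY=VALUE lines on stdin and succeeds only if
# deletion was enabled through the environment.
delete_enabled() {
  grep -qx 'REGISTRY_STORAGE_DELETE_ENABLED=true'
}

# Example (container name is an assumption):
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' ctlptl-registry \
#     | delete_enabled && echo "deletes enabled via env" || echo "not enabled via env"
```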

On the other hand, tilt's FAQ claims that images are built in-cluster if possible. This certainly isn't the case with kind and a local registry, but the FAQ entry mentions Minikube as well. https://docs.tilt.dev/choosing_clusters.html#minikube says local registry is supported there, so I'm wondering if perhaps the FAQ entry is outdated? Are there any benefits of using a local registry instead of building in-cluster that I'm missing? Is this a speed vs space tradeoff?

Anyway, assuming deleting and blob GC in the registry itself is enabled (https://github.com/tilt-dev/ctlptl/issues/247), I think what tilt could do is this:

nicks commented 2 years ago

re: "I'm wondering if perhaps the FAQ entry is outdated" - the minikube team completely broke insecure registries in v1.26 - https://github.com/kubernetes/minikube/issues/14480 . This is very recent. I've been telling people to use v1.25 until it's fixed upstream.

nickzelei commented 10 months ago

Hey all, bumping this as it's been over a year since the last comment on this thread.

I'm using ctlptl to create a registry and cluster.

apiVersion: ctlptl.dev/v1alpha1
kind: Registry
name: neosync-dev-registry
---
apiVersion: ctlptl.dev/v1alpha1
kind: Cluster
product: kind
registry: neosync-dev-registry
name: kind-neosync-dev # must have kind- prefix

But roughly every few days (give or take, depending on how much active development I'm doing) I have to nuke my cluster and remove all of the images in docker to free up space.

I run this command: docker rmi -f $(docker images -aq) once everything is shut down to remove all images. Then I go into docker desktop and reclaim space using the "Disk usage" extension to further clean up my drive.
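A narrower variant of that cleanup would target only Tilt-built images instead of every image. This is a sketch, assuming Tilt's tags start with `tilt-` as mentioned earlier in this thread; `tilt_image_ids` is a hypothetical helper name.

```shell
# Hypothetical helper: stdin has "ID repository:tag" lines; prints the unique
# IDs of images whose tag (the part after the last colon) starts with "tilt-".
tilt_image_ids() {
  awk '{ n = split($2, parts, ":"); if (n > 1 && parts[n] ~ /^tilt-/) print $1 }' \
    | sort -u
}

# Example:
#   docker images --format '{{.ID}} {{.Repository}}:{{.Tag}}' \
#     | tilt_image_ids | xargs -r docker rmi -f
```

This leaves base images and anything not built by Tilt in place, so the next build should not have to pull everything again.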

Seems @liskin is dealing with the same issue that I'm having, and I'm wondering if there are any updates on how to get this rectified? I love using tilt and have been using it for some time, but this is probably my biggest pain with it as of today.

ianb-mp commented 5 months ago

If using podman, this will remove all images that have tags beginning with tilt-:

podman rmi -f $(podman image ls --format '{{.ID}}' '*:tilt-*')

ludwick commented 5 months ago

I just wanted to add how this issue affects our team. I've been monitoring my Docker Desktop (on mac) environment over time from "fresh" starts to try to figure out why disk space gets sucked up. We use Tilt with minikube and the docker registry (as mentioned in a previous comment) so Docker Desktop is running a container for the registry and one for tilt.

I've noticed that over time, even if I'm regularly doing various forms of prune:

docker system prune -a --volumes
docker volume prune -a
tilt docker-prune

eventually all the disk allocated to Docker Desktop is consumed, to the point where I can't rebuild a container on change without getting a "no space" error. Note that the last two don't really do anything if I do the first one, but I include them for completeness. My usual procedure once I start getting frequent "no space" issues despite pruning is to just restart it all: shut down tilt, stop its container, remove its container, stop and remove the local docker registry container, and then run all the prunes. Then I start up the whole thing again, let it rebuild our application images, and go back to whatever I was doing. Unsurprisingly, this starts with plenty of free space (even after the rebuild of our application images).

I decided not to restart the entire setup today when I started hitting "no space" issues. Instead, I first did a normal prune, which reclaimed a fair bit of space but not as much as is available after a "fresh" start. Then I exec-ed into the tilt container and ran a bunch of du -sh * commands on various directories to suss out where the tilt container was using up disk. Eventually I realized it was /var/lib/containerd/ and specifically /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs.

So I ran a containerd prune (crictl rmi --prune) from within the tilt container. This deleted a lot of stuff, but our application containers kept running fine, and it got the available disk space (as reported by Docker Desktop) back up to where it is after a "fresh" start of our Tilt setup. However, it also deleted a lot of stuff I wouldn't want it to: all the intermediate layers for the current "active" images, so the next rebuild of an application image missed the cache. But at least I didn't have to restart everything or lose the images stored in the local docker registry (which are mostly base images downloaded from external places).

So I think tilt is keeping intermediate layers for image versions that are no longer being used, and they add up over time. Many of the layers were many days old, so I don't think it's just that our default docker_prune_settings were keeping all the layers for the 2 most recent builds and anything from the last 6 hours. I especially don't think it's just the prune settings because, over many days of running tilt, each prune reclaims less and less. Nonetheless, I am setting docker_prune_settings max_age_minutes so it only keeps about an hour's worth instead of the default 6 hours.
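The directory hunt described above can be sketched as a small helper. `largest_dirs` is a hypothetical name; it only wraps `du`, `sort` and `head`, and the path in the example is the one from this comment.

```shell
# Hypothetical helper: list the N largest entries (sizes in KiB) directly
# under a directory, biggest first.
largest_dirs() {
  local dir="${1:-.}" count="${2:-10}"
  du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example, run inside the container:
#   largest_dirs /var/lib/containerd 5
```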

wuservices commented 5 months ago

Running out of space has been an issue for our team as well.

We've settled on periodically deleting everything in the registry and pruning everything, so we just run something like the following:

tilt down
docker exec -it -u root ctlptl-registry sh -c 'rm -r /var/lib/registry/docker/registry/v2/repositories/*; registry garbage-collect /etc/docker/registry/config.yml'
docker restart ctlptl-registry
docker exec -it kind-control-plane crictl rmi --prune
docker system prune -af
tilt up

Depending on your setup, this may not be safe. Not sure if we're making some bad assumptions, but it seems like there's some copy of data in Docker, Kind, and the local registry, so this tries to nuke the images (and CAUTION, potentially more in Docker).
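Since these commands are destructive, it may help to wrap them so they can be previewed first. The sketch below reuses the commands from this comment; `run`, `reset_tilt_storage` and the `DRY_RUN` variable are made up for the example, and `-it` is dropped on the assumption that the script may run without a terminal.

```shell
# Print each command instead of running it unless DRY_RUN=0 (dry run is the default).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reset_tilt_storage() {
  run tilt down
  # Wipe all repositories in the local registry, then reclaim the blobs.
  run docker exec -u root ctlptl-registry sh -c \
    'rm -r /var/lib/registry/docker/registry/v2/repositories/*; registry garbage-collect /etc/docker/registry/config.yml'
  run docker restart ctlptl-registry
  # Drop unused images from the kind node's containerd store.
  run docker exec kind-control-plane crictl rmi --prune
  # CAUTION: this prunes everything unused in Docker, not only Tilt's images.
  run docker system prune -af
  run tilt up
}
```

Calling `reset_tilt_storage` on its own only prints the six commands; `DRY_RUN=0 reset_tilt_storage` actually runs them.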

liskin commented 2 months ago

> re: "I'm wondering if perhaps the FAQ entry is outdated" - the minikube team completely broke insecure registries in v1.26 - kubernetes/minikube#14480 . This is very recent. I've been telling people to use v1.25 until it's fixed upstream.

This seems like a misunderstanding, I wasn't concerned about insecure registries at all, but rather about building in cluster. If I understand it correctly, a local registry isn't needed at all if images are built in cluster, as they're immediately available for k8s to run and don't need to be copied all around (push from docker to registry, then pull from registry to k8s).

The https://docs.tilt.dev/faq.html#q-all-the-tilt-examples-store-the-image-at-gcrio-isnt-it-really-slow-to-push-images-up-to-googles-remote-repository-for-local-development FAQ entry implies that building in cluster is desirable (faster!) and is the norm, but https://docs.tilt.dev/choosing_clusters confusingly talks a lot about local registries, making it seem that somehow not having to push/pull over the internet and only copying several gigabytes of data locally is all we can hope for.

So yeah I suppose "outdated" wasn't worded well. Should have gone for "confusing".

Anyway, now that almost 2 years have somehow managed to pass by while I wasn't looking (and I mean that quite literally, life happened…), the entry happens to be outdated indeed:

And yeah, now that I got back to this and dug a bit deeper to be able to answer, I realised that we (the company I work for where we happen to use tilt) don't configure stuff right, so despite using Colima, images aren't built in cluster, but instead get copied to and from a local registry, and therefore we run into this very problem discussed in this issue (https://github.com/tilt-dev/tilt/issues/2102#issuecomment-739984779). Perhaps for no good reason, because if we just used colima --kubernetes, everything would be fine? (well, I've been on Linux the whole time, just running kind in a rootless docker, so it wouldn't be fine for me, but it could be fine for everyone else)

https://docs.tilt.dev/choosing_clusters could use an entry for Colima btw… And for Orbstack :-)

liskin commented 2 months ago

> Seems @liskin is dealing with the same issue that I'm having, and I'm wondering if there are any updates on how to get this rectified? I love using tilt and have been using it for some time, but this is probably my biggest pain with it as of today.

Apologies for the late reply, but there hasn't been much news so perhaps I might have something useful still.

First, as I realised halfway through writing the above reply to Nick, one thing is that we're all probably using Tilt wrong. Apparently it can be used without a registry, building images directly in the k8s cluster, which saves space and time and also possibly makes this whole garbage collection problem go away. So if one doesn't need a registry for some other reason (and I don't know what that reason might be but I'm somewhat certain we don't have one), then configuring things so images aren't pushed/pulled to/from a registry would be the best way to avoid this problem.

If you, however, do run a local registry (like I do, because I happen to use ctlptl with kind on Linux, mostly because of being completely clueless about k8s and wanting to keep it in an isolated box as much as possible, so letting it use my host's docker is certainly not an option), then… well then you could do all sorts of weird shenanigans like we resorted to doing because we had no idea we could have avoided it.

Like… (and I'm not making this up, we really do all this)

So yeah it's dumb and I'm maybe a bit bitter about it. Now let me go and find out how to switch to building in cluster so I can forget about all this. :-)

ulevitsky commented 3 weeks ago

Hi @liskin, did you get anywhere with your quest? I'm in the same boat - kind's registry eating up disk space until there's none left, and the only guaranteed solution seems to be to nuke then recreate the cluster.

Not sure if it's me doing something wrong though; I would assume a system like Tilt would be better at managing its resources.