Open dimikot opened 1 month ago
@tonistiigi maybe you have some hints on why this is happening? I looked at the source code, at the places that call updateLastUsed(), and found that it's called in release(). And release() is called from many places... Maybe cacheManager.prune() in cache/manager.go does it unintentionally somewhere?
I also tried downgrading to 27.2.0 on Linux (both docker-ce and docker-ce-cli); it didn't help, same effect.
From your output, I see --builder=container, which means that you're likely using a custom containerised builder (created through docker buildx create). In that case buildx is not using the BuildKit instance that's compiled into the Docker Engine, but a fully separate BuildKit daemon that runs inside a container. Downgrading Docker in that case likely won't downgrade BuildKit (as it's separate). You can use docker buildx inspect --builder=container to get information about the version of BuildKit running in the container.
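As a rough illustration of the check suggested above, the following sketch pulls the BuildKit version line out of the text printed by docker buildx inspect. The function names and the sample text are hypothetical, and the exact label of the version line is an assumption (the pattern accepts both "Buildkit:" and "BuildKit version:"), so adjust it to whatever your buildx actually prints.

```python
import re
import subprocess
from typing import List

# Assumption: the version appears on a line labelled "Buildkit:" or
# "BuildKit version:"; the label may differ between buildx releases.
_VERSION_LINE = re.compile(
    r"^\s*buildkit(?:\s+version)?:\s*(\S+)", re.IGNORECASE | re.MULTILINE
)


def buildkit_versions(inspect_output: str) -> List[str]:
    """Return every BuildKit version string found in the inspect output."""
    return _VERSION_LINE.findall(inspect_output)


def inspect_builder(builder: str = "container") -> str:
    """Run `docker buildx inspect` for the given builder and return its stdout."""
    result = subprocess.run(
        ["docker", "buildx", "inspect", f"--builder={builder}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample text, not real output from any machine.
SAMPLE = """Name:   container
Driver: docker-container

Nodes:
Name:      container0
Status:    running
BuildKit version: v0.0.0-example
"""

found = buildkit_versions(SAMPLE)
print(found)
```

On a machine with the builder running, buildkit_versions(inspect_builder("container")) would be the corresponding call.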
Somewhat orthogonal to this ticket, but if the reason you're running a separate builder is to build multi-arch/multi-platform images, and if you have an environment to test on, then it's worth considering enabling the containerd image store.
With the containerd image store enabled, the Docker Engine can store multi-platform images, and data used for build-cache and images is shared between BuildKit and the Docker Engine, which can save storage, as well as improve performance.
In either case, this looks to be an issue related to BuildKit, so let me transfer it to the BuildKit issue tracker
@thaJeztah
if the reason you're running a separate builder is to build multi-arch/multi-platform images
Not only for that. It's also to be able to build on dev MacBooks without forcing all devs to manually change their docker-desktop configs (if I understand how it all works correctly). Plus to have customizable gc settings in buildkitd.toml (again, without any need for anyone to tweak their docker-desktop). And there is another reason: the builder, when running as a separate container, stores all its caches in a volume, and we have an fs-based infra in CI which can back up and restore volumes of arbitrary sizes blazingly fast, like 10-20 times faster than --cache-to/--cache-from could even imagine. But it's unrelated to the ticket, I think.
The builder container is created with:
docker buildx create --name container --driver=docker-container --buildkitd-config=buildkitd.toml --bootstrap
Also, I've just published an open-source tool https://github.com/dimikot/docker-buildx-cache/ which works around this behavior and can also print cache layers hierarchically.
Sorry, accidentally closed. Reopening.
Description
When running e.g.
after deleting some cache layers, the parents of those cache layers are marked as "last used". This does not let it prune the entire subtree: instead, it prunes only one leaf layer at a time.
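To make the described effect concrete, here is a toy model in Python. It is not BuildKit's actual code: the Record type, the prune_pass function and the touch_parent switch are all hypothetical, and the model simply assumes that deleting a record may refresh its parent's "last used" time, as this report suspects. It shows how that single side effect turns a whole-subtree prune into a one-leaf-per-run prune.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    id: str
    parent: Optional[str]
    last_used: float  # seconds on a toy clock


def prune_pass(records: Dict[str, Record], now: float, until: float,
               touch_parent: bool) -> List[str]:
    """Delete childless records idle for more than `until` seconds.

    Deleting a leaf can expose its parent as a new leaf, so the loop
    repeats until nothing more qualifies. With touch_parent=True the
    parent's last_used is reset to `now` on every deletion (the
    suspected behavior), which stops the cascade.
    """
    removed: List[str] = []
    progress = True
    while progress:
        progress = False
        parents = {r.parent for r in records.values() if r.parent}
        for rec in list(records.values()):
            if rec.id in parents:
                continue  # still referenced by a child
            if now - rec.last_used <= until:
                continue  # used too recently for the filter
            del records[rec.id]
            removed.append(rec.id)
            if touch_parent and rec.parent in records:
                records[rec.parent].last_used = now
            progress = True
            break  # the set of leaves changed; recompute it
    return removed


def make_chain() -> Dict[str, Record]:
    # root <- parent <- leaf, all last used at t=0
    return {
        "root": Record("root", None, 0.0),
        "parent": Record("parent", "root", 0.0),
        "leaf": Record("leaf", "parent", 0.0),
    }


without_touch = prune_pass(make_chain(), now=180.0, until=25.0, touch_parent=False)
with_touch = prune_pass(make_chain(), now=180.0, until=25.0, touch_parent=True)
print(without_touch)  # ['leaf', 'parent', 'root']
print(with_touch)     # ['leaf']
```

In the second run the parent survives only because its timestamp was refreshed during the prune itself, which matches the "one leaf layer at a time" symptom.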
Reproduce
This is actually very hard to reproduce, so I'm providing a screenshot from a real CI run. I built a quick Python tool which renders the results of du and prune as a tree and adds colors.
I run du. Look at cache id=byc4z0pb2ba29tm25nqbrdcpk (underlined in red) and its parent lac46mmewlr8bqd5f7ii95hgd (underlined in green). They were both last used 3 minutes ago.
Then, docker buildx prune --filter="until=25s" removes the old unreferenced caches, and it removes the red cache byc4z0pb2ba29tm25nqbrdcpk (which is correct). For some reason, it doesn't remove its green parent lac46mmewlr8bqd5f7ii95hgd (although it theoretically should).
After pruning, I run du again, and look what happened to the green parent lac46mmewlr8bqd5f7ii95hgd (follow the arrows): it is now "Last used 1 second ago"! (As a reminder, before pruning it was "Last used 3 minutes ago".) I.e. prune does update the timestamp of a cache it doesn't remove. I think this may also be the reason why it doesn't delete that green parent: since it touches it, it no longer treats it as "older than 25s".
Expected behavior
docker version
docker info
Additional Info
What I'm trying to achieve with all this is to keep only the layer caches related to the latest, most recent build and prune everything else. Theoretically, docker buildx prune --filter="until=${until}s" should do it (where until = now() - build_start_timestamp), and in fact it seems to do so on e.g. macOS (docker 27.2.0) with my test Dockerfile. But in practice, probably due to the effect explained above (marking unrelated caches as "recently used" on Linux and with a real heavy Dockerfile), it doesn't work as expected.
I also tried to downgrade to 27.2.0 on Linux (both docker-ce and docker-ce-cli); it didn't help, same effect.
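The until = now() - build_start_timestamp computation from the description can be sketched as follows. The helper name prune_command is hypothetical and the builder name "container" is taken from this report. Rounding the age up is a choice made here, so that records touched right at the start of the build are not accidentally older than the filter.

```python
import math
import time
from typing import List, Optional


def prune_command(build_start: float, now: Optional[float] = None,
                  builder: str = "container") -> List[str]:
    """Build a `docker buildx prune` command that keeps records used since build_start.

    build_start and now are Unix timestamps in seconds.
    """
    if now is None:
        now = time.time()
    # Round up so the filter never reaches into the current build's window.
    until = max(0, math.ceil(now - build_start))
    return [
        "docker", "buildx", "prune",
        f"--builder={builder}",
        "--force",
        f"--filter=until={until}s",
    ]


# A build that started 25.2 seconds before "now".
example = prune_command(build_start=1000.0, now=1025.2)
print(" ".join(example))
```

The returned list can be passed to subprocess.run(...) as is. Whether the prune then actually removes everything older than the build is exactly what this issue is about.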