Closed Antiarchitect closed 5 months ago
Spent some time looking into this, it's specifically related to the codepath here: https://github.com/moby/buildkit/blob/2713491176c5f9fa64dd26ea7efa2f20557db576/solver/jobs.go#L808-L812
This happens when `CacheMap.PreprocessFunc` is set (which could explain why we maybe only seem to see this on Copy operations, but the code indicates it could also happen on Exec operations).
One of two things seems to be happening:
- the entry was never added to `actives` — I don't think this can be possible? The scheduler "shouldn't" let this happen.
- the entry was removed from `actives` in a scenario where we shouldn't.

For what it's worth, I have experienced this sporadically. If it helps, I've only seen it with `buildx bake`-related builds; could be a coincidence, and maybe that's actually the assumption from the description in this issue 👍🏼
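The second hypothesis can be sketched as a minimal Go program. This is illustrative only — the types and names below are hypothetical stand-ins, not BuildKit's actual `solver/jobs.go` code: a shared `actives` map loses an entry while another code path still expects to find it, producing the same shape of error reported in this thread.

```go
// Sketch of the suspected failure mode (hypothetical types, not BuildKit code):
// an entry is released from a shared "actives" map while a preprocess /
// slow-cache step still expects to look it up.
package main

import (
	"fmt"
	"sync"
)

type state struct{ index int }

type solver struct {
	mu      sync.Mutex
	actives map[string]*state // keyed by vertex name (hypothetical)
}

func (s *solver) getState(key string) (*state, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.actives[key]
	if !ok {
		// The same error shape seen in the reports in this thread.
		return nil, fmt.Errorf("failed to get state for index 0 on %s", key)
	}
	return st, nil
}

func main() {
	s := &solver{actives: map[string]*state{"copy-op": {index: 0}}}

	// Another job releases the shared vertex before our lookup runs.
	s.mu.Lock()
	delete(s.actives, "copy-op")
	s.mu.Unlock()

	if _, err := s.getState("copy-op"); err != nil {
		fmt.Println(err)
	}
}
```

If something like this is happening, the bug would be in the release/refcount logic for shared (merged) vertices rather than in the lookup itself, which matches the observation that it only shows up under concurrent builds.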
Yeah, that tracks with my observations as well - the issue definitely seems related to edge merging / shared vertices, which happens when multiple build requests are run at the same time, and this causes merges in the internal solve graph when you have similar nodes.
We see this when we have a large shared runner that lots of builds share - you could also see this with bake.
I'm headed out on holiday for a bit, so I won't have a lot of time to spend digging into this for a while. But as a general request: if anyone in this thread is experiencing this issue and has a reliable reproducer (ideally minimal, but not necessarily), that would go a long way toward tracking this down; currently we really only have the error message and some hunches to go on.
I don't have a reproducer, but I think I just hit this on Ubuntu 22.04 LTS with Docker 26.0.0, freshly reinstalled on Monday (6 days ago), without any involvement of bake:
...
#34 [builder 16/22] COPY --link /sourcecode ./sourcecode
#34 DONE 0.1s
#35 [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .
#35 ERROR: failed to get state for index 0 on [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .
------
> [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .:
------
ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .
@jedevc is this log hitting the Exec operation you mention in https://github.com/moby/buildkit/issues/3635#issuecomment-1977205713?
The reason I reinstalled docker was https://github.com/moby/moby/issues/46136, so I removed the previous Docker 26.0.0 deb packages, did `rm -rf /var/lib/docker`, and then installed the packages again according to the instructions.
The machines build many different images, and quite a few of those images are based on the same Dockerfile with different ARGs (the failing excerpt above is from building that Dockerfile). The Dockerfile has a builder image, a base image of common things, and prod/dev images FROM the base image. The prod/dev images are built in parallel. There are nightly jobs that build all images with `--no-cache`, and some of those jobs have enough images that they build multiple base images in parallel.
All builds have `SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)` and are done with `docker build` using the default builder; nobody has ever executed bake on the machines.
I don't know if it relates to this bug, but there is something interesting with space usage on the machine that got the error:
$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 92 0 32.98GB 32.98GB (100%)
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 1553 17 42.9GB 41.87GB
$ sudo du -hs /var/lib/docker
96G /var/lib/docker
$ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
32.98+42.9
75.88
For comparison, on my local machine (Ubuntu 24.04, Docker 26.0.0):
$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 43 6 38.04GB 29.2GB (76%)
Containers 7 7 2.249GB 0B (0%)
Local Volumes 33 0 1.029GB 1.029GB (100%)
Build Cache 455 0 0B 0B
$ sudo du -hs /var/lib/docker
56G /var/lib/docker
$ bc -l
38.04+2.249+1.029
41.318
so whatever else is stored in /var/lib/docker just takes ~15G on my machine, but 20G on the machine where I got the cache error.
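To make the comparison explicit, the arithmetic from the two `bc` sessions can be put side by side. The numbers below are copied from the `docker system df` and `du` outputs above:

```go
// Reproduces the bc arithmetic above: what `docker system df` accounts for
// versus what `du -hs /var/lib/docker` reports, on both machines.
package main

import "fmt"

func main() {
	// Machine with the cache error: Images + Build Cache (GB), vs du's 96G.
	ciReported := 32.98 + 42.9 // 75.88
	ciGap := 96 - ciReported   // roughly 20 GB not attributed by docker system df

	// Local machine: Images + Containers + Local Volumes (GB), vs du's 56G.
	localReported := 38.04 + 2.249 + 1.029 // 41.318
	localGap := 56 - localReported         // roughly 15 GB unaccounted for

	fmt.Printf("CI machine: %.2f GB reported, %.2f GB gap\n", ciReported, ciGap)
	fmt.Printf("local machine: %.3f GB reported, %.3f GB gap\n", localReported, localGap)
}
```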
Detailed version info for the machine that got the cache error:
$ docker version
Client: Docker Engine - Community
Version: 26.0.0
API version: 1.45
Go version: go1.21.8
Git commit: 2ae903e
Built: Wed Mar 20 15:17:48 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 26.0.0
API version: 1.45 (minimum version 1.24)
Go version: go1.21.8
Git commit: 8b79278
Built: Wed Mar 20 15:17:48 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.28
GitCommit: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0
$ docker info
Client: Docker Engine - Community
Version: 26.0.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.13.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.25.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 211
Server Version: 26.0.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-101-generic
Operating System: Ubuntu 22.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.751GiB
Name: <machine>
ID: <id>
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://<mirror>/
Live Restore Enabled: false
@pjonsson curious; what does `du` show if you also use the `-x` / `--one-file-system` option? It's possible it's double-counting mounts.
https://linux.die.net/man/1/du
-x, --one-file-system
skip directories on different file systems
Here's a quick example on a machine I had running;
docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 1 1 9.89MB 0B (0%)
Containers 1 1 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 0 0 0B 0B
du -hs /var/lib/docker/overlay2/
20M /var/lib/docker/overlay2/
root@swarm-test-01:~# du -xhs /var/lib/docker/overlay2/
9.6M /var/lib/docker/overlay2/
The machines have been running since my comment last week, and they run `apt-get upgrade` on a nightly basis, so they are now using Docker 26.0.1. The Docker 26.0.0 that I had last week was installed with a non-existent /var/lib/docker, so it's something that comes with the docker deb packages that is responsible for the content in that directory. With that said, here's the current state:
# docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 22 0 9.174GB 9.174GB (100%)
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 1407 0 42.59GB 42.59GB
# du -hs /var/lib/docker/overlay2
102G /var/lib/docker/overlay2
# du -xhs /var/lib/docker/overlay2
102G /var/lib/docker/overlay2
so the difference is ~50G now.
@thaJeztah my apologies if I came across as trying to tell you where bugs are in the system; I don't know that at all, it just appears that something involving the (in layman's terms) "storage" must be wrong. And just this morning one machine failed to build images, and even after a reboot it says `Error response from daemon: error getting build cache usage: failed to get usage for 1xeqlft7tbtfb9veq92jdzws8: layer does not exist` when I run `docker system df`.
I understand that you would prefer to have a reproducer, but it's difficult to reproduce the conditions of a CI machine. Is there anything else we could do to provide more information that would help you? I saw the Go race detector mentioned in some other thread, and I would personally be happy to switch the CI machines to some other apt source with debug packages for a couple of weeks (if the performance of Docker stays within 5x) and provide logs to you.
Edit: ran `docker system prune` on the machine that couldn't find the layer, and it apparently still has 280 build cache entries (?) summing up to 0 bytes:
# docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 22 0 20.51GB 20.51GB (100%)
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 280 0 0B 0B
# du -xhs /var/lib/docker/overlay2
45G /var/lib/docker/overlay2
Docker version information:
# docker info
Client: Docker Engine - Community
Version: 26.0.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.13.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.26.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
...
I ran into this today:
#18 DONE 15.6s
#19 [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]
#19 ERROR: failed to get state for index 0 on [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]
xref: https://gitlab.wikimedia.org/repos/releng/scap/-/jobs/278321
The client is `buildctl` and the server is `buildkitd` v0.13.2.
The situation matches what @jedevc described above: 6 separate jobs started at the same time, many of them probably hitting the same `buildkitd`.
The log entry from `buildkitd` is about the same:
time="2024-06-06T16:42:49Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
time="2024-06-06T16:42:50Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
time="2024-06-06T16:42:58Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = failed to compute cache key: failed to get state for index 0 on [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]"
The intermittent `failed to get state for index` should be solved by https://github.com/moby/buildkit/pull/4887
We are still facing the same issue using the latest docker 26.1.4 release with buildkit 0.14.1 unfortunately.
We also had the same issue. Previously we were building a few containers per agent in GitHub Actions:
- name: Build images
run: docker compose build ${{ inputs.service }}
- name: Push images
run: docker compose push ${{ inputs.service }}
After replacing this with `docker buildx bake --push ${{ inputs.service }}` it worked twice.
Every run that came later got this error at least once:
ERROR: failed to solve: DeadlineExceeded: failed to compute cache key: failed to copy: httpReadSeeker: failed open: failed to authorize: no active session for ...: context deadline exceeded
We are building 21 containers across 4 agents, so each agent has around 5 containers to build. It was nice, and faster than docker compose, but it just did not work.
Docker version: 27.0.3, buildx version: v0.15.1 (1c1dbb2)
Original issue: https://github.com/docker/buildx/issues/1308, but according to the code it's buildkit throwing this error in `CalcSlowCache`. Using `docker buildx bake` we sporadically get (on different build steps):