moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0

[BUG] ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on #3635

Closed: Antiarchitect closed this issue 5 months ago

Antiarchitect commented 1 year ago

Original issue: https://github.com/docker/buildx/issues/1308, but according to the code it is BuildKit that throws this error, in CalcSlowCache.

Using docker buildx bake, we sporadically get this on different build steps:

> [assets] importing cache manifest from xxx/yyy/zzz:main--assets:
------
ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on [assets 3/9] COPY --from=cache /cache/node-modules /app/apps/xxx/yyy/node_modules
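For reference, the setup is roughly of this shape (a simplified, hypothetical sketch; target names, stage names, and registry refs are placeholders, not our real config):

# Hypothetical minimal bake setup: several targets importing cache from a
# registry, solved concurrently by a single daemon.
cat > docker-bake.hcl <<'EOF'
target "assets" {
  dockerfile = "Dockerfile"
  target     = "assets"
  cache-from = ["type=registry,ref=xxx/yyy/zzz:main--assets"]
}
target "web" {
  dockerfile = "Dockerfile"
  target     = "web"
  cache-from = ["type=registry,ref=xxx/yyy/zzz:main--web"]
}
EOF
docker buildx bake assets web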
$ docker version
Client:
 Version:           23.0.1
 API version:       1.42
 Go version:        go1.19.5
 Git commit:        a5ee5b1
 Built:             Thu Feb  9 19:45:43 2023
 OS/Arch:           linux/amd64
 Context:           default
Server: Docker Engine - Community
 Engine:
  Version:          23.0.1
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.5
  Git commit:       bc3805a
  Built:            Thu Feb  9 19:48:50 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.16
  GitCommit:        31aa4358a36870b21a992d3ad2bef29e1d693bec
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.16.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 65
 Server Version: 23.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 31aa4358a36870b21a992d3ad2bef29e1d693bec
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 6.1.10-1.el7.elrepo.x86_64
 Operating System: Alpine Linux v3.17 (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 125.6GiB
 Name: app-gitlab-runners-docker-5
 ID: c9f224dc-3a07-449d-9c82-846b9433ba8b
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://dockerhub.enapter.ninja/
 Live Restore Enabled: false
 Product License: Community Engine
 Default Address Pools:
   Base: 10.10.0.0/16, Size: 24
WARNING: API is accessible on http://0.0.0.0:2375/ without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/go/attack-surface/
jedevc commented 8 months ago

Spent some time looking into this; it's specifically related to the codepath here: https://github.com/moby/buildkit/blob/2713491176c5f9fa64dd26ea7efa2f20557db576/solver/jobs.go#L808-L812

This happens when CacheMap.PreprocessFunc is set (which could explain why we seem to see this mostly on the Copy operation, though the code indicates it could also happen on the Exec operation).
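For readers without the tree checked out, that branch looks roughly like this (paraphrased and simplified from the permalink above, not verbatim):

// Paraphrased from solver/jobs.go (see permalink above); simplified, not exact.
// When a PreprocessFunc p is set, the solver first looks up the state of
// the input edge before computing the slow cache key. If that state is no
// longer registered in the solver, st is nil and the error below is exactly
// what clients report.
if p != nil {
    st := s.st.solver.getState(s.st.vtx.Inputs()[index])
    if st == nil {
        return "", errors.Errorf("failed to get state for index %d on %s", index, s.st.vtx.Name())
    }
    // ... preprocessing continues with st ...
}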

One of two things seems to be happening:

lucasteligioridis commented 8 months ago

For what it's worth, I have experienced this sporadically too. If it helps, I've only seen it with buildx bake related builds; that could be a coincidence, and maybe it matches the assumption in this issue's description 👍🏼

jedevc commented 8 months ago

Yeah, that tracks with my observations as well. The issue definitely seems related to edge merging / shared vertices: merging happens when multiple build requests run at the same time and the internal solve graph contains similar nodes.

We see this when we have a large shared runner that lots of builds share - you could also see this with bake.

I'm headed out on holiday for a bit, so I won't have a lot of time to spend digging into this for a while. But as a general request: if anyone in this thread is experiencing this issue and has a reliable reproducer (ideally minimal, but that's not required), it would go a long way toward tracking this down. Currently we really only have the error message and some hunches to go on.
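To be concrete, I'd expect a reproducer attempt to be shaped something like this (an untested sketch, not a confirmed reproducer):

# Untested sketch: fire several near-identical builds of the same
# Dockerfile at one daemon at once, so their vertices can be merged
# in the solve graph.
for i in $(seq 1 8); do
  docker build --build-arg VARIANT="$i" -t repro:"$i" . &
done
wait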

pjonsson commented 7 months ago

I don't have a reproducer, but I think I just hit this on Ubuntu 22.04 LTS with Docker 26.0.0, freshly reinstalled on Monday (6 days ago), without any involvement of bake:

...
#34 [builder 16/22] COPY --link /sourcecode ./sourcecode
#34 DONE 0.1s
#35 [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .
#35 ERROR: failed to get state for index 0 on [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .
------
 > [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .:
------
ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .

@jedevc is this log hitting the Exec operation you mention in https://github.com/moby/buildkit/issues/3635#issuecomment-1977205713?

The reason I reinstalled Docker was https://github.com/moby/moby/issues/46136, so I removed the previous Docker 26.0.0 deb packages, did rm -rf /var/lib/docker, and then installed the packages again according to the instructions.

The machines build many different images, and quite a few of those images are based on the same Dockerfile with different ARGs (the failing excerpt above is from building that Dockerfile). The Dockerfile has a builder image, a base image of common things, and prod/dev images FROM the base image. The prod/dev images are built in parallel. There are nightly jobs that build all images with --no-cache, and some of those jobs have enough images to build that they build multiple base images in parallel.

All builds have SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct) and are done with docker build using the default builder; nobody has ever executed bake on the machines.
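Concretely, the invocations are of roughly this shape (simplified; image names, targets, and ARGs are placeholders, not our real ones):

# Simplified shape of the CI builds described above; names are made up.
export SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)
docker build --target prod --build-arg FLAVOR=prod -t app:prod . &
docker build --target dev  --build-arg FLAVOR=dev  -t app:dev  . &
wait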

I don't know if it relates to this bug, but there is something interesting with space usage on the machine that got the error:

$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          92        0         32.98GB   32.98GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     1553      17        42.9GB    41.87GB
$ sudo du -hs /var/lib/docker
96G /var/lib/docker
$ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
32.98+42.9
75.88

For comparison, on my local machine (Ubuntu 24.04, Docker 26.0.0):

$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          43        6         38.04GB   29.2GB (76%)
Containers      7         7         2.249GB   0B (0%)
Local Volumes   33        0         1.029GB   1.029GB (100%)
Build Cache     455       0         0B        0B
$ sudo du -hs /var/lib/docker
56G /var/lib/docker
$ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
38.04+2.249+1.029
41.318

so whatever else is stored in /var/lib/docker just takes ~15G on my machine, but 20G on the machine where I got the cache error.

Detailed version info for the machine that got the cache error:

$ docker version
Client: Docker Engine - Community
 Version:           26.0.0
 API version:       1.45
 Go version:        go1.21.8
 Git commit:        2ae903e
 Built:             Wed Mar 20 15:17:48 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.0.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       8b79278
  Built:            Wed Mar 20 15:17:48 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ docker info
Client: Docker Engine - Community
 Version:    26.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.25.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 211
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-101-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.751GiB
 Name: <machine>
 ID: <id>
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://<mirror>/
 Live Restore Enabled: false
thaJeztah commented 7 months ago

@pjonsson curious: what does du show if you also use the -x / --one-file-system option? It's possible it's double-counting mounts.

https://linux.die.net/man/1/du

-x, --one-file-system

skip directories on different file systems

Here's a quick example on a machine I had running:

docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          1         1         9.89MB    0B (0%)
Containers      1         1         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B

du -hs /var/lib/docker/overlay2/
20M /var/lib/docker/overlay2/
root@swarm-test-01:~# du -xhs /var/lib/docker/overlay2/
9.6M    /var/lib/docker/overlay2/
pjonsson commented 7 months ago

The machines have been running since my comment last week, and they run apt-get upgrade on a nightly basis, so they are now using Docker 26.0.1. The Docker 26.0.0 I had last week was installed with no pre-existing /var/lib/docker, so whatever is responsible for the content in that directory comes with the Docker deb packages. With that said, here's the current state:

# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          22        0         9.174GB   9.174GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     1407      0         42.59GB   42.59GB
# du -hs /var/lib/docker/overlay2
102G    /var/lib/docker/overlay2
# du -xhs /var/lib/docker/overlay2
102G    /var/lib/docker/overlay2

so the difference is ~50G now.

@thaJeztah my apologies if I came across as trying to tell you where the bugs are in the system; I don't know that at all. It just appears that something involving the (in layman's terms) "storage" must be wrong. And just this morning one machine failed to build images; even after a reboot it says Error response from daemon: error getting build cache usage: failed to get usage for 1xeqlft7tbtfb9veq92jdzws8: layer does not exist when I run docker system df.

I understand that you would prefer to have a reproducer, but it's difficult to reproduce the conditions of a CI machine. Is there anything else we could do to provide more information that would help you? I saw the Go race detector mentioned in some other thread, and personally I would be happy to switch the CI machines to some other apt source with debug packages for a couple of weeks (if the performance of Docker stays within 5x) and provide logs to you.
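(I assume a race-detector build would look something like the sketch below for the standalone daemon; this is just my guess, I don't know the project's actual debug-build process:)

# Sketch: build buildkitd with the Go race detector enabled.
# Assumes a Go toolchain; expect a significant slowdown and extra memory.
git clone https://github.com/moby/buildkit.git
cd buildkit
go build -race -o buildkitd ./cmd/buildkitd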

Edit: I ran docker system prune on the machine that couldn't find the layer; it apparently still has 280 build cache entries (?) summing up to 0 bytes:

# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          22        0         20.51GB   20.51GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     280       0         0B        0B
# du -xhs /var/lib/docker/overlay2
45G     /var/lib/docker/overlay2

Docker version information:

# docker info
Client: Docker Engine - Community
 Version:    26.0.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.26.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
..
dancysoft commented 5 months ago

I ran into this today:

#18 DONE 15.6s
#19 [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]
#19 ERROR: failed to get state for index 0 on [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]

xref: https://gitlab.wikimedia.org/repos/releng/scap/-/jobs/278321

The client is buildctl and the server is buildkitd v0.13.2.

The situation matches what @jedevc described above: 6 separate jobs started at the same time, many of them probably hitting the same buildkitd.
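(Roughly, each job does something like the following against the shared daemon; the address and paths are placeholders:)

# Placeholder sketch of the concurrent jobs: several buildctl clients
# driving the same buildkitd at the same time.
for job in 1 2 3 4 5 6; do
  buildctl --addr tcp://buildkitd.internal:1234 build \
    --frontend dockerfile.v0 \
    --local context=. --local dockerfile=. &
done
wait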

The log entries from buildkitd are about the same:

time="2024-06-06T16:42:49Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
time="2024-06-06T16:42:50Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
time="2024-06-06T16:42:58Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = failed to compute cache key: failed to get state for index 0 on [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]"
tonistiigi commented 5 months ago

The intermittent "failed to get state for index" error should be solved by https://github.com/moby/buildkit/pull/4887

snaumenko-st commented 5 months ago

We are still facing the same issue using the latest Docker 26.1.4 release with BuildKit 0.14.1, unfortunately.

jfheins commented 3 months ago

We also had the same issue. Previously we were building a few containers per agent in GitHub Actions:

    - name: Build images
      run: docker compose build ${{ inputs.service }}

    - name: Push images
      run: docker compose push ${{ inputs.service }}
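
The replacement was a single bake step, roughly like this (a sketch; only the command itself is verbatim from our change):

    - name: Build and push images
      run: docker buildx bake --push ${{ inputs.service }}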

After this replacement it worked twice. Every run that came later got this error at least once:

ERROR: failed to solve: DeadlineExceeded: failed to compute cache key: failed to copy: httpReadSeeker: failed open: failed to authorize: no active session for ...: context deadline exceeded

We are building 21 containers across 4 agents, so each agent has about 5 containers to build. It was nice and faster than docker compose, but it just did not work.

Docker version: 27.0.3 buildx version: v0.15.1 1c1dbb2