cpuguy83 opened 2 years ago
Looks the same as https://github.com/grpc/grpc-go/issues/4722 😞
If this is indeed the grpc issue, then that was fixed in this PR: https://github.com/moby/buildkit/pull/3940. So it's worth a try to see whether it's fixed in v0.12.
This is still an issue with 0.12 + buildx at least. Exporting the content to the client gets stuck in random places.
$ docker buildx inspect
Name: amazing_panini
Driver: docker-container
Last Activity: 2023-07-27 18:17:00 +0000 UTC
Nodes:
Name: amazing_panini0
Endpoint: unix:///var/run/docker.sock
Driver Options: image="moby/buildkit:v0.12.0"
Status: running
Buildkit: v0.12.0
Platforms: linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386
Labels:
org.mobyproject.buildkit.worker.executor: oci
org.mobyproject.buildkit.worker.hostname: 3bf52c7ae3e9
org.mobyproject.buildkit.worker.network: host
org.mobyproject.buildkit.worker.oci.process-mode: sandbox
org.mobyproject.buildkit.worker.selinux.enabled: false
org.mobyproject.buildkit.worker.snapshotter: overlayfs
GC Policy rule#0:
All: false
Filters: type==source.local,type==exec.cachemount,type==source.git.checkout
Keep Duration: 48h0m0s
Keep Bytes: 488.3MiB
GC Policy rule#1:
All: false
Keep Duration: 1440h0m0s
Keep Bytes: 12.11GiB
GC Policy rule#2:
All: false
Keep Bytes: 12.11GiB
GC Policy rule#3:
All: true
Keep Bytes: 12.11GiB
$ docker buildx build --output=_out - <<<"FROM golang:1.18"
[+] Building 4.4s (4/5) docker-container:amazing_panini
[+] Building 277.7s (4/5) docker-container:amazing_panini
=> [internal] load build definition from Dockerfile 0.1s
=> => transferring dockerfile: 54B 0.0s
=> [internal] load metadata for docker.io/library/golang:1.18 0.4s
=> [internal] load .dockerignore 0.1s
=> => transferring context: 2B 0.0s
=> [1/1] FROM docker.io/library/golang:1.18@sha256:50c889275d26f816b5314fc99f55425fa76b18fcaf16af255f5d57f09e1f48da 11.1s
=> => resolve docker.io/library/golang:1.18@sha256:50c889275d26f816b5314fc99f55425fa76b18fcaf16af255f5d57f09e1f48da 0.0s
=> => sha256:cc7973a07a5b4a44399c5d36fa142f37bb343bb123a3736357365fd9040ca38a 156B / 156B 0.1s
=> => sha256:56261d0e6b05ece42650b14830960db5b42a9f23479d868256f91d96869ac0c2 10.88MB / 10.88MB 0.3s
=> => sha256:f049f75f014ee8fec2d4728b203c9cbee0502ce142aec030f874aa28359e25f1 5.16MB / 5.16MB 0.2s
=> => sha256:9bd150679dbdb02d9d4df4457d54211d6ee719ca7bc77747a7be4cd99ae03988 54.58MB / 54.58MB 0.8s
=> => sha256:bfcb68b5bd105d3f88a2c15354cff6c253bedc41d83c1da28b3d686c37cd9103 85.98MB / 85.98MB 1.6s
=> => sha256:bbeef03cda1f5d6c9e20c310c1c91382a6b0a1a2501c3436b28152f13896f082 55.03MB / 55.03MB 1.4s
=> => sha256:06d0c5d18ef41fa1c2382bd2afd189a01ebfff4910b868879b6dcfeef46bc003 141.98MB / 141.98MB 3.2s
=> => extracting sha256:bbeef03cda1f5d6c9e20c310c1c91382a6b0a1a2501c3436b28152f13896f082 1.0s
=> => extracting sha256:f049f75f014ee8fec2d4728b203c9cbee0502ce142aec030f874aa28359e25f1 0.1s
=> => extracting sha256:56261d0e6b05ece42650b14830960db5b42a9f23479d868256f91d96869ac0c2 0.2s
=> => extracting sha256:9bd150679dbdb02d9d4df4457d54211d6ee719ca7bc77747a7be4cd99ae03988 1.1s
=> => extracting sha256:bfcb68b5bd105d3f88a2c15354cff6c253bedc41d83c1da28b3d686c37cd9103 1.3s
=> => extracting sha256:06d0c5d18ef41fa1c2382bd2afd189a01ebfff4910b868879b6dcfeef46bc003 3.7s
=> => extracting sha256:cc7973a07a5b4a44399c5d36fa142f37bb343bb123a3736357365fd9040ca38a 0.0s
=> exporting to client directory 277.2s
=> => copying files 280.96MB 277.2s
ERROR: failed to solve: Canceled: context canceled
This is still an issue with 0.12 + buildx at least.
Sad, but now that I think about it again, I guess that PR only applied a fix to the grpc streams for gateway containers, so it's possible the same issue needs to be fixed in other parts of the codebase.
When you repro it, if you send SIGQUIT to the buildkitd process (or possibly the buildkit client process too, in case it's manifesting there somehow) to get a stack trace, you will see goroutines blocked on a stack containing google.golang.org/grpc/internal/transport.(*writeQuota).get (or similar; grpc and Quota are the key things to look for) if that is indeed the issue. Otherwise I'd guess it's something else entirely.
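For anyone trying this: a plain Go binary already exits with a full goroutine dump when it receives SIGQUIT. If you want the same dump from a still-running client process without killing it, a minimal standard-library sketch along these lines works (the function name and the sleeping main are purely illustrative); in the output, goroutines parked in the writeQuota.get frame are what point at the grpc flow-control issue described above.

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

// dumpStacksOnQuit installs a SIGQUIT handler that writes a full
// goroutine dump to stderr without terminating the process.
func dumpStacksOnQuit() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGQUIT)
	go func() {
		for range ch {
			// debug=2 prints every goroutine with its complete stack,
			// in the same format as an unrecovered panic.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}

func main() {
	dumpStacksOnQuit()
	// Stand-in for the real work (driving a build/export); send
	// `kill -QUIT <pid>` to this process to trigger a dump.
	time.Sleep(time.Hour)
}
```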
It seems like this is a problem only when many files are exported. If only there was a way to specify what files should be exported instead of the whole filesystem…
This issue still persists. I encounter it whenever filesystems larger than roughly 500-600 MB are exported. I've tried it natively on Linux/x86_64 with buildkitd running as a systemd service; on Linux/x86_64 via containerd; and on darwin/arm64 via Docker Desktop.
Is there any progress on this issue?
Hi, is there any workaround for this problem? I am experiencing this behaviour (stuck at exporting to client directory) in GitHub Actions.
Experiencing this problem as well. Exporting to client directory takes forever.
A temporary solution we found which works is to use the tar exporter (client.ExporterTar), and then simply un-tar the resulting file. A needless extra processing step, but it does not result in the deadlock.
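For reference, a rough sketch of that workaround using the moby/buildkit Go client is below; the socket address, output file name, and the dockerfile.v0 frontend are assumptions for illustration, not the exact setup of the comment above. The resulting tarball can then be unpacked with tar -xf out.tar -C _out, and the buildx CLI equivalent is roughly --output type=tar,dest=out.tar.

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"github.com/moby/buildkit/client"
)

func main() {
	ctx := context.Background()

	// Assumed buildkitd address; adjust for your environment.
	c, err := client.New(ctx, "unix:///run/buildkit/buildkitd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	opt := client.SolveOpt{
		Frontend: "dockerfile.v0",
		LocalDirs: map[string]string{
			"context":    ".",
			"dockerfile": ".",
		},
		Exports: []client.ExportEntry{{
			// Stream the result as a single tarball instead of using
			// the local (directory) exporter that gets stuck.
			Type: client.ExporterTar,
			Output: func(map[string]string) (io.WriteCloser, error) {
				return os.Create("out.tar")
			},
		}},
	}

	// Drain progress events; a real integration would render them.
	status := make(chan *client.SolveStatus)
	go func() {
		for range status {
		}
	}()

	if _, err := c.Solve(ctx, nil, opt, status); err != nil {
		log.Fatal(err)
	}
	// Unpack afterwards, e.g. `tar -xf out.tar -C _out`.
}
```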
I've found I can fairly easily deadlock an export operation. This happens with both buildx and docker (I built a custom buildx to get a stack trace out), and with the latest buildkit as well as the built-in moby buildkit in 20.10.
Here are the stack traces: https://gist.github.com/cpuguy83/88efed6bacae5355aa6ba0b63d3b4faa I've collapsed these down so that duplicate goroutines are just one entry.
On the buildx side there are thousands of goroutines waiting for a mutex lock to read (5100 goroutines) or write (2700 goroutines) to the grpc conn.
The goroutine holding the lock on the buildx side seems to be:
On the dockerd side, we seem to be holding the write lock here:
I haven't looked into the latest buildkit stack trace yet.