moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit

Deadlock exporting data #2950

Open cpuguy83 opened 2 years ago

cpuguy83 commented 2 years ago

I've found I can fairly easily deadlock an export operation. This happens with either buildx or docker (I built a custom buildx to get a stack trace out), and with the latest buildkit as well as the built-in moby buildkit in 20.10.

```
$ ./bin/buildx build --output=_out - <<<"FROM golang:1.18"
```

Here are the stack traces: https://gist.github.com/cpuguy83/88efed6bacae5355aa6ba0b63d3b4faa (collapsed so that duplicate goroutines appear as a single entry).

On the buildx side there are thousands of goroutines waiting on a mutex to read from (5100 goroutines) or write to (2700 goroutines) the gRPC connection.

The goroutine holding the lock on the buildx side seems to be:

```
goroutine 18864 [select, 1 minutes]: 1 times: [[18864]
google.golang.org/grpc/internal/transport.(*serverHandlerTransport).do(0x1?, 0x1c2f8a0?)
	/src/vendor/google.golang.org/grpc/internal/transport/handler_server.go:176 +0x65
google.golang.org/grpc/internal/transport.(*serverHandlerTransport).Write(0xc000649290, 0xc000347200, {0xc00067cfd0, 0x5, 0x5}, {0xc00067cfcb, 0x5, 0x5}, 0x0?)
	/src/vendor/google.golang.org/grpc/internal/transport/handler_server.go:291 +0x11c
google.golang.org/grpc.(*serverStream).SendMsg(0xc00077d2c0, {0x1dfd240?, 0xc0083f7140})
	/src/vendor/google.golang.org/grpc/stream.go:1512 +0x1bc
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*serverStream).SendMsg(0xc000536b70, {0x1dfd240, 0xc0083f7140})
	/src/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:366 +0x42
github.com/tonistiigi/fsutil.(*syncStream).SendMsg(0xc0008fe0d8, {0x1dfd240?, 0xc0083f7140?})
	/src/vendor/github.com/tonistiigi/fsutil/send.go:205 +0x73
github.com/tonistiigi/fsutil.(*receiver).asyncDataFunc(0xc0006126e0, {0x21aeba0, 0xc000931d40}, {0xc0054ce240, 0x3e}, {0x219a560?, 0xc001ce9ae0})
	/src/vendor/github.com/tonistiigi/fsutil/receive.go:249 +0x1d9
github.com/tonistiigi/fsutil.(*DiskWriter).processChange(0xc00052c540, 0xc000816f08?, {0xc0054ce240, 0x3e}, {0x21b22f0?, 0xc00b6f6720?}, {0x219a560?, 0xc001ce9ae0?})
	/src/vendor/github.com/tonistiigi/fsutil/diskwriter.go:233 +0xea
github.com/tonistiigi/fsutil.(*DiskWriter).requestAsyncFileData.func1()
	/src/vendor/github.com/tonistiigi/fsutil/diskwriter.go:209 +0xf9
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/src/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x67
created by golang.org/x/sync/errgroup.(*Group).Go
	/src/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x8d
```

On the dockerd side, we seem to be holding the write lock here:

```
goroutine 44040 [select]: 1 times: [[44040]
github.com/docker/docker/vendor/google.golang.org/grpc/internal/transport.(*writeQuota).get(0xc0017d0be0, 0x800e)
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/internal/transport/flowcontrol.go:59 +0x74
github.com/docker/docker/vendor/google.golang.org/grpc/internal/transport.(*http2Client).Write(0xc0001cda40, 0xc0003ae000, {0xc000a8b800, 0x5, 0x5}, {0xc00474c000, 0x8009, 0x8009}, 0xc000a8b805)
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/internal/transport/http2_client.go:862 +0x225
github.com/docker/docker/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc001df8800, {0x5594e4797f20, 0xc0020cca20}, {0xc000a8b800, 0x5, 0x5}, {0xc00474c000, 0x10000c002148b90, 0x8009}, {0xc00474c000, ...})
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:845 +0x2ae
github.com/docker/docker/vendor/google.golang.org/grpc.(*clientStream).SendMsg.func2(0xc0001d2400)
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:710 +0x6a
github.com/docker/docker/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc00155ab40, 0xc001ffdc70, 0xc002148ca0)
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:590 +0x2f6
github.com/docker/docker/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc00155ab40, {0x5594e4797f20, 0xc0020cca20})
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:716 +0x3b2
github.com/docker/docker/vendor/github.com/tonistiigi/fsutil.(*syncStream).SendMsg(0xc000b86d50, {0x5594e4797f20, 0xc0020cca20})
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/github.com/tonistiigi/fsutil/send.go:205 +0x73
github.com/docker/docker/vendor/github.com/tonistiigi/fsutil.(*fileSender).Write(0xc000e28c20, {0xc002aa8000, 0x8000, 0x8000})
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/github.com/tonistiigi/fsutil/send.go:191 +0xb5
io.copyBuffer({0x5594e483b7c0, 0xc000e28c20}, {0x5594e483cf20, 0xc0012a6038}, {0xc002aa8000, 0x8000, 0x8000})
	/usr/local/go/src/io/io.go:425 +0x204
io.CopyBuffer({0x5594e483b7c0, 0xc000e28c20}, {0x5594e483cf20, 0xc0012a6038}, {0xc002aa8000, 0xc002148ee0, 0x3})
	/usr/local/go/src/io/io.go:396 +0x3c
github.com/docker/docker/vendor/github.com/tonistiigi/fsutil.(*sender).sendFile(0xc00202a7e0, 0xc00115e000)
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/github.com/tonistiigi/fsutil/send.go:138 +0x1b4
github.com/docker/docker/vendor/github.com/tonistiigi/fsutil.(*sender).run.func2()
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/github.com/tonistiigi/fsutil/send.go:75 +0x9a
github.com/docker/docker/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1()
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x67
created by github.com/docker/docker/vendor/golang.org/x/sync/errgroup.(*Group).Go
	/build/moby-engine/.gopath/src/github.com/docker/docker/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x92
```

I haven't looked into the latest buildkit stack trace yet.

tonistiigi commented 2 years ago

Looks the same as https://github.com/grpc/grpc-go/issues/4722 😞

sipsma commented 1 year ago

If this is indeed the gRPC issue, then it was fixed in this PR: https://github.com/moby/buildkit/pull/3940. So it's worth trying v0.12 to see if it's fixed there.

cpuguy83 commented 1 year ago

This is still an issue with 0.12 + buildx at least. Exporting the content to the client gets stuck in random places.

cpuguy83 commented 1 year ago

```
$ docker buildx inspect
Name:          amazing_panini
Driver:        docker-container
Last Activity: 2023-07-27 18:17:00 +0000 UTC

Nodes:
Name:           amazing_panini0
Endpoint:       unix:///var/run/docker.sock
Driver Options: image="moby/buildkit:v0.12.0"
Status:         running
Buildkit:       v0.12.0
Platforms:      linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386
Labels:
 org.mobyproject.buildkit.worker.executor:         oci
 org.mobyproject.buildkit.worker.hostname:         3bf52c7ae3e9
 org.mobyproject.buildkit.worker.network:          host
 org.mobyproject.buildkit.worker.oci.process-mode: sandbox
 org.mobyproject.buildkit.worker.selinux.enabled:  false
 org.mobyproject.buildkit.worker.snapshotter:      overlayfs
GC Policy rule#0:
 All:           false
 Filters:       type==source.local,type==exec.cachemount,type==source.git.checkout
 Keep Duration: 48h0m0s
 Keep Bytes:    488.3MiB
GC Policy rule#1:
 All:           false
 Keep Duration: 1440h0m0s
 Keep Bytes:    12.11GiB
GC Policy rule#2:
 All:        false
 Keep Bytes: 12.11GiB
GC Policy rule#3:
 All:        true
 Keep Bytes: 12.11GiB
$ docker buildx build  --output=_out - <<<"FROM golang:1.18"
[+] Building 4.4s (4/5)                                                                                    docker-container:amazing_panini
[+] Building 277.7s (4/5)                                                                                  docker-container:amazing_panini
 => [internal] load build definition from Dockerfile                                                                                  0.1s
 => => transferring dockerfile: 54B                                                                                                   0.0s
 => [internal] load metadata for docker.io/library/golang:1.18                                                                        0.4s
 => [internal] load .dockerignore                                                                                                     0.1s
 => => transferring context: 2B                                                                                                       0.0s
 => [1/1] FROM docker.io/library/golang:1.18@sha256:50c889275d26f816b5314fc99f55425fa76b18fcaf16af255f5d57f09e1f48da                 11.1s
 => => resolve docker.io/library/golang:1.18@sha256:50c889275d26f816b5314fc99f55425fa76b18fcaf16af255f5d57f09e1f48da                  0.0s
 => => sha256:cc7973a07a5b4a44399c5d36fa142f37bb343bb123a3736357365fd9040ca38a 156B / 156B                                            0.1s
 => => sha256:56261d0e6b05ece42650b14830960db5b42a9f23479d868256f91d96869ac0c2 10.88MB / 10.88MB                                      0.3s
 => => sha256:f049f75f014ee8fec2d4728b203c9cbee0502ce142aec030f874aa28359e25f1 5.16MB / 5.16MB                                        0.2s
 => => sha256:9bd150679dbdb02d9d4df4457d54211d6ee719ca7bc77747a7be4cd99ae03988 54.58MB / 54.58MB                                      0.8s
 => => sha256:bfcb68b5bd105d3f88a2c15354cff6c253bedc41d83c1da28b3d686c37cd9103 85.98MB / 85.98MB                                      1.6s
 => => sha256:bbeef03cda1f5d6c9e20c310c1c91382a6b0a1a2501c3436b28152f13896f082 55.03MB / 55.03MB                                      1.4s
 => => sha256:06d0c5d18ef41fa1c2382bd2afd189a01ebfff4910b868879b6dcfeef46bc003 141.98MB / 141.98MB                                    3.2s
 => => extracting sha256:bbeef03cda1f5d6c9e20c310c1c91382a6b0a1a2501c3436b28152f13896f082                                             1.0s
 => => extracting sha256:f049f75f014ee8fec2d4728b203c9cbee0502ce142aec030f874aa28359e25f1                                             0.1s
 => => extracting sha256:56261d0e6b05ece42650b14830960db5b42a9f23479d868256f91d96869ac0c2                                             0.2s
 => => extracting sha256:9bd150679dbdb02d9d4df4457d54211d6ee719ca7bc77747a7be4cd99ae03988                                             1.1s
 => => extracting sha256:bfcb68b5bd105d3f88a2c15354cff6c253bedc41d83c1da28b3d686c37cd9103                                             1.3s
 => => extracting sha256:06d0c5d18ef41fa1c2382bd2afd189a01ebfff4910b868879b6dcfeef46bc003                                             3.7s
 => => extracting sha256:cc7973a07a5b4a44399c5d36fa142f37bb343bb123a3736357365fd9040ca38a                                             0.0s
 => exporting to client directory                                                                                                   277.2s
 => => copying files 280.96MB                                                                                                       277.2s
ERROR: failed to solve: Canceled: context canceled
```

sipsma commented 1 year ago

> This is still an issue with 0.12 + buildx at least.

Sad, but now that I think about it again, I guess that PR only applied a fix to the grpc streams for gateway containers, so it's possible the same issue needs to be fixed in other parts of the codebase.

When you reproduce it, send SIGQUIT to the buildkitd process (or possibly to the buildkit client process too, in case it's manifesting there somehow) to get a stack trace. If this is indeed the issue, you will see goroutines blocked on a stack containing google.golang.org/grpc/internal/transport.(*writeQuota).get (or something similar; grpc and Quota are the key things to look for). Otherwise I'd guess it's something else entirely.
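
For example, with the docker-container driver, something along these lines should produce the goroutine dump in the builder container's logs (the container name below is just a guess based on the builder name above; check docker ps for yours):

```
# Find the buildkitd container created by the docker-container driver
docker ps --filter name=buildx_buildkit

# Send SIGQUIT to buildkitd; the Go runtime dumps all goroutine stacks
# to stderr before exiting, so the dump ends up in the container logs
docker kill --signal=SIGQUIT buildx_buildkit_amazing_panini0

# Look for blocked grpc flow-control frames in the dump
docker logs buildx_buildkit_amazing_panini0 2>&1 | grep -A5 writeQuota
```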

AlekSi commented 1 year ago

It seems like this is a problem only when many files are exported. If only there were a way to specify which files should be exported instead of the whole filesystem…
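
For what it's worth, one way to shrink what the local exporter has to copy (a sketch, not a fix for the deadlock itself) is to make the final stage contain only the files you care about. For example, with the local exporter as in the original repro:

```
docker buildx build --output=_out - <<'EOF'
FROM golang:1.18 AS build
# ... build steps would go here ...

# Final stage holds only the files to export, so the local exporter
# copies just these instead of the whole golang:1.18 filesystem
FROM scratch
COPY --from=build /usr/local/go/bin/gofmt /gofmt
EOF
```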

nderjung commented 8 months ago

This issue still persists. I encounter it whenever filesystems larger than roughly 500-600MB are exported. I've tried natively on Linux/x86_64 via a systemd-managed process, Linux/x86_64 via containerd, and darwin/arm64 via Docker Desktop.

ls-2018 commented 6 months ago

Is there any progress on this issue?

SquirrelDeveloper commented 3 months ago

Hi, is there any workaround for this problem? I am experiencing this behaviour (stuck at "exporting to client directory") in GitHub Actions.

JRazek commented 2 months ago

Experiencing this problem as well. Exporting to client directory takes forever.

nderjung commented 2 months ago

A temporary workaround we found is to use the tar exporter (client.ExporterTar) and then simply untar the resulting file. It's a needless extra processing step, but it does not result in the deadlock.
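
A rough CLI equivalent of that workaround (file and directory names here are only placeholders) would be:

```
# Export the build result as a single tarball instead of a directory tree
docker buildx build --output type=tar,dest=out.tar - <<<"FROM golang:1.18"

# Then unpack it locally
mkdir -p _out
tar -xf out.tar -C _out
```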