moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit

Strange resource busy issue on simple buildx using buildkit #3812

Open pimguilherme opened 1 year ago

pimguilherme commented 1 year ago

Hi! We are trying to run docker buildx via the GitHub Actions action, and we sometimes encounter this "resource busy" error when trying to unmount the /tmp/containerd-mount***** folder that is produced during the build.

We investigated the code a bit and found that the folder is actually created and mounted by code in the containerd project, but it seems to be triggered from buildkit, so we are not sure which component is the culprit here.

This problem is intermittent, but we were able to reproduce it by putting a lot of load on a node and then running the build manually on it.
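
For illustration, one generic way to put a node under combined CPU and I/O pressure before retrying the build is something like the following (this assumes the stress-ng tool, which is not part of our setup; any heavy workload seems to do):

# Saturate every CPU and add some I/O pressure for two minutes, then retry the build.
# stress-ng is only an example load generator.
stress-ng --cpu 0 --io 4 --timeout 120s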

Here's an example of how the issue manifests:

$ /usr/bin/docker buildx build --file docker-image-build-with-cache/test/workflow/fixtures/ok/Dockerfile --iidfile /tmp/docker-build-push-91Rxxm/iidfile --label REDACTED --provenance mode=min,inline-only=true,builder-id=REDACTED --tag REDACTED/infra-github-workflows-docker-image-build-with-cache:main --builder  builder-945a0db5-0c8d-4a0e-9cb8-03057735ab63 --metadata-file /tmp/docker-build-push-91Rxxm/metadata-file --no-cache docker-image-build-with-cache/test/workflow/fixtures/ok  --progress=plain

WARNING: No output specified with docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 1.01kB done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.1s

#3 [internal] load metadata for docker.io/library/python:3.11.1-slim
#3 DONE 0.3s

#4 [internal] load build context
#4 DONE 0.0s

#5 [ 1/16] FROM docker.io/library/python:3.11.1-slim@sha256:33a1008485e1a2dc565be79ece483b240cbc4d6266d6144a57a5a9965ede9bbf
#5 resolve docker.io/library/python:3.11.1-slim@sha256:33a1008485e1a2dc565be79ece483b240cbc4d6266d6144a57a5a9965ede9bbf 0.0s done
#5 DONE 0.1s

#4 [internal] load build context
#4 transferring context: 115.13kB done
#4 DONE 0.1s

#5 [ 1/16] FROM docker.io/library/python:3.11.1-slim@sha256:33a1008485e1a2dc565be79ece483b240cbc4d6266d6144a57a5a9965ede9bbf
#5 sha256:99cdff81380064f9516d72e3d10a7b37fd54bac693984c3b9d4d9c7adfd04d94 235B / 235B 0.0s done
#5 sha256:7648ab525d5ff693a70de6d5d6b9d918227b9ecf8cbd24b6b9aa04b088672232 3.35MB / 3.35MB 0.1s done
#5 sha256:5dda314a937ad03f8beac81c714da74e459b6174301368e0903ef586a68ae150 1.08MB / 1.08MB 0.1s done
#5 sha256:09f5525c15a3d4651899bec331ab29198e95fe6185cdbe7cd8f8410caa4f5771 12.08MB / 12.08MB 0.2s done
#5 sha256:01b5b2efb836d74b8b49da819514eca52e25290d1688db59420ffb9c6b65a03c 12.58MB / 31.40MB 0.2s
#5 sha256:01b5b2efb836d74b8b49da819514eca52e25290d1688db59420ffb9c6b65a03c 28.31MB / 31.40MB 0.3s
#5 sha256:01b5b2efb836d74b8b49da819514eca52e25290d1688db59420ffb9c6b65a03c 31.40MB / 31.40MB 0.5s
#5 sha256:01b5b2efb836d74b8b49da819514eca52e25290d1688db59420ffb9c6b65a03c 31.40MB / 31.40MB 0.5s done
#5 extracting sha256:01b5b2efb836d74b8b49da819514eca52e25290d1688db59420ffb9c6b65a03c
#5 extracting sha256:01b5b2efb836d74b8b49da819514eca52e25290d1688db59420ffb9c6b65a03c 3.6s done
#5 DONE 4.3s

#6 [ 2/16] RUN mount
#6 ERROR: failed to unmount /tmp/containerd-mount1594417594: failed to unmount target /tmp/containerd-mount1594417594: device or resource busy
------
 > [ 2/16] RUN mount:
------
Dockerfile:10
--------------------
   8 |     
   9 |     
  10 | >>> RUN mount
  11 |     
  12 |     RUN apt-get update && apt-get upgrade -y > /dev/null \
--------------------
ERROR: failed to solve: failed to compute cache key: failed to unmount /tmp/containerd-mount1594417594: failed to unmount target /tmp/containerd-mount1594417594: device or resource busy

This problem does not seem related to the RUN mount command itself, because it can happen with any command there. It seems to be related to the layer extraction that takes place beforehand.

Sometimes the build goes through if I run it again; I then use docker buildx prune -a -f to remove the cache and can hit the error again.

This is very strange; we had not come across this until about 23/03/2023, but we couldn't pinpoint it to any version upgrade: we tried locking all versions to an earlier date, when this error did not occur, and it didn't help.

These are the versions we are using:

### docker
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.10.4)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 23.0.4
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2806fc1057397dbaeefbea0e4e17bddfbd388f38
 runc version: v1.1.5-0-gf19387a
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 5.4.209-116.367.amzn2.x86_64
 Operating System: Alpine Linux v3.17 (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 30.58GiB
 Name: xxxx
 ID: a500bfb0-3bad-4980-8179-5c21fbc8535f
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: dlpco
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

### buildx inspect
Name:          builder-945a0db5-0c8d-4a0e-9cb8-03057735ab63
Driver:        docker-container
Last Activity: 2023-04-20 21:07:10 +0000 UTC

Nodes:
Name:      builder-945a0db5-0c8d-4a0e-9cb8-03057735ab630
Endpoint:  builders
Status:    running
Flags:     --debug
Buildkit:  v0.11.5
Platforms: linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386

### Containerd
containerd github.com/containerd/containerd v1.6.20 2806fc1057397dbaeefbea0e4e17bddfbd388f38

This is running on an AWS instance that serves as a Kubernetes node: Linux REDACTED 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Buildkit runs in a container inside a DinD container, in a pod on that node. Docker container: Linux REDACTED 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux. Buildkit image: moby/buildkit:buildx-stable-1 ("buildkitd --debug")

Here are some logs showing the behavior for a single failed buildx command: error.log

And here is the Dockerfile used: Dockerfile.txt

Please let me know if there's anything else I can provide.

Thanks!

pimguilherme commented 1 year ago

PS this bit might be important too.. we don't seem to be using containerd:

/ # docker logs a8349a34fa64 2>&1 | head -20
time="2023-04-20T20:27:26Z" level=info msg="auto snapshotter: using overlayfs"
time="2023-04-20T20:27:26Z" level=warning msg="using host network as the default"
time="2023-04-20T20:27:26Z" level=info msg="found worker \"htoig7vfhf20svvfpk39cn0a6\", labels=map[org.mobyproject.buildkit.worker.executor:oci org.mobyproject.buildkit.worker.hostname:a8349a34fa64 org.mobyproject.buildkit.worker.network:host org.mobyproject.buildkit.worker.oci.process-mode:sandbox org.mobyproject.buildkit.worker.selinux.enabled:false org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/amd64/v2 linux/amd64/v3 linux/amd64/v4 linux/386]"
time="2023-04-20T20:27:26Z" level=warning msg="skipping containerd worker, as \"/run/containerd/containerd.sock\" does not exist"
time="2023-04-20T20:27:26Z" level=info msg="found 1 workers, default=\"htoig7vfhf20svvfpk39cn0a6\""
time="2023-04-20T20:27:26Z" level=warning msg="currently, only the default worker can be used."
time="2023-04-20T20:27:26Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
tonistiigi commented 1 year ago

PS this bit might be important too.. we don't seem to be using containerd:

This does not matter. The mount code is in the containerd libraries that buildkit imports; actually running the containerd daemon is optional.

This usually means that some process had a file open under /tmp/containerd-mount1594417594 while BuildKit tried to unmount it. It could be that something in buildkit is doing that (or a leftover container that buildkit didn't clean up), but I've never seen a similar report before. Maybe something else running on the system is scanning this path and sometimes opening files from it?
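
One way to check for such a holder from inside the BuildKit container, while the mount still exists, is something along these lines (the path is the example from this report; the busybox versions of these tools give less detail than the util-linux fuser and standalone lsof):

# PIDs of processes with open files on the filesystem backing this mount.
# Note: for bind mounts, fuser -m reports any process using the same underlying
# filesystem, so expect false positives.
fuser -m /tmp/containerd-mount1594417594

# Full open-file listing filtered for the mount path (busybox lsof takes no
# path argument, so grep the output).
lsof | grep containerd-mount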

pimguilherme commented 1 year ago

Thanks @tonistiigi! I will investigate and let you know. It's not impossible that we have services scanning the system for security purposes; I will come back with more details.

pimguilherme commented 1 year ago

hi @tonistiigi! it's pretty strange.. we don't really have anything scanning this system (at least not the inner container where buildkit is run)

This is an example of lsof output, collected at 1s intervals while the build command ran: lsof-example.log

There's only one entry for /tmp/containerd-* there; maybe the files are only held open very briefly and I didn't catch the possible culprits. But what's most interesting is that after the buildx command runs and fails, we still cannot unmount or delete the directory, and nothing shows up as open under /tmp/:

/tmp # lsof
1   /usr/bin/buildkitd  0   /dev/null
1   /usr/bin/buildkitd  1   pipe:[107006108]
1   /usr/bin/buildkitd  2   pipe:[107006109]
1   /usr/bin/buildkitd  3   /dev/null
1   /usr/bin/buildkitd  4   anon_inode:[eventpoll]
1   /usr/bin/buildkitd  5   pipe:[106997293]
1   /usr/bin/buildkitd  6   pipe:[106997293]
1   /usr/bin/buildkitd  7   /var/lib/buildkit/buildkitd.lock
1   /usr/bin/buildkitd  8   socket:[107007323]
1   /usr/bin/buildkitd  9   /var/lib/buildkit/runc-overlayfs/containerdmeta.db
1   /usr/bin/buildkitd  10  /var/lib/buildkit/runc-overlayfs/metadata_v2.db
1   /usr/bin/buildkitd  11  /var/lib/buildkit/cache.db
1   /usr/bin/buildkitd  12  /var/lib/buildkit/history.db
1   /usr/bin/buildkitd  13  socket:[106876817]
1   /usr/bin/buildkitd  15  /var/lib/buildkit/runc-overlayfs/snapshots/metadata.db
114 /bin/busybox    0   /dev/pts/0
114 /bin/busybox    1   /dev/pts/0
114 /bin/busybox    2   /dev/pts/0
114 /bin/busybox    10  /dev/tty
/tmp # umount containerd-mount23225334
umount: can't unmount /tmp/containerd-mount23225334: Resource busy

Please note this is a Kubernetes environment, so the layering goes: Kubernetes node -> pod -> Docker container (docker:dind) -> buildkit container (moby/buildkit:buildx-stable-1).

Do you think it's possible that the outer layers are keeping the resource busy? I am not quite sure what is being mounted and from where.

/ # mount
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/KHPEIWW5SM5VFOAC3WKFHKM4AO:/var/lib/docker/overlay2/l/GH3BC2QTC5X37DR2XBYIJDQ27E:/var/lib/docker/overlay2/l/GJGQ3TOZZFULTQIF424B3F57AP:/var/lib/docker/overlay2/l/3A4X6PRGINXDAEUZ55YVIGHIRS:/var/lib/docker/overlay2/l/Y3CUM3YG3BH53UY4IPLFJBMD4C,upperdir=/var/lib/docker/overlay2/7c7459edf8c313debe8ed78f2753fe64b5e805ef8ba9d81bb6f48e6d5a815259/diff,workdir=/var/lib/docker/overlay2/7c7459edf8c313debe8ed78f2753fe64b5e805ef8ba9d81bb6f48e6d5a815259/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
/dev/nvme0n1p1 on /etc/resolv.conf type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme0n1p1 on /etc/hostname type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme0n1p1 on /etc/hosts type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme0n1p1 on /var/lib/buildkit type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme0n1p1 on /tmp/containerd-mount709539869 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme0n1p1 on /tmp/containerd-mount23225334 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme0n1p1 on /tmp/containerd-mount963839226 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/nvme0n1p1 on /tmp/containerd-mount2677359912 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

thanks for the assistance!

EDIT: example fuser output:

/ # fuser -m /tmp/containerd-mount23225334
1 2652

(2652 is the current shell)
pimguilherme commented 1 year ago

One strange note that might add to the investigation: if I use strace -f to follow the buildkitd process and collect details as the error is triggered, I cannot reproduce the issue (I haven't been able to in a few attempts).

This could perhaps indicate some sort of race condition that is coincidentally resolved by the delays strace adds here and there.

EDIT: I was able to capture the error while running strace -f -e trace=%file -p 1 -o strace.out: strace.out.gz

tonistiigi commented 1 year ago

docker container (docker:dind)

Maybe this is something to do with mount propagation. Can you make sure that /tmp is an actual tmpfs in your pod?

umount: can't unmount /tmp/containerd-mount23225334: Resource busy

If you can, look at whether there is anything inside that path and what those files are. /proc/<pid>/mountinfo might be a bit more verbose.
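
A minimal sketch of those checks from inside the pod / BuildKit container (the mount path is an example; findmnt comes from util-linux and may be missing in a busybox-based image):

# Is /tmp really a tmpfs, and what mount propagation does it use?
findmnt -o TARGET,FSTYPE,PROPAGATION /tmp

# Per-mount detail for the leaked mounts, including source path and peer groups.
grep containerd-mount /proc/1/mountinfo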

pimguilherme commented 1 year ago

Hi @tonistiigi! Thanks for getting back to me.

/tmp wasn't a tmpfs filesystem, but changing that didn't seem to fix it (I want to say it helped, but I'm not sure). Docker container within the pod: image

Inside the buildkit container I tried setting it to tmpfs too, with mount -t tmpfs tmpfs /tmp (which seems like a hacky way to do it, though): image

there's nothing inside /tmp/containerd-mount*:

/ # ls -la /tmp/containerd-mount134507496
total 0

However, /proc/1/mountinfo does provide a bit more detail:

/ # cat /proc/1/mountinfo  | grep containerd
7194 23194 259:1 /var/lib/docker/volumes/fbfd4ec22acbc1896c016803c6dc7263aa66d6b7a5ed53f32935482e3795ef0c/_data/volumes/buildx_buildkit_builder-64cabc2b-d2d1-46cd-bcf7-8bfae3f5f9c40_state/_data/runc-overlayfs/snapshots/snapshots/13/fs//deleted /tmp/containerd-mount134507496 rw,noatime master:290 - xfs /dev/nvme0n1p1 rw,attr2,inode64,logbufs=8,logbsize=32k,noquota
pimguilherme commented 1 year ago

Okay, it looks like your first hunch was correct! We are still going to test it thoroughly tomorrow, but it looks like a security scanner running permanently on the worker node is causing this (using the mountinfo output, it was possible to trace the /tmp/containerd-mount* dir back to the worker node).

In case we do need this scanner running, would there be a way to configure buildkit to wait for the unmount more gracefully?

thanks a lot!

pimguilherme commented 1 year ago

Okay, reporting back after tests: it looks like it was not the scanner (we turned it off and it didn't help; the scanner also apparently only acts as a sort of strace on system calls; it's called Falco, by the way).

the tests we are performing are as follows:

sergiomacedo commented 1 year ago

Hello @tonistiigi, did you manage to reproduce the problem?

nuzayets commented 1 year ago

We're also experiencing this issue, also running buildkit pods on EKS running AL2. Restarting the builders helps temporarily.

sergiomacedo commented 1 year ago

@nuzayets check this workaround.

https://github.com/actions/actions-runner-controller/discussions/2439#discussioncomment-5849676

nuzayets commented 1 year ago

@nuzayets check this workaround.

actions/actions-runner-controller#2439 (reply in thread)

Thanks for the link. You seem to be running with the default builders on the machine running the buildx executable? So, this sidesteps the issue entirely. We run a multi-platform build natively on our cluster, not in the CI environment's runners; our builders are buildkit pods running on both x86 and ARM nodes. So, switching to using the default builder in the CI environment's runner is not a workaround for us. We have far more storage & compute on the cluster than in CI, and cross-arch builds have been problematic for us in the past, not to mention time on the CI runner is a lot more expensive than time on our cluster.

sergiomacedo commented 1 year ago

I see. Well, I hope buildkit's devs can help you with this issue, then.

brandocomando commented 1 year ago

Having the same error when using the moby/buildkit:master-rootless Docker image in our self-hosted CI environment:

...
#21 DONE 0.1s

#22 exporting to oci image format
#22 exporting layers
#22 exporting layers 2.7s done
#22 ERROR: failed to unmount /run/user/1000/containerd-mount617712552: failed to unmount target /run/user/1000/containerd-mount617712552: device or resource busy
------
 > exporting to oci image format:
------
error: failed to solve: failed to unmount /run/user/1000/containerd-mount617712552: failed to unmount target /run/user/1000/containerd-mount617712552: device or resource busy
brandocomando commented 1 year ago

Did a few more tests last week. The error seemed to be more common when building in daemonless mode, but it did happen with buildkitd as well. I would really appreciate any input from the team here; I am consistently able to reproduce this issue and it's blocking some work I'm doing.
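
For context, "daemonless mode" here means driving the build through the wrapper shipped in the rootless image rather than a long-running buildkitd; a rough sketch of such an invocation (the flags below are illustrative, not copied from our CI):

# buildctl-daemonless.sh starts a temporary buildkitd, runs the build, and exits.
buildctl-daemonless.sh build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=oci,dest=/tmp/image.tar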

renilthomas commented 1 year ago

I am replacing DinD with buildkit, and in the initial tests I have hit this bug; it's easily reproducible.

The buildkit container logs:

worker-x-r8d3n-3nhw5 buildkit INFO[2023-06-26T11:42:35Z] found worker "l1easnkgz8z8tjqbjwo1jtqhs", labels=map[org.mobyproject.buildkit.worker.executor:oci org.mobyproject.buildkit.worker.hostname:worker-x-r8d3n-3nhw5 org.mobyproject.buildkit.worker.network:host org.mobyproject.buildkit.worker.oci.process-mode:sandbox org.mobyproject.buildkit.worker.selinux.enabled:false org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/amd64/v2 linux/amd64/v3 linux/386]
worker-x-r8d3n-3nhw5 buildkit WARN[2023-06-26T11:42:35Z] skipping containerd worker, as "/run/containerd/containerd.sock" does not exist
worker-x-r8d3n-3nhw5 buildkit INFO[2023-06-26T11:42:35Z] found 1 workers, default="l1easnkgz8z8tjqbjwo1jtqhs"
worker-x-r8d3n-3nhw5 buildkit WARN[2023-06-26T11:42:35Z] currently, only the default worker can be used.
worker-x-r8d3n-3nhw5 buildkit INFO[2023-06-26T11:42:35Z] running server on /run/buildkit/buildkitd.sock
worker-x-r8d3n-3nhw5 buildkit ERRO[2023-06-26T11:43:04Z] failed to remove mount temp dir               dir=/tmp/containerd-mount2463842917 error="remove /tmp/containerd-mount2463842917: device or resource busy" spanID=abc644840e7df512 traceID=08414c9e9ae90e51fddcab6dc60513a7
worker-x-r8d3n-3nhw5 buildkit ERRO[2023-06-26T11:43:08Z] /moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = failed to compute cache key: failed to unmount /tmp/containerd-mount2463842917: failed to unmount target /tmp/containerd-mount2463842917: device or resource busy
brandocomando commented 1 year ago

Anyone from the team able to take a look at this?

adawalli commented 1 year ago

We are also hitting this issue on 9/10 builds now.

EKS 1.24, Gitlab Runners in k8s, v0.11.1 b4df08551fb12eb2ed17d3d9c3977ca0a903415a

this used to be a non-issue for us until the last few weeks

What logs can we provide to help triage?

mlschindler commented 1 year ago

+1, we are also using the GitLab CI K8s executor on EKS 1.24. Same issue as @adawalli above.

brandocomando commented 1 year ago

Alright, ran a few more tests today. I was able to replicate with moby/buildkit:master-rootless on EKS 1.24, 1.25, 1.26, and 1.27 using the latest AL2 AMI.

I also tried the latest Ubuntu-flavored EKS worker node for EKS 1.27, which actually gave the worst results.

adawalli commented 1 year ago

@tonistiigi What can we give you to help shine a light on this issue? This is starting to break non-multiarch builds for us now.

FWIW, our build steps in GitLab are fairly straightforward:

.docker-build:
  extends:
    - .docker-dind
  before_script:
    - !reference [.docker-prep, before_script]
    - docker context create kamino-context
    - > 
      docker buildx create \
        --driver docker-container \
        --use \
        --name kamino \
        --bootstrap \
        kamino-context
    - docker buildx inspect
  script:
    - >
      docker buildx build \
        -f ${CONTEXT}/${DOCKERFILE} \
        --cache-from ${CI_REGISTRY_IMAGE} \
        --cache-to ${CI_REGISTRY_IMAGE} \
        -t ${CI_REGISTRY_IMAGE}:${DOCKER_TAG}  \
        ${CONTEXT}
$ docker buildx inspect
Name:          kamino
Driver:        docker-container
Last Activity: 2023-07-11 19:35:24 +0000 UTC
Nodes:
Name:      kamino0
Endpoint:  kamino-context
Status:    running
Buildkit:  v0.11.6
Platforms: linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386
Labels:
 org.mobyproject.buildkit.worker.executor:         oci
 org.mobyproject.buildkit.worker.hostname:         fe9c6afd115c
 org.mobyproject.buildkit.worker.network:          host
 org.mobyproject.buildkit.worker.oci.process-mode: sandbox
 org.mobyproject.buildkit.worker.selinux.enabled:  false
 org.mobyproject.buildkit.worker.snapshotter:      overlayfs
GC Policy rule#0:
 All:           false
 Filters:       type==source.local,type==exec.cachemount,type==source.git.checkout
 Keep Duration: 48h0m0s
 Keep Bytes:    488.3MiB
GC Policy rule#1:
 All:           false
 Keep Duration: 1440h0m0s
 Keep Bytes:    9.313GiB
GC Policy rule#2:
 All:        false
 Keep Bytes: 9.313GiB
GC Policy rule#3:
 All:        true
 Keep Bytes: 9.313GiB
sh-4.2$ uname -a
Linux <REDACTED>nvpro 5.4.247-162.350.amzn2.x86_64 #1 SMP Tue Jun 27 22:03:59 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
#9 exporting content cache
#9 preparing build cache for export
#9 preparing build cache for export 9.7s done
#9 ERROR: mount callback failed on /tmp/containerd-mount3449153598: failed to unmount /tmp/containerd-mount1958014075: failed to unmount target /tmp/containerd-mount1958014075: device or resource busy
------
 > importing cache manifest from <server>/gitlab-tester:
------
------
 > exporting content cache:
------
WARNING: No output specified with docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
WARNING: buildx: git was not found in the system. Current commit information was not captured by the build
ERROR: failed to solve: mount callback failed on /tmp/containerd-mount3449153598: failed to unmount /tmp/containerd-mount1958014075: failed to unmount target /tmp/containerd-

FWIW, it still fails with caching "disabled", e.g.:

  script:
    - >
      docker buildx build \
        --no-cache \
        -f ${CONTEXT}/${DOCKERFILE} \
        -t ${CI_REGISTRY_IMAGE}:${DOCKER_TAG}  \
        --push \
        ${CONTEXT}
tonistiigi commented 1 year ago

What can we give you to help shine light on this issue?

A reproducer. Looks like this is likely related to the specific environment or maybe rootless.

adawalli commented 1 year ago

What can we give you to help shine light on this issue?

A reproducer. Looks like this is likely related to the specific environment or maybe rootless.

We are not running rootless. We are running on a vanilla AWS EKS cluster with the only caveat being that we route through a direct connect.

Are there no logs we can provide that would be of interest to you?

adawalli commented 1 year ago

What DaemonSets are folks running here? I found that the issue has 100% disappeared for me after removing the Datadog agent, which definitely snoops on the container mounts. I am going to continue testing to make sure this isn't built on hopium, but take a critical look at your DaemonSets.
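
A quick way to do that audit with plain kubectl (nothing here is specific to our setup; the node name is a placeholder):

# List every DaemonSet in the cluster; look for agents that inspect container
# filesystems (security scanners, APM / network monitors, etc.).
kubectl get daemonsets --all-namespaces

# See which of those pods actually run on the node hosting the builder.
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>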

brandocomando commented 1 year ago

I am using Datadog also! I'll try turning it off today and see if I can replicate it.

brandocomando commented 1 year ago

So, no definitive proof yet, but I ran 10 concurrent builds twice in a row without the Datadog DaemonSet and saw no errors. I turned Datadog back on, ran 10 concurrent builds, and got this error in 1 of the 10 builds. I'm planning to do a few more tests next week.

adawalli commented 1 year ago

So, no definitive proof yet, but I ran 10 concurrent builds twice in a row without the Datadog DaemonSet and saw no errors. I turned Datadog back on, ran 10 concurrent builds, and got this error in 1 of the 10 builds. I'm planning to do a few more tests next week.

We were failing 10/10 times before this change, but I need to try a few hundred times and on more nodes to be confident.

adawalli commented 1 year ago

I have now run over 1000 jobs (20 pipelines of 50 concurrent builds) and the issue has not popped up a single time. As soon as I re-enable the Datadog agent, I fail 9/10 (or more) times.

I have a ticket open with DD to find out how to disable snooping on the container layers. I will update my comment here when I find something out.

brandocomando commented 1 year ago

Awesome job @adawalli! I'll open a ticket with our account as well. Would it be a good idea to post something in their public repo too?

adawalli commented 1 year ago

Awesome job @adawalli! I'll open a ticket with our account as well. Would it be a good idea to post something in their public repo too?

The only reason I haven't raised an issue in their GitHub is that I haven't figured out exactly which integration (e.g., CRI vs. containerd vs. k8s vs. Docker) is causing the issue. I will do that at some point if I don't make any headway with support, though. Keep me posted if the issue appears to be resolved on your end with further testing 🙏

brandocomando commented 1 year ago

Gotcha. Yeah, same here: if you find anything out, let me know.

sergiomacedo commented 1 year ago

Well, in our scenario we didn't have Datadog, but we do have the Trivy security scanner. However, when we were running our tests, we disabled the security scans and the problem still happened.

(edit: formatting)

brandocomando commented 1 year ago

@sergiomacedo do you have any other daemon sets that might be monitoring container mounts?

brandocomando commented 1 year ago

@adawalli Narrowed it down a bit today. I turned off "Enable Universal Service Monitoring" on the Datadog agents and had no issues building; turned it back on and the error reappeared.
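
For anyone running the Helm-managed agent, a sketch of flipping the same toggle from the CLI (the release/chart names and the datadog.serviceMonitoring.enabled key are assumptions based on the standard Datadog chart; verify them against your values file):

# Disable Universal Service Monitoring on a Helm-deployed Datadog agent.
# The key name is assumed from the Datadog Helm chart's USM setting; double-check it.
helm upgrade datadog datadog/datadog \
  --reuse-values \
  --set datadog.serviceMonitoring.enabled=false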

sergiomacedo commented 1 year ago

@sergiomacedo do you have any other daemon sets that might be monitoring container mounts?

Trivy and Falco. However, as I said, they were disabled during the tests to narrow down the problem.

I can't reproduce it anymore because we dropped buildkit altogether and just replicated the build commands on our own.

adawalli commented 1 year ago

@adawalli Narrowed it down a bit today. I turned off "Enable Universal Service Monitoring" on the Datadog agents and had no issues building; turned it back on and the error reappeared.

I am reproducing your results. I think it's time to take this to a Datadog GitHub issue. I will begin that conversation and link back here.