moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0

Rootless on Bottlerocket failed with "failed to mount /run/user/1000/containerd-mount3852074643: operation not permitted" #4667

Closed AhmadMS1988 closed 5 months ago

AhmadMS1988 commented 8 months ago

Hi fellows in buildkit. I know this may have been opened multiple times before, like here and here, but I will try to raise it again with more details so you may be able to help more. I am trying to run buildkit in rootless mode on EKS using Bottlerocket. The infra information is below:

  1. Arch: arm64
  2. buildkit image: moby/buildkit:rootless
  3. Bottlerocket OS 1.19.1 (aws-k8s-1.28)
  4. k8s version: v1.28.5-eks-5e0fdde

The below pod definition is used:

apiVersion: v1
kind: Pod
metadata:
  name: buildkitd
  annotations:
    container.apparmor.security.beta.kubernetes.io/buildkitd: unconfined
spec:
  nodeSelector:
    workload: runners
  containers:
    - name: buildkitd
      image: moby/buildkit:rootless
      args:
        - --addr
        - tcp://0.0.0.0:1234
        - --oci-worker-no-process-sandbox
        - --debug
      securityContext:
        seccompProfile:
          type: Unconfined
        runAsUser: 1000
        runAsGroup: 1000
      volumeMounts:
        - mountPath: /home/user/.local/share/buildkit
          name: buildkitd
    - name: runner
      image: moby/buildkit:rootless
      command: [ "/bin/sh", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      env:
        - name: BUILDKIT_HOST
          value: tcp://localhost:1234
  volumes:
    - name: buildkitd
      emptyDir: {}

Note that the runner is actually a custom image that we use in our CI. Here it is replaced with the same buildkit image, since that image includes buildctl. The buildkitd container itself is unchanged.

When we run buildctl on the runner, we get the following error:

time="2024-02-19T12:34:34Z" level=warning msg="failed to compute blob by overlay differ (ok=false): failed to write compressed diff: mount callback failed on /run/user/1000/containerd-mount387897202: mount callback failed on /run/user/1000/containerd-mount1737412574: failed to record upperdir changes (close error: failed to close tar writer: context canceled): context canceled"
time="2024-02-19T12:34:34Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = failed to mount /run/user/1000/containerd-mount3852074643: operation not permitted"

Bottlerocket is configured with:

    [settings.kernel.sysctl]
    "user.max_user_namespaces" = "63359"

I would really appreciate your help in identifying the missing piece to get this to work. Thank you.

AkihiroSuda commented 8 months ago

Does it work if you specify securityContext.privileged ?

AkihiroSuda commented 8 months ago

Does https://raw.githubusercontent.com/moby/buildkit/master/examples/kubernetes/job.rootless.yaml work?

AhmadMS1988 commented 8 months ago

We do not want to run it in privileged mode.

AkihiroSuda commented 8 months ago

We do not want to run it in privileged mode.

Asking for diagnostic purposes.

AhmadMS1988 commented 8 months ago

It actually worked in privileged mode, but the goal is still to run it without privileged mode.

AhmadMS1988 commented 8 months ago

One question comes to mind: since we use the OCI worker by default, what is this containerd mount?

AkihiroSuda commented 8 months ago

One question comes to mind: since we use the OCI worker by default, what is this containerd mount?

OCI mode still consumes containerd as a library

AhmadMS1988 commented 8 months ago

Are there any logs or commands that I can run to help investigate further?

AkihiroSuda commented 8 months ago

Are there any logs or commands that I can run to help investigate further?

cat /proc/mounts in the buildkitd container, and compare the result with Ubuntu nodes, etc.
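One way to do that comparison, as a rough sketch: save the /proc/mounts output from each node and diff the option sets per mount point. The function names below are illustrative only. The Bottlerocket sample line is copied from this thread, while the `OTHER_NODE` line is a made-up example of what a node without the hardening flags might show. Note that SELinux `context="..."` options contain commas inside the quotes, so a plain split on commas is not enough.

```python
from __future__ import annotations


def split_options(opts: str) -> list[str]:
    """Split a mount option string on commas, ignoring commas inside double quotes."""
    parts, current, quoted = [], [], False
    for ch in opts:
        if ch == '"':
            quoted = not quoted
            current.append(ch)
        elif ch == "," and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def parse_mounts(text: str) -> dict[str, set[str]]:
    """Map each mount point to its set of options (a later entry replaces an earlier one)."""
    table: dict[str, set[str]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, opts = fields[1], fields[3]
        table[mount_point] = set(split_options(opts))
    return table


def diff_mounts(a: str, b: str) -> dict[str, tuple[set[str], set[str]]]:
    """For mount points present in both dumps, return (options only in a, options only in b)."""
    table_a, table_b = parse_mounts(a), parse_mounts(b)
    result = {}
    for mount_point in sorted(table_a.keys() & table_b.keys()):
        only_a = table_a[mount_point] - table_b[mount_point]
        only_b = table_b[mount_point] - table_a[mount_point]
        if only_a or only_b:
            result[mount_point] = (only_a, only_b)
    return result


# Line taken from the Bottlerocket output in this thread.
BOTTLEROCKET = (
    "/dev/nvme1n1p1 /home/user/.local/share/buildkit xfs "
    "rw,seclabel,nosuid,nodev,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0"
)
# Hypothetical line for another node, for illustration only.
OTHER_NODE = (
    "/dev/nvme0n1p1 /home/user/.local/share/buildkit xfs "
    "rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0"
)
```

Calling `diff_mounts(BOTTLEROCKET, OTHER_NODE)` reports `seclabel`, `nosuid`, and `nodev` as present only on the first node for the buildkit volume, which is the kind of difference worth looking for.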

AhmadMS1988 commented 8 months ago

It worked as expected on both Amazon Linux 2 and the Ubuntu 20.04-based EKS optimized images. The output of /proc/mounts is:

overlay / overlay rw,context="system_u:object_r:data_t:s0:c208,c287",relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/71/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/59/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/55/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/50/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/45/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/40/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/425/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/425/work 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,context="system_u:object_r:data_t:s0:c208,c287",nosuid,size=65536k,mode=755 0 0
devpts /dev/pts devpts rw,context="system_u:object_r:data_t:s0:c208,c287",nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
mqueue /dev/mqueue mqueue rw,seclabel,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup cgroup2 ro,seclabel,nosuid,nodev,noexec,relatime 0 0
/dev/nvme1n1p1 /etc/hosts xfs rw,seclabel,nosuid,nodev,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/nvme1n1p1 /dev/termination-log xfs rw,seclabel,nosuid,nodev,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/nvme1n1p1 /etc/hostname xfs rw,seclabel,nosuid,nodev,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/nvme1n1p1 /etc/resolv.conf xfs rw,seclabel,nosuid,nodev,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
shm /dev/shm tmpfs rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k 0 0
/dev/nvme1n1p1 /home/user/.local/share/buildkit xfs rw,seclabel,nosuid,nodev,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
tmpfs /run/secrets/kubernetes.io/serviceaccount tmpfs ro,seclabel,relatime,size=6931992k 0 0
proc /proc/bus proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/fs proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/irq proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /proc/acpi tmpfs ro,context="system_u:object_r:data_t:s0:c208,c287",relatime 0 0
tmpfs /proc/kcore tmpfs rw,context="system_u:object_r:data_t:s0:c208,c287",nosuid,size=65536k,mode=755 0 0
tmpfs /proc/keys tmpfs rw,context="system_u:object_r:data_t:s0:c208,c287",nosuid,size=65536k,mode=755 0 0
tmpfs /proc/latency_stats tmpfs rw,context="system_u:object_r:data_t:s0:c208,c287",nosuid,size=65536k,mode=755 0 0
tmpfs /proc/timer_list tmpfs rw,context="system_u:object_r:data_t:s0:c208,c287",nosuid,size=65536k,mode=755 0 0
tmpfs /proc/scsi tmpfs ro,context="system_u:object_r:data_t:s0:c208,c287",relatime 0 0
tmpfs /sys/firmware tmpfs ro,context="system_u:object_r:data_t:s0:c208,c287",relatime 0 0

Can you please take a look and provide feedback so I can open a ticket to Bottlerocket team with the details? Thanks

AkihiroSuda commented 8 months ago

Seems relevant to SELinux? Does this work?

securityContext:
  seLinuxOptions:
    level: s0
    type: spc_t

AhmadMS1988 commented 8 months ago

Unfortunately, it did not work. I got the same error.

bcressey commented 5 months ago

As far as I can tell this is the same error that was fixed in #3697, but at a different stage in the process.

Running mountsnoop from bcc, I can see that the initial set of bind mounts go OK:

buildkitd        210370  210738  4026533418  mount("/home/user/.local/share/buildkit/runc-overlayfs/snapshots/snapshots/2/fs", "/home/user/.local/tmp/buildkit-mount276192057", "bind", MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOATIME|MS_BIND|MS_REC, "") = 0
buildkitd        210370  210738  4026533418  mount("", "/home/user/.local/tmp/buildkit-mount276192057", "", MS_RDONLY|MS_NOSUID|MS_NODEV|MS_REMOUNT|MS_NOATIME|MS_BIND|MS_REC, "") = 0
...

However, the operation ultimately fails in the call to overlay.WriteUpperdir:

2024-05-06T01:07:57.845201317Z stderr F time="2024-05-06T01:07:57Z" level=warning msg="failed to compute blob by overlay differ (ok=false): failed to write compressed diff: failed to mount /home/user/.local/tmp/containerd-mount1074778686: operation not permitted" span="export layers" spanID=0f5a00d506b35262 traceID=32ade31627d6b338d5e3051b59dea3e2

From the related mountsnoop output, we can see that the nosuid and nodev flags were not passed:

buildkitd        210370  210739  4026533418  mount("/home/user/.local/share/buildkit/runc-overlayfs/snapshots/snapshots/4/fs", "/home/user/.local/tmp/containerd-mount1074778686", "bind", MS_RDONLY|MS_BIND|MS_REC, "") = 0
buildkitd        210370  210739  4026533418  mount("", "/home/user/.local/tmp/containerd-mount1074778686", "", MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC, "") = -EPERM

overlay.WriteUpperdir calls into mount.WithTempMount, which uses the containerd mount library. It looks like we end up here and then the remount fails because it doesn't have the equivalent of the UnprivilegedMountFlags logic.
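To illustrate the idea (this is only a sketch, not the containerd or buildkit code): in an unprivileged user namespace, a remount has to keep the flags that are already set on the underlying mount, so the requested flags need to be OR-ed with them. The names `LOCKED_OPTIONS`, `preserve_locked_flags`, and `flag_names` are hypothetical, and the set of options treated as locked here is an assumption. The numeric values are the Linux `MS_*` mount flag constants.

```python
# Linux mount(2) flag values.
MS_RDONLY = 0x1
MS_NOSUID = 0x2
MS_NODEV = 0x4
MS_NOEXEC = 0x8
MS_REMOUNT = 0x20
MS_NOATIME = 0x400
MS_NODIRATIME = 0x800
MS_BIND = 0x1000
MS_REC = 0x4000
MS_RELATIME = 0x200000

# Assumed set: /proc/mounts option names that an unprivileged remount must keep.
LOCKED_OPTIONS = {
    "ro": MS_RDONLY,
    "nosuid": MS_NOSUID,
    "nodev": MS_NODEV,
    "noexec": MS_NOEXEC,
    "noatime": MS_NOATIME,
    "nodiratime": MS_NODIRATIME,
    "relatime": MS_RELATIME,
}

# Flag names in ascending bit order, used only for printing.
FLAG_NAMES = [
    ("MS_RDONLY", MS_RDONLY),
    ("MS_NOSUID", MS_NOSUID),
    ("MS_NODEV", MS_NODEV),
    ("MS_NOEXEC", MS_NOEXEC),
    ("MS_REMOUNT", MS_REMOUNT),
    ("MS_NOATIME", MS_NOATIME),
    ("MS_NODIRATIME", MS_NODIRATIME),
    ("MS_BIND", MS_BIND),
    ("MS_REC", MS_REC),
    ("MS_RELATIME", MS_RELATIME),
]


def preserve_locked_flags(requested: int, existing_options: set) -> int:
    """Return the requested flags plus every locked flag already set on the mount."""
    flags = requested
    for name, bit in LOCKED_OPTIONS.items():
        if name in existing_options:
            flags |= bit
    return flags


def flag_names(flags: int) -> str:
    """Render a flag mask as a '|'-joined list of MS_* names."""
    return "|".join(name for name, bit in FLAG_NAMES if flags & bit)


# The failing remount from the trace above, and the options of the xfs volume.
requested = MS_RDONLY | MS_REMOUNT | MS_BIND | MS_REC
volume_options = {"rw", "seclabel", "nosuid", "nodev", "noatime"}
fixed = preserve_locked_flags(requested, volume_options)
print(flag_names(fixed))
```

With the volume's options applied, the mask becomes the same set of flags as in the remount that succeeded in the first mountsnoop excerpt.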

AkihiroSuda commented 5 months ago

overlay.WriteUpperdir calls into mount.WithTempMount, which uses the containerd mount library. It looks like we end up here and then the remount fails because it doesn't have the equivalent of the UnprivilegedMountFlags logic.

@bcressey Thanks for analysis. Would you be interested in submitting a PR?

swagatbora90 commented 5 months ago

@bcressey if okay, I can work on a fix for this.

bcressey commented 5 months ago

@swagatbora90 that'd be great! Let me know if I can help advise on setting up a test environment, or testing out a change when ready.

vtgspk commented 5 months ago

As mentioned by @bcressey , Bottlerocket mounts its local storage with “nosuid” and “nodev” flags as a hardening step, and those flags are among those that have to be passed in subsequent bind mounts.

Here is a workaround that uses a persistent volume (EBS CSI driver in EKS) instead of an emptyDir, since emptyDir is backed by Bottlerocket's local storage.

Pod: I set fsGroup to 1000 so that the volume is mounted in the pod with access for user 1000 and group 1000.

Pod yaml - https://github.com/vtgspk/buildkit-rootless/blob/main/pod.yml
Persistent Volume Claim - https://github.com/vtgspk/buildkit-rootless/blob/main/persistent-claim.yml
Storage class - https://github.com/vtgspk/buildkit-rootless/blob/main/storage-class.yml

This way, I am able to get the buildkitd pod up and running and build images successfully in it, using the EBS mount instead of Bottlerocket's local storage.

swagatbora90 commented 5 months ago

@bcressey @AkihiroSuda I added a PR to check and preserve unprivileged flags before we remount a bind mount as read-only. However, that change alone was not sufficient. I also had to update the above pod spec to mount the tmp directory from the host.

pod.spec


apiVersion: v1
kind: Pod
metadata:
  name: buildkitd
spec:
  containers:
    - name: buildkitd
      image: public.ecr.aws/e5v3s6y4/buildkit-rootless:rootless
      args:
        - --addr
        - tcp://0.0.0.0:1234
        - --oci-worker-no-process-sandbox
        - --debug
      securityContext:
        seccompProfile:
          type: Unconfined
        runAsUser: 1000
        runAsGroup: 1000
      volumeMounts:
        # The first mount is not needed, but makes it explicit that there
        # is a VOLUME here which shows up as a separate mount, which is why
        # buildkit is able to find the unprivileged mount flags it needs to
        # preserve.
        - mountPath: /home/user/.local/share/buildkit
          name: buildkitd-1
        # The second mount is needed, because otherwise there's no explicit
        # mount to inspect for mount options, and the underlying filesystem's
        # mount flags are obscured by the overlayfs used for the container's
        # rootfs.
        - mountPath: /home/user/.local/tmp
          name: buildkitd-2
      env:
        # This is required to align the temporary directory created by buildkit
        # with the volume mount for that directory.
        - name: XDG_RUNTIME_DIR
          value: /home/user/.local/tmp
    - name: runner
      image: moby/buildkit:rootless
      command: [ "/bin/sh", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      env:
        - name: BUILDKIT_HOST
          value: tcp://localhost:1234
  volumes:
    - name: buildkitd-1
      emptyDir: {}
    - name: buildkitd-2
      emptyDir: {}

Exposing the tmp dir as a bind mount in the container is required. Otherwise the directory is just in the container root and its actual mount flags are obscured by overlayfs, so the check for unprivileged flags no longer works. In order to make this work we need both: 1) an update to the containerd mount library to preserve the nosuid and nodev flags, and 2) a pod spec update to bind mount the tmp dir.
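A small sketch of why the extra volume matters, modelled on /proc/mounts text rather than on the real check (which may inspect the filesystem differently, so treat this as an approximation). `find_mount` is an illustrative helper that returns the deepest mount containing a path. The overlay line is abbreviated from the output earlier in this thread, and the `/home/user/.local/tmp` line is an assumption that the second emptyDir shows up with the same options as the first.

```python
from __future__ import annotations


def find_mount(mounts_text: str, path: str) -> tuple[str, str, str] | None:
    """Return (mount_point, fstype, options) for the deepest mount containing path."""
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, opts = fields[1], fields[2], fields[3]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount stacked on the same point win.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fstype, opts)
    return best


XFS_OPTS = "rw,seclabel,nosuid,nodev,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota"

# Original pod spec: only the buildkit data dir is a volume (overlay line abbreviated).
WITHOUT_TMP_VOLUME = "\n".join([
    "overlay / overlay rw,relatime 0 0",
    f"/dev/nvme1n1p1 /home/user/.local/share/buildkit xfs {XFS_OPTS} 0 0",
])

# Updated pod spec: assumed extra line for the second emptyDir.
WITH_TMP_VOLUME = "\n".join([
    WITHOUT_TMP_VOLUME,
    f"/dev/nvme1n1p1 /home/user/.local/tmp xfs {XFS_OPTS} 0 0",
])

tmp_path = "/home/user/.local/tmp/containerd-mount1074778686"
before = find_mount(WITHOUT_TMP_VOLUME, tmp_path)
after = find_mount(WITH_TMP_VOLUME, tmp_path)
```

In this model, `before` resolves to the overlay root, whose options carry no `nosuid` or `nodev`, while `after` resolves to the xfs volume where those flags are visible and can be preserved on remount.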

Let me know if this makes sense. I am also wondering if we no longer need #3697 since we are already checking for the flags downstream in containerd. I will test this out next.

xmanwms95 commented 4 months ago

Updating to buildkit v0.14.0 resolves the "failed to mount" issue when using the rootless configuration on Bottlerocket.