moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0
8.21k stars 1.16k forks source link

Invalid layer content in rootless mode #2381

Open lbernail opened 3 years ago

lbernail commented 3 years ago

Hello,

We are starting to run buildkit in rootless mode and in some cases, we seem to be having issues with the snapshotter. For instance, with this Dockerfile

FROM golang:latest as builder

# https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/commit/93915e330be4291f209b13dc6df063729bd4d9c4
ENV TAG "93915e330be4291f209b13dc6df063729bd4d9c4"
ENV GOOS "linux"
ENV CGO_ENABLED "0"

WORKDIR src/sigs.k8s.io/sig-storage-local-static-provisioner

RUN git clone https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner.git .  && ls -l vendor/k8s.io/utils/io/ && find . | wc -l
RUN git checkout $TAG && ls -l vendor/k8s.io/utils/io/ && find | wc -l

RUN echo "Checking content" && ls -l vendor/k8s.io/utils/io/ && find | wc -l

We get this with buildkit in rootless mode:

#6 [3/5] RUN git clone https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner.git .  && ls -l vendor/k8s.io/utils/io/ && find . | wc -l
#6 sha256:aff0ee315ab635aa458b8e553f083158f525c06590d39b21212716100211e4fa
#6 0.073 Cloning into '.'...
#6 3.166 total 8
#6 3.166 -rw-r--r-- 1 root root  134 Sep 25 09:58 README.md
#6 3.166 -rw-r--r-- 1 root root 2982 Sep 25 09:58 read.go
#6 3.195 6787
#6 DONE 3.4s

#7 [4/5] RUN git checkout 93915e330be4291f209b13dc6df063729bd4d9c4 && ls -l vendor/k8s.io/utils/io/ && find . | wc -l
#7 sha256:712f0ac1353ca45de25e50eb4e17134c9865712a83f756c893ee1fde0750e0c8
#7 1.068 Note: switching to '93915e330be4291f209b13dc6df063729bd4d9c4'.
#7 1.072 total 4
#7 1.072 -rw-r--r-- 1 root root 1748 Sep 25 09:58 consistentread.go
#7 1.106 5573
#7 DONE 1.3s

#8 [5/5] RUN echo "Checking content" && ls -l vendor/k8s.io/utils/io/ && find . | wc -l
#8 sha256:600ad17eb12cb7bd7a2f3154b8eec4ce13dd60bd40b2cabf4b95320e00fd45df
#8 0.120 Checking content
#8 0.121 total 12
#8 0.121 -rw-r--r-- 1 root root  134 Sep 25 09:58 README.md
#8 0.121 -rw-r--r-- 1 root root 1748 Sep 25 09:58 consistentread.go
#8 0.121 -rw-r--r-- 1 root root 2982 Sep 25 09:58 read.go
#8 0.157 5605
#8 DONE 0.2s

Of course we'd expect content of the layer at step #8 to be the same as the content at the end of step #7

We tested building in non-rootless mode and the problem disappeared. After looking into past issues, we found https://github.com/moby/buildkit/issues/1792 which felt a bit similar, so we tried in rootless mode with --oci-worker-snapshotter=native and the problem also disappeared.

So the problem only happens in rootless mode with overlayfs snapshotter. Here is the configuration where we see this behavior:

Is there a way to diagnose what is failing when we use the overlayfs snapshotter? I could not find the impact of using the native snapshotter instead of the overlayfs one. I assume slower build?

Of course, we are more than happy to perform additional tests to gather more data

tonistiigi commented 3 years ago

@AkihiroSuda I guess this is the fuse-overlay? Seems it doesn't invalidate the fuse cache or writeback is not implemented correctly.

edit: or is it the opaque handling? I initially looked that files where not there after clone but the issue seems to be that layer removes files and adds some new ones but in the next layer both appear.

tonistiigi commented 3 years ago

@giuseppe Is this something known/tracked for fuse-overlayfs?

giuseppe commented 3 years ago

It could be fuse-overlayfs.

What version of fuse-overlayfs is it?

What is the underlying file system? Is it running on top of an another overlay mount?

tonistiigi commented 3 years ago

What version of fuse-overlayfs is it?

docker run -it --rm --entrypoint apk moby/buildkit:v0.9.0-rootless info -v | grep fuse-overlayfs
fuse-overlayfs-1.6-r0
giuseppe commented 3 years ago

I am not sure I've followed the correct steps, but I've tried a simple docker build --no-cache . as rootless on Ubuntu 20.04.2 with kernel 5.4.0-73-generic with the Dockerfile above, and this is what I get:

...

Step 7/8 : RUN git checkout $TAG && ls -l vendor/k8s.io/utils/io/ && find | wc -l
 ---> Running in d62a84ba8eb9
Note: switching to '93915e330be4291f209b13dc6df063729bd4d9c4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 93915e33 Merge pull request #135 from DataDog/JulienBalestra/readiness
total 4
-rw-r--r-- 1 root root 1748 Oct  5 08:52 consistentread.go
5573
Removing intermediate container d62a84ba8eb9
 ---> c1831ee2992d
Step 8/8 : RUN echo "Checking content" && ls -l vendor/k8s.io/utils/io/ && find | wc -l
 ---> Running in 713d365767c0
Checking content
total 12
-rw-r--r-- 1 root root  134 Oct  5 08:52 README.md
-rw-r--r-- 1 root root 1748 Oct  5 08:52 consistentread.go
-rw-r--r-- 1 root root 2982 Oct  5 08:52 read.go
5605
Removing intermediate container 713d365767c0
 ---> 175956894c15
Successfully built 175956894c15

fuse-overlays is not even installed on the system.

lbernail commented 3 years ago

@giuseppe we run the rootless buildkitd in a kubernetes pod with no specific mounts so the underlying fs for builds is the root fs of the container (overlay) We will try using an ephemeral volume to see if using ext4 as the underlying fs solves the issue

lbernail commented 3 years ago

@AkihiroSuda moving to the native snapshotter solved all our problems with invalid layers but we took a huge hit in terms of performance (io throughput is much higher). I assume that's expected and that it is the only impact we should see from using this snapshotter?

lbernail commented 3 years ago

We checked on our side and we are not using fuse-overlay (we don't load the fuse module and we don't run the k8s-hostdev-plugin plugin: https://github.com/moby/buildkit/pull/1384) So it seems the problem is really overlay in rootless mode (which confirms what you saw @giuseppe ) We will try with a more recent kernel next week (so far we have tested on 5.8 and we will test on 5.11) and we will keep you updated

lbernail commented 3 years ago

I'm not 100% sure but I think I may have found the (or a) problem.

I simulated what happens directly on a linux host:

mkdir lower merged upper work
git clone https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner.git lower
sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
cd merged/ ; git checkout 93915e330be4291f209b13dc6df063729bd4d9c4 ; cd ..

At this point, lower has:

ls -l lower/vendor/k8s.io/utils/io
total 8
-rw-rw-r-- 1 ddeng ddeng 2982 Oct 10 15:08 read.go
-rw-rw-r-- 1 ddeng ddeng  134 Oct 10 15:08 README.md

And here is what we get in merged:

ls -l merged/vendor/k8s.io/utils/io
total 4
-rw-rw-r-- 1 ddeng ddeng 1748 Oct 10 15:09 consistentread.go

Which is the expected content. What about upper?

ls -l upper/vendor/k8s.io/utils/io
total 4
-rw-rw-r-- 1 ddeng ddeng 1748 Oct 10 15:09 consistentread.go

No whiteouts, let's look at the directory:

sudo getfattr -R -d -m "" upper/vendor/k8s.io/utils/io
# file: upper/vendor/k8s.io/utils/io
trusted.overlay.opaque="y"

=> Directory is marked as opaque and the lower directory is ignored

Let's do the same in a user namespace, starting from scratch:

unshare -m -p -f -U -r bash
mkdir lower merged upper work
git clone https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner.git lower
mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
cd merged/ ; git checkout 93915e330be4291f209b13dc6df063729bd4d9c4 ; cd ..

At this point everything is the same, but

getfattr -R -d -m "" upper/vendor/k8s.io/utils/io

returns nothing in the userns. The directory has the right xattr in the host user namespace (trusted.overlay.opaque="y"), but we don't have the info in the userns so we can't add it when we snapshot.

So, I think the problem is with the extended attribute: given it's part of the Trusted namespace we can't see it in a user namespace. I suspect this only impacts opaque directories and we can likely create a much simpler reproducer.

lbernail commented 3 years ago

There is a solution in 5.11: https://github.com/torvalds/linux/commit/2d2f2d7322ff43e0fe92bf8cccdc0b09449bf2e1

And it looks promising:

unshare -m -p -f -U -r bash
mkdir lower merged upper work
git clone https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner.git lower
mount -t overlay overlay -o userxattr -olowerdir=lower,upperdir=upper,workdir=work merged
cd merged/ ; git checkout 93915e330be4291f209b13dc6df063729bd4d9c4 ; cd ..

Same as before, with the additional mount option from 5.11: -o userxattr

And as a result:

getfattr -R -d -m "" upper/vendor/k8s.io/utils/io
# file: upper/vendor/k8s.io/utils/io
user.overlay.opaque="y"

Now I wonder if this will work with the overlay snapshotter because the new mount option is explicit and only work on 5.11+ kernels. We'll update our buildkit platform to 5.11 and will let you know what we find.

lbernail commented 3 years ago

I'm not sure what happens between 2 RUN commands in buildkit. Are we archiving the diff into a tarball before processing to next one? (this seems to rely on github.com/docker/docker/pkg/archive )

Slightly related: https://github.com/moby/buildkit/pull/2181 computes diff from the overlayfs upper dir (which requires trusted.overlay.opaque or user.overlay.opaque). Before this PR, diff relied on walking differ (which is slow but should always be right I think). Once this PR goes into a release it may break some setups (kernel < 5.11). It does not seem to be our problem here because 0.9.0 still uses the walking differ (hence my first question on archive)

tonistiigi commented 3 years ago

Are we archiving the diff into a tarball before processing to next one?

No. Just the previous upper becomes an additional lower.

@AkihiroSuda If the issue is that overlay in rootless mode doesn't work correctly for <5.11 kernels (at least for ubuntu) we should try to detect it and run fuse for these as well. cc @ktock for the overlay differ part

lbernail commented 3 years ago

No. Just the previous upper becomes an additional lower.

This means we do not go through the Differ or Archiver ? Looks like I have another code path to explore. So we are unmounting and remounting using upper as an additional lower directly? (I wonder because I feel this should work, unless we do a copy, which may lead to loss of the xattr). I can easily check this on a 5.8 host

tonistiigi commented 3 years ago

Differ only runs at the end of the build if an image result was needed.

lbernail commented 3 years ago

OK I confirm this is the problem on 5.8:

unshare -m -p -f -U -r bash
mkdir lower merged upper work
git clone https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner.git lower
mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
cd merged/ ; git checkout 93915e330be4291f209b13dc6df063729bd4d9c4 ; cd ..

At this point, merged has the right content and xattr does not show in userns:

ls merged/vendor/k8s.io/utils/io
consistentread.go

getfattr -R -d -m "" upper/vendor/k8s.io/utils/io
=> nothing

If we proceed with a new mount including upper as an additional layer:

umount merged
mkdir upper2
mount -t overlay overlay -olowerdir=upper:lower,upperdir=upper2,workdir=work merged

the content becomes wrong, showing the xattr is ignored for the second mount:

ls merged/vendor/k8s.io/utils/io
consistentread.go  read.go  README.md

On 5.11, this works completely fine (as long as -o userxattr is added to mount options)

lbernail commented 3 years ago

I confirm that the problem disappears with Ubuntu 20.04.3 and kernel 5.11 , which is great news 🎉 I agree with @tonistiigi I think we should not use overlayfs in kernel <5.11 even on Ubuntu: it will mostly work but will have weird edge cases as the one we encountered

lbernail commented 3 years ago

I just noticed we have a lot of

[22995.791577] overlayfs: lowerdir is in-use as upperdir/workdir of another mount, accessing files from both mounts will result in undefined behavior.

in our kernel logs. I assume we don't unmount overlayfs between the different mount operations used for RUN statements? It's likely just a warning and this does not impact our builds