Why Runc do bind mount config.rootfs to config.rootfs while it create container ?

Marshalzxy commented 2 years ago

In rootfs_linux.go prepareRoot(), Runc bind mount config.Rootfs to config.Rootfs . Why this step is necessary? In my centos 8.2, while graphdriver is overlayfs2, it seem that this step can be skipped without any consequences.Can we optimize this step? https://github.com/opencontainers/runc/blob/51e607f2cdd8ccf7f39f29aab531d674e487b311/libcontainer/rootfs_linux.go#L785 Here is the code: func prepareRoot(config *configs.Config) error { flag := unix.MS_SLAVE | unix.MS_REC if config.RootPropagation != 0 { flag = config.RootPropagation } if err := unix.Mount("", "/", "", uintptr(flag), ""); err != nil { return err } if err := rootfsParentMountPrivate(config.Rootfs); err != nil { return err } return unix.Mount(config.Rootfs, config.Rootfs, "bind", unix.MS_BIND|unix.MS_REC, "") }

kolyshkin commented 2 years ago

I am not quite sure but maybe this mount is not needed if config.Rootfs is already a mount point.

If the above is true, we can make it optional. Problem is, detection of whether the path is a mount point is a costly operation itself, for kernels that lack openat2(2) support (that is, Linux < 5.6) detecting a bind mount requires parsing /proc/mountinfo.

What is your use case, and what is the benefit from making this mount optional?

kolyshkin commented 2 years ago

Problem is, detection of whether the path is a mount point is a costly operation itself, for kernels that lack openat2(2) support (that is, Linux < 5.6) detecting a bind mount requires parsing /proc/mountinfo.

OTOH we already parse mountinfo anyway in rootfsParentMountPrivate...

kolyshkin commented 2 years ago

The following patch can prevent the bind mount if it's already a mount point:

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 5290a45e..c123547e 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -738,15 +738,18 @@ func getParentMount(rootfs string) (string, string, error) {
    return mi[idx].Mountpoint, mi[idx].Optional, nil
 }

-// Make parent mount private if it was shared
-func rootfsParentMountPrivate(rootfs string) error {
+// Make parent mount private if it was shared. This also returns
+// a flag whether rootfs itself is a mount point.
+func rootfsParentMountPrivate(rootfs string) (bool, error) {
    sharedMount := false

    parentMount, optionalOpts, err := getParentMount(rootfs)
    if err != nil {
-       return err
+       return false, err
    }

+   isMount := parentMount == rootfs
+
    optsSplit := strings.Split(optionalOpts, " ")
    for _, opt := range optsSplit {
        if strings.HasPrefix(opt, "shared:") {
@@ -760,10 +763,10 @@ func rootfsParentMountPrivate(rootfs string) error {
    // shared. Secondly when we bind mount rootfs it will propagate to
    // parent namespace and we don't want that to happen.
    if sharedMount {
-       return mount("", parentMount, "", "", unix.MS_PRIVATE, "")
+       return isMount, mount("", parentMount, "", "", unix.MS_PRIVATE, "")
    }

-   return nil
+   return isMount, nil
 }

 func prepareRoot(config *configs.Config) error {
@@ -778,10 +781,16 @@ func prepareRoot(config *configs.Config) error {
    // Make parent mount private to make sure following bind mount does
    // not propagate in other namespaces. Also it will help with kernel
    // check pass in pivot_root. (IS_SHARED(new_mnt->mnt_parent))
-   if err := rootfsParentMountPrivate(config.Rootfs); err != nil {
+   isMount, err := rootfsParentMountPrivate(config.Rootfs)
+   if err != nil {
        return err
    }

+   if isMount {
+       // config.Rootfs is already a mount point.
+       return nil
+   }
+
    return mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "")
 }

Feel free to test it and report your findings here.

Marshalzxy commented 2 years ago

I am not quite sure but maybe this mount is not needed if config.Rootfs is already a mount point.

If the above is true, we can make it optional. Problem is, detection of whether the path is a mount point is a costly operation itself, for kernels that lack openat2(2) support (that is, Linux < 5.6) detecting a bind mount requires parsing /proc/mountinfo.

What is your use case, and what is the benefit from making this mount optional?

Thanks to reply my issue, it is not my special use case .I read these codes of Runc preparing rootfs, and this duplicated bind mounted made me confused. So i think maybe we can do some optimize this step. @kolyshkin

kolyshkin commented 2 years ago

i think maybe we can do some optimize this step.

This is what the above patch does. I am not convinced though whether we should have it or not. Surely, we can omit this mount, and thus save a mount(2) syscall. OTOH this makes the code more complicated, and I am not quite sure it is worth the savings.

What do you think @Marshalzxy ?

Marshalzxy commented 2 years ago

i think maybe we can do some optimize this step.

This is what the above patch does. I am not convinced though whether we should have it or not. Surely, we can omit this mount, and thus save a mount(2) syscall. OTOH this makes the code more complicated, and I am not quite sure it is worth the savings.

What do you think @Marshalzxy ?

It's indeed more complicated. At lease we should add some comments above return mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "") To explaint under some circumstances this mount is redundant.

kolyshkin commented 2 years ago

It's indeed more complicated. At lease we should add some comments above return mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "") To explaint under some circumstances this mount is redundant.

I think I do have a comment, right before the return, which explains the case.

My question was different. Do you think that adding this (thus making the code code complicated) is worth the savings we get in return (skipped bind mount)? I am genuinely not sure about this, and that is why I'm asking.

Marshalzxy commented 2 years ago

It's indeed more complicated. At lease we should add some comments above return mount(config.Rootfs, config.Rootfs, "", "bind", unix.MS_BIND|unix.MS_REC, "") To explaint under some circumstances this mount is redundant.

I think I do have a comment, right before the return, which explains the case.

My question was different. Do you think that adding this (thus making the code code complicated) is worth the savings we get in return (skipped bind mount)? I am genuinely not sure about this, and that is why I'm asking.

After your patch, these codes are little more complicated.However it indeed saves hundreds of kernel instuctions and context switch when system call is called. So I do think we could accecpt your patch.

kamizjw commented 1 year ago

I think setting slave mode here instead of using bind mode is not wanting to affect the host rootfs

anurnomeru commented 1 year ago

i have the same question.... it's confused

chenk008 commented 10 months ago

https://man7.org/linux/man-pages/man2/mount.2.html First it mounts / as slave with MS_REC, the rootfs mount becomes slave.
Second it bind mounts rootfs with MS_REC, all submounts under the rootfs subtree are also bind mounted. Maybe it help to ignore unbindable mountpoint?

opencontainers / runc

Why Runc do bind mount config.rootfs to config.rootfs while it create container ? #3417