opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

Cannot create user namespaced container without network namespaces #799

Open cyphar opened 8 years ago

cyphar commented 8 years ago

I discovered this while working on rootless containers. It looks like there are some issues with a setup that has no network namespace. This is also blocking rootless containers from having networking (since we need to just use host networking).

% sudo runc start test
rootfs_linux.go:53: mounting "/sys" to rootfs "/home/cyphar/src/runc/rootfs" caused "operation not permitted"

Here's the config, but the important thing to note is that I've added some dummy user namespace setup and removed the network section from namespaces.

{
    "ociVersion": "0.6.0-dev",
    "platform": {
        "os": "linux",
        "arch": "amd64"
    },
    "process": {
        "terminal": true,
        "user": {},
        "args": [
            "sh"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "TERM=xterm"
        ],
        "cwd": "/",
        "capabilities": [
            "CAP_AUDIT_WRITE",
            "CAP_KILL",
            "CAP_NET_BIND_SERVICE"
        ],
        "rlimits": [
            {
                "type": "RLIMIT_NOFILE",
                "hard": 1024,
                "soft": 1024
            }
        ],
        "noNewPrivileges": true
    },
    "root": {
        "path": "rootfs",
        "readonly": true
    },
    "hostname": "runc",
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc"
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/pts",
            "type": "devpts",
            "source": "devpts",
            "options": [
                "nosuid",
                "noexec",
                "newinstance",
                "ptmxmode=0666",
                "mode=0620",
                "gid=5"
            ]
        },
        {
            "destination": "/dev/shm",
            "type": "tmpfs",
            "source": "shm",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "mode=1777",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/mqueue",
            "type": "mqueue",
            "source": "mqueue",
            "options": [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "ro"
            ]
        },
        {
            "destination": "/sys/fs/cgroup",
            "type": "cgroup",
            "source": "cgroup",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "relatime",
                "ro"
            ]
        }
    ],
    "hooks": {},
    "linux": {
        "resources": {
            "devices": [
                {
                    "allow": false,
                    "access": "rwm"
                }
            ]
        },
        "uidMappings": [
            {
                "hostID": 1000,
                "containerID": 0,
                "size": 100
            }
        ],
        "gidMappings": [
            {
                "hostID": 1000,
                "containerID": 0,
                "size": 100
            }
        ],
        "namespaces": [
            {
                "type": "user"
            },
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            }
        ],
        "maskedPaths": [
            "/proc/kcore",
            "/proc/latency_stats",
            "/proc/timer_stats",
            "/proc/sched_debug"
        ],
        "readonlyPaths": [
            "/proc/asound",
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ]
    }
}

Blocking #774.

dqminh commented 8 years ago

This is a known restriction in the kernel: you can't mount sysfs without CAP_SYS_ADMIN rights. Removing the sysfs mount should allow you to start the container.
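As a rough illustration of that workaround, the sketch below strips every mount of type "sysfs" from an already-parsed config.json. The helper name `drop_sysfs_mounts` is hypothetical and is not part of runc. It only touches entries whose type is "sysfs", so mounts nested under /sys (such as the cgroup mount in the config above) are left alone and may need separate attention.

```python
import copy


def drop_sysfs_mounts(config):
    """Return a copy of an OCI config dict with all "sysfs" mounts removed.

    Hypothetical helper for illustration; the input dict is not modified.
    """
    trimmed = copy.deepcopy(config)
    trimmed["mounts"] = [
        mount for mount in trimmed.get("mounts", [])
        if mount.get("type") != "sysfs"
    ]
    return trimmed


# Small stand-in for the "mounts" section of the config shown in this thread.
example_config = {
    "mounts": [
        {"destination": "/proc", "type": "proc", "source": "proc"},
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options": ["nosuid", "noexec", "nodev", "ro"],
        },
    ]
}

trimmed_config = drop_sysfs_mounts(example_config)
print([mount["destination"] for mount in trimmed_config["mounts"]])
```

In practice the dict would come from `json.load` on config.json and be written back with `json.dump`.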

I think the patch note is here:

Also discussed a bit in https://github.com/docker/docker/issues/21800

cyphar commented 8 years ago

@dqminh But we're using user namespaces, so we have CAP_SYS_ADMIN in the namespace. If you add the network namespace to the config, it works perfectly fine. I think it's a more nuanced problem (possibly how we're messing around with mount options in rootfs_linux).

dqminh commented 8 years ago

> But we're using user namespaces, so we have CAP_SYS_ADMIN in the namespace

That's not quite true, I think. You only have CAP_SYS_ADMIN in a net namespace created by the user, not when you join the net namespace of the host.

cyphar commented 8 years ago

Ah, you meant the user namespace that "owns" the net namespace. Okay, if that's the requirement for mounting all of /sys (which seems odd), we'll have to not mount sysfs. We should probably add this to the validator, so people don't run into this by accident.

I've removed sysfs from my config and that appears to work now. Unfortunately, it looks like I still don't have network access for some reason ...

/cc @davidlt

dqminh commented 8 years ago

> Unfortunately, it looks like I still don't have network access for some reason ...

Hmm, it should work (at least it did when I tested this a few weeks ago :p). What did you use to test network access? ping or anything else that needs CAP_NET_* will not work though.

cyphar commented 8 years ago

I was just using netcat. I've had enough bad experiences with capabilities to know better than to trust ping in containers. ;)

davidlt commented 8 years ago

Seems to work, at least yum makecache worked, but I am facing issues trying to install anything useful in the container, e.g.

Running transaction
  Installing : fipscheck-lib-1.4.1-5.el7.x86_64                                                                                                                                                                                                                             1/3
Error unpacking rpm package fipscheck-lib-1.4.1-5.el7.x86_64
error: unpacking of archive failed on file /usr/lib64/libfipscheck.so.1;5728b733: cpio: symlink
  Installing : fipscheck-1.4.1-5.el7.x86_64                                                                                                                                                                                                                                 2/3
Error unpacking rpm package fipscheck-1.4.1-5.el7.x86_64
error: fipscheck-lib-1.4.1-5.el7.x86_64: install failed
error: unpacking of archive failed on file /usr/bin/fipscheck;5728b733: cpio: open
error: fipscheck-1.4.1-5.el7.x86_64: install failed
groupadd: cannot open /etc/gshadow
  Installing : openssh-6.6.1p1-25.el7_2.x86_64                                                                                                                                                                                                                              3/3
Error unpacking rpm package openssh-6.6.1p1-25.el7_2.x86_64
error: unpacking of archive failed on file /usr/bin/ssh-keygen;5728b733: cpio: open

I guess I have to build an image with e.g. Docker and include the packages I want.

davidlt commented 8 years ago

Here is a better proof that it works. Is there a way to map /etc/resolv.conf from the host to the container?

[davidlt@pccms205 test2]$ cat /etc/redhat-release
Fedora release 24 (Twenty Four)
[davidlt@pccms205 test2]$ runc --root $PWD start test_cont
sh-4.2# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
sh-4.2# dig google.com

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.3 <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55412
;; flags: qr rd ra; QUERY: 1, ANSWER: 15, AUTHORITY: 4, ADDITIONAL: 5

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A

;; ANSWER SECTION:
google.com.             195     IN      A       195.112.88.178
google.com.             195     IN      A       195.112.88.177
google.com.             195     IN      A       195.112.88.185
google.com.             195     IN      A       195.112.88.179
google.com.             195     IN      A       195.112.88.184
google.com.             195     IN      A       195.112.88.180
google.com.             195     IN      A       195.112.88.187
google.com.             195     IN      A       195.112.88.181
google.com.             195     IN      A       195.112.88.188
google.com.             195     IN      A       195.112.88.189
google.com.             195     IN      A       195.112.88.183
google.com.             195     IN      A       195.112.88.175
google.com.             195     IN      A       195.112.88.176
google.com.             195     IN      A       195.112.88.182
google.com.             195     IN      A       195.112.88.186

;; AUTHORITY SECTION:
google.com.             59409   IN      NS      ns2.google.com.
google.com.             59409   IN      NS      ns4.google.com.
google.com.             59409   IN      NS      ns1.google.com.
google.com.             59409   IN      NS      ns3.google.com.

;; ADDITIONAL SECTION:
ns1.google.com.         37866   IN      A       216.239.32.10
ns2.google.com.         72394   IN      A       216.239.34.10
ns3.google.com.         35936   IN      A       216.239.36.10
ns4.google.com.         56592   IN      A       216.239.38.10

;; Query time: 1 msec
;; SERVER: 137.138.17.5#53(137.138.17.5)
;; WHEN: Tue May 03 15:27:22 UTC 2016
;; MSG SIZE  rcvd: 415

cyphar commented 8 years ago

You can try bind-mounting the file. You'd have to create the file in the rootfs of your container (manually), then add a bind mount for it in config.json. You could also use pre-start hooks if you really wanted to just copy the file (but then the copy would go out of sync).
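A minimal sketch of the bind-mount approach, assuming the config has been loaded into a Python dict. The helper name `add_resolv_conf_bind` is hypothetical, and the `["bind", "ro"]` options are one plausible choice rather than something confirmed in this thread. The destination file still has to exist in the rootfs beforehand, as described above.

```python
import copy


def add_resolv_conf_bind(config, host_path="/etc/resolv.conf"):
    """Return a copy of an OCI config dict with a read-only bind mount
    of the host's resolv.conf appended to "mounts".

    Hypothetical helper; the mount options are an assumption.
    """
    updated = copy.deepcopy(config)
    updated.setdefault("mounts", []).append({
        "destination": "/etc/resolv.conf",
        "type": "bind",
        "source": host_path,
        "options": ["bind", "ro"],
    })
    return updated


base_config = {
    "mounts": [{"destination": "/proc", "type": "proc", "source": "proc"}]
}
with_resolv = add_resolv_conf_bind(base_config)
print(with_resolv["mounts"][-1])
```

Because this is a bind mount rather than a copy, changes to the host file stay visible in the container, which avoids the out-of-sync problem of the hook approach.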

wking commented 8 years ago

The difficulty with unprivileged net namespaces is with connecting them to the outside world:

$ unshare -nUfr sh
sh-4.3# ip route
sh-4.3# ip addr
1: lo: mtu 65536 qdisc noop state DOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: mtu 1480 qdisc noop state DOWN group default
    link/sit 0.0.0.0 brd 0.0.0.0

To set up that connection, you need someone with privileged access in the runtime namespace [1] to set up a bridge and throw one half of a veth connection over the wall (e.g. [2]), or set up iptables rules, etc., to connect the runtime net namespace with the container net namespace.

In the absence of such a cooperative privileged user, you can still use unprivileged net namespaces for isolated network tests (and you can probably set up subcontainers and have the unprivileged user set up bridging between those subcontainers).

mrunalp commented 8 years ago

Yeah, you need a privileged helper for setting up a veth pair to the host bridge. lxc also uses a privileged helper, called lxc-user-nic, to set up networking for unprivileged containers.

cyphar commented 8 years ago

#807 adds a check to the validator to make sure that a user doesn't end up in this case.
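The kind of check described here could look roughly like the sketch below. This is not the code from that change and not runc's validator; the function name `sysfs_mount_conflict` is hypothetical. It flags a config that requests a user namespace and a sysfs mount without creating a fresh network namespace. Treating a network namespace joined via "path" as a conflict is a conservative assumption: joining a namespace owned by the container's own user namespace would presumably be fine.

```python
def sysfs_mount_conflict(config):
    """Return True if the config mounts sysfs inside a user namespace
    without creating a new network namespace.

    Hypothetical validator sketch, operating on a parsed config.json dict.
    """
    namespaces = config.get("linux", {}).get("namespaces", [])
    has_userns = any(ns.get("type") == "user" for ns in namespaces)
    # An entry without "path" asks the runtime to create a new namespace;
    # an entry with "path" joins an existing one instead.
    creates_netns = any(
        ns.get("type") == "network" and not ns.get("path")
        for ns in namespaces
    )
    mounts_sysfs = any(
        mount.get("type") == "sysfs" for mount in config.get("mounts", [])
    )
    return has_userns and not creates_netns and mounts_sysfs


sysfs_mount = {"destination": "/sys", "type": "sysfs", "source": "sysfs"}

# User namespace, sysfs mount, no network namespace: the failing case.
host_net_config = {
    "mounts": [sysfs_mount],
    "linux": {"namespaces": [{"type": "user"}, {"type": "mount"}]},
}

# Same config with a freshly created network namespace: the working case.
new_net_config = {
    "mounts": [sysfs_mount],
    "linux": {"namespaces": [{"type": "user"}, {"type": "network"}]},
}

print(sysfs_mount_conflict(host_net_config), sysfs_mount_conflict(new_net_config))
```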

nzhang-zh commented 5 years ago

Ran into a similar issue when runc is given a network namespace file.

However, it runs fine if either the namespace file path or the user namespace is removed from config.json.

Is there a workaround for using a network namespace created in the host namespace?

$ sudo runc run hello
container_linux.go:344: starting container process caused "process_linux.go:424: container init caused "rootfs_linux.go:58: mounting "sysfs" to rootfs "/tmp/hello-world/rootfs" at "/sys" caused "operation not permitted"""
$ jq '.' config.json
{
  "ociVersion": "1.0.1-dev",
  "process": {
    "terminal": false,
    "user": {
      "uid": 0,
      "gid": 0
    },
    "args": [
      "/hello"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
      ],
      "effective": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
      ],
      "inheritable": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
      ],
      "permitted": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
      ],
      "ambient": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
      ]
    },
    "rlimits": [
      {
        "type": "RLIMIT_NOFILE",
        "hard": 1024,
        "soft": 1024
      }
    ],
    "noNewPrivileges": true
  },
  "root": {
    "path": "rootfs",
    "readonly": true
  },
  "hostname": "runc",
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620",
        "gid=5"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "tmpfs",
      "source": "shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "sysfs",
      "source": "sysfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "ro"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "cgroup",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    }
  ],
  "linux": {
    "uidMappings": [
      {
        "containerID": 0,
        "hostID": 1000,
        "size": 32000
      }
    ],
    "gidMappings": [
      {
        "containerID": 0,
        "hostID": 1000,
        "size": 32000
      }
    ],
    "resources": {
      "devices": [
        {
          "allow": false,
          "access": "rwm"
        }
      ]
    },
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "network",
        "path": "/var/run/netns/ns1"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "user"
      }
    ],
    "maskedPaths": [
      "/proc/kcore",
      "/proc/latency_stats",
      "/proc/timer_list",
      "/proc/timer_stats",
      "/proc/sched_debug",
      "/sys/firmware",
      "/proc/scsi"
    ],
    "readonlyPaths": [
      "/proc/asound",
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  }
}