hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad Failing To Create Job when Docker "userns-remap": "default" #8459

Open adawalli opened 3 years ago

adawalli commented 3 years ago

If filing a bug please include the following:

Nomad version

Nomad v0.12.0 (8f7fbc8e7b5a4ed0d0209968faf41b238e6d5817)

Operating system and Environment details

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.6 LTS
Release:        16.04
Codename:       xenial

Issue

By default, we run our Docker daemon with userns-remap=default. In this configuration, even the simplest job file (e.g., from nomad init -short) fails.

Docker Daemon File

{
  "hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2376"],
  "labels": ["is-our-remote-engine=true"],
  "tls": true,
  "tlsverify": true,
  "tlscacert": "/etc/docker/ca.pem",
  "tlscert": "/etc/docker/cert.pem",
  "tlskey": "/etc/docker/key.pem",
  "data-root": "/data/docker-storage",
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "1m",
    "max-file": "10"
  },
  "userns-remap": "default",
  "bip": "192.168.2.1/24"
}

Reproduction steps

Job file (if appropriate)

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500
        memory = 256

        network {
          mbits = 10
          port "db" {}
        }
      }
    }
  }
}

Nomad Client logs (if appropriate)

Nomad Log Snippet


client.driver_mgr.docker: created container: driver=docker container_id=fec6cb88295e45fcaad69e99cf96c134ff4a7ff6336c002080d0d45b2e34e205
client.driver_mgr.docker: failed to start container: driver=docker container_id=fec6cb88295e45fcaad69e99cf96c134ff4a7ff6336c002080d0d45b2e34e205 ntainer_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/data/nomad/alloc/tfs \\\"/data/docker-storage/100000.100000/overlay2/8905d8ffa72be2d9f0c97b4971e5b50b079b9a46a200bdfac6d255117f5f26fd/merged\\\" at \\\"/alloc\\\" caused 5e3cab1b0d/alloc: permission denied\\\"\"": unknown"
 2020-07-17T12:14:27.679-0700 [ERROR] client.driver_mgr.docker: failed to start container: driver=docker 336c002080d0d45b2e34e205 error="API error (400): OCI runtime create failed: container_linux.go:349: starting container process caused _linux.go:58: mounting \\\"/data/nomad/alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc\\\" to rootfs \\\"/data/docker-storage/100000.100000/bdfac6d255117f5f26fd/merged\\\" at \\\"/alloc\\\" caused \\\"stat /data/nomad/alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc: permission denied\\\"\"": 
 2020-07-17T12:14:27.685-0700 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d task=redis 9e99cf96c134ff4a7ff6336c002080d0d45b2e34e205: API error (400): OCI runtime create failed: container_linux.go:349: starting container process caused _linux.go:58: mounting \\\"/data/nomad/alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc\\\" to rootfs \\\"/data/docker-storage/100000.100000/bdfac6d255117f5f26fd/merged\\\" at \\\"/alloc\\\" caused \\\"stat /data/nomad/alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc: permission denied\\\"\"": 
 2020-07-17T12:14:27.685-0700 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d task=redis 
client.alloc_runner.task_runner: running driver failed: alloc_id=f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d task=redis error="Failed to start container 45b2e34e205: API error (400): OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init /alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc\\\" to rootfs \\\"/data/docker-storage/100000.100000/bdfac6d255117f5f26fd/merged\\\" at \\\"/alloc\\\" caused \\\"stat /data/nomad/alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc: permission denied\\\"\"": 
client.alloc_runner.task_runner: not restarting task: alloc_id=f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d task=redis reason="Error was unrecoverable"
 2020-07-17T12:14:27.686-0700 [INFO]  client.gc: marking allocation for GC: alloc_id=f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d
client.gc: marking allocation for GC: alloc_id=f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d

It looks like root owns /data/nomad/alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc, but /data/docker-storage/100000.100000/bdfac6d255117f5f26fd/merged is of course user namespaced.

To keep things simple, I am running the Nomad client node as root (although I would prefer to add it to the docker group later on).

Any ideas?
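
A rough way to test this suspicion is to check whether the remapped root UID can traverse every directory leading to the allocation directory. The sketch below assumes GNU stat and takes UID/GID 100000 from the Docker storage path above; can_reach is a hypothetical helper name, and it looks only at plain owner/group/other mode bits (no supplementary groups, ACLs, or capabilities).

```shell
#!/bin/sh
# can_reach UID GID PATH
# Returns 0 if UID/GID has the execute (search) bit on every ancestor
# directory of PATH, judged by plain mode bits only. Hypothetical helper;
# requires GNU stat for the -c option.
can_reach() {
  uid=$1
  gid=$2
  dir=$(dirname "$3")
  # Real root bypasses directory permission checks.
  [ "$uid" -eq 0 ] && return 0
  while :; do
    owner=$(stat -c %u "$dir") || return 2
    group=$(stat -c %g "$dir") || return 2
    mode=$(stat -c %a "$dir") || return 2
    # Pick the execute bit that applies to this UID/GID (decimal values
    # of octal 0100, 0010, 0001).
    if [ "$uid" -eq "$owner" ]; then
      bit=64
    elif [ "$gid" -eq "$group" ]; then
      bit=8
    else
      bit=1
    fi
    # The leading 0 makes the shell read the mode as an octal constant.
    if [ $(( 0$mode & $bit )) -eq 0 ]; then
      echo "uid $uid cannot traverse $dir (owner $owner:$group, mode $mode)"
      return 1
    fi
    [ "$dir" = "/" ] && return 0
    dir=$(dirname "$dir")
  done
}

# Example (run as root so that stat itself can read every directory):
#   can_reach 100000 100000 /data/nomad/alloc/<alloc_id>/alloc
```

If the function reports a directory, that directory is a candidate cause of the "permission denied" on stat in the log above.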

notnoop commented 3 years ago

Thank you so much for reporting the issue. We don't actively support userns mode with Docker right now, I'm afraid. It would be nice to add such support in the future, and we'll need to do some research there (e.g. how it affects the volume/networking integrations that Nomad 0.11/0.12 just added).

In the short term, I'm curious whether making dockremap the owner of /data/nomad (or granting it write access) would help.

adawalli commented 3 years ago

@notnoop - Do you have a secure deployment guide you recommend for Docker, then? userns is typically used as a best practice to help mitigate container escalation.

shishir-a412ed commented 3 years ago

@adawalli Works for me!

Running docker daemon under user namespaces with a remapped root smahajan

root@smahajan-VirtualBox:/tmp# cat /etc/subuid
smahajan:100000:65536
root@smahajan-VirtualBox:/tmp# cat /etc/subgid
smahajan:100000:65536
root@smahajan-VirtualBox:/tmp# systemctl cat docker
# /lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
BindsTo=containerd.service
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --userns-remap=smahajan
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always

ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --userns-remap=smahajan

$ nomad job init -short
Example job file written to example.nomad

$ nomad job run example.nomad
$ nomad job status
ID       Type     Priority  Status   Submit Date
example  service  50        running  2020-07-20T14:49:49-07:00
root@smahajan-VirtualBox:/tmp# docker top $(docker ps -lq)
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
100999              9502                9476                0                   14:49               ?                   00:00:01            redis-server *:6379

root@smahajan-VirtualBox:/tmp# nomad alloc exec -i -t 3bcaee6f /bin/bash
root@ead293e83d51:/data# id -u
0
root@ead293e83d51:/data# id -g
0

Non-root on the host, root inside the container. I am wondering why you are setting userns-remap: default. Should this be userns-remap: dockremap?

shishir-a412ed commented 3 years ago

Oh I see

When you configure Docker to use the userns-remap feature, you can optionally specify an existing user and/or group, or you can specify default. If you specify default, a user and group dockremap is created and used for this purpose.

default just creates dockremap for you! Can you try creating the mapping manually with another existing user and see if that resolves the issue?
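
To make the mapping concrete, here is a small sketch that translates a container UID into the host UID implied by a subuid-format file. host_uid is a hypothetical helper name, and the sketch assumes a single range per user, as in the /etc/subuid shown earlier. With the entry smahajan:100000:65536, container UID 999 maps to host UID 100999, which is the UID that docker top reported for redis-server.

```shell
#!/bin/sh
# host_uid SUBUID_FILE USER CONTAINER_UID
# Prints the host UID that CONTAINER_UID maps to under userns-remap, using
# the first "user:start:count" entry for USER in SUBUID_FILE.
# Fails if USER has no entry or CONTAINER_UID is outside the range.
host_uid() {
  awk -F: -v user="$2" -v cuid="$3" '
    $1 == user {
      # Container UID n maps to start + n, as long as n < count.
      if (cuid + 0 < $3 + 0) { print $2 + cuid; found = 1 }
      exit
    }
    END { exit !found }
  ' "$1"
}

# Example:
#   host_uid /etc/subuid smahajan 999
```

The same arithmetic applies to GIDs when the function is pointed at a subgid-format file.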

shishir-a412ed commented 3 years ago

@adawalli Works for me with userns-remap=default too!

adawalli commented 3 years ago

Let me give this another try in the next few days - will report back!

adawalli commented 3 years ago

What user are you running nomad under - root? or under smahajan?

shishir-a412ed commented 3 years ago

What user are you running nomad under - root? or under smahajan?

@adawalli root

adawalli commented 3 years ago

It honestly doesn't make sense to me that the root partition could be viewed by a namespaced process - isn't that a breakdown of namespacing?

that's why

: mounting \\\"/data/nomad/alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc\\\" to rootfs \\\"/data/docker-storage/100000.100000/bdfac6d255117f5f26fd/merged\\\" at \\\"/alloc\\\" caused \\\"stat /data/nomad/alloc/f79d1ebb-31d7-673c-a80f-dc5e3cab1b0d/alloc: permission denied\\\"\"": 

actually seemed like a pretty reasonable error

And sorry, not able to reproduce your behavior with dockremap - I will try it with a manually created user just to close up that loose end as well.

adawalli commented 3 years ago

The following also did not work

This is kind of a big bummer, because there are plenty of upstream docker images (e.g., traefik) that don't add a non-root user to their default image. I am surprised that more folks aren't impacted by this limitation. Is everyone really running docker as root and not enforcing userns??

shishir-a412ed commented 3 years ago

@adawalli What is your docker root location?

docker info | grep "Docker Root"

and is the root location owned by root or non-root?

adawalli commented 3 years ago

@shishir-a412ed - I really appreciate you helping me and continuing to ask questions. My hope is not lost in the internet!

$ docker info | grep Root
 Docker Root Dir: /data/docker-storage/100000.100000
$ ls -lha /data
total 36K
drwxr-xr-x  6 root root      4.0K Jul 21 16:54 .
drwxr-xr-x 24 root root      4.0K Jun  2 06:02 ..
drwxrwxr-x  5 root blackduck 4.0K Jun 29 19:15 blackduck
drwx--x--x  3 root root      4.0K Jul 21 16:54 docker-storage
drwx------  2 root root       16K May 11 08:32 lost+found
drwx------  4 root bin       4.0K Jul 21 14:18 nomad
$ sudo ls -la /data/docker-storage/100000.100000
total 56
drwx------ 14 100000 100000 4096 Jul 21 16:54 .
drwx--x--x  3 root   root   4096 Jul 21 16:54 ..
drwx------  2 root   root   4096 Jul 21 16:54 builder
drwx--x--x  4 root   root   4096 Jul 21 16:54 buildkit
drwx------  2 100000 100000 4096 Jul 21 16:56 containers
drwx------  3 root   root   4096 Jul 21 16:54 image
drwxr-x---  3 root   root   4096 Jul 21 16:54 network
drwx------  9 100000 100000 4096 Jul 21 16:56 overlay2
drwx------  4 root   root   4096 Jul 21 16:54 plugins
drwx------  2 root   root   4096 Jul 21 16:54 runtimes
drwx------  2 root   root   4096 Jul 21 16:54 swarm
drwx------  2 100000 100000 4096 Jul 21 16:54 tmp
drwx------  2 root   root   4096 Jul 21 16:54 trust
drwx------  5 100000 100000 4096 Jul 21 16:56 volumes

shishir-a412ed commented 3 years ago

@adawalli No worries! I think the problem is that your docker root location (/data) is owned by root. When you launch the container, the container root filesystem (rootfs) needs to be mounted in the container. In your case this won't work, since the container rootfs on the host is owned by root, and Docker is trying to mount it inside a container whose root is remapped (not real root).

This is okay with the default docker root location, /var/lib/docker/100000.100000, since that is not owned by root. Can you try to chown your docker root location, launch a nomad job, and see if you still get the permission error?

chown 100000:100000 -R /data

Another option: if you are setting your docker root manually somewhere in your configuration, clear it so that it falls back to the default root location, which is /var/lib/docker/<remapped_root>.

adawalli commented 3 years ago

That was a worthy thing to try, but unfortunately, same results. It still doesn't seem to like the root-owned partition from /alloc in nomad mounting into that namespace.

I even rolled back the storage location as you recommended, with exactly the same results. FWIW, I am using Docker version 19.03.11, build 42e35e61f3.

shishir-a412ed commented 3 years ago

I thought we were chowning the entire /data (based on my comment above)?

Why is /alloc in nomad still root-owned?

adawalli commented 3 years ago

OK, so I wasn't comfortable running chown recursively on nomad's data folder. Why not? Because the nomad process, running as root, creates folders as it manages allocations, and those will be owned by root as it creates them.

However, chowning just the root of the nomad data folder appears to be enough!

sudo ls -lha nomad
total 20K
drwxr-x--- 4 100000 100000 4.0K Jul 22 06:03 .
drwxr-xr-x 6 root   root   4.0K Jul 22 06:04 ..
drwx--x--x 3 root   bin    4.0K Jul 22 06:05 alloc
-rw-r--r-- 1 root   bin     394 Jul 22 06:03 checkpoint-signature
drwx------ 2 root   bin    4.0K Jul 22 06:02 client

I am now able to run containers with namespacing - I really hope that the nomad team puts a priority on adding this support in properly, but glad I can forge ahead for the moment.
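
The ownership fix described above could be scripted roughly as follows. This is a sketch, not a supported procedure: remap_owner_of is a hypothetical helper that reads the first subordinate UID and GID for a user from subuid/subgid-format files (dockremap is the user Docker creates for userns-remap=default). A plausible reason the non-recursive chown is enough, inferred only from the listings in this thread, is that /data/nomad was previously drwx------ root:bin, which UID 100000 could not traverse, while the alloc directory beneath it already allows traversal by others (drwx--x--x).

```shell
#!/bin/sh
# remap_owner_of USER [SUBUID_FILE] [SUBGID_FILE]
# Prints "UID:GID" built from the first subordinate UID and GID assigned to
# USER. Hypothetical helper; the file arguments default to /etc/subuid and
# /etc/subgid.
remap_owner_of() {
  user=$1
  subuid=${2:-/etc/subuid}
  subgid=${3:-/etc/subgid}
  uid=$(awk -F: -v u="$user" '$1 == u { print $2; exit }' "$subuid")
  gid=$(awk -F: -v u="$user" '$1 == u { print $2; exit }' "$subgid")
  # Fail rather than emit a partial owner string.
  [ -n "$uid" ] && [ -n "$gid" ] || return 1
  echo "$uid:$gid"
}

# Example (as root; note there is no -R, only the top-level directory changes):
#   chown "$(remap_owner_of dockremap)" /data/nomad
```

Because Nomad runs as root here, it can still create and manage its root-owned subdirectories after the top-level owner changes.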

shishir-a412ed commented 3 years ago

@adawalli ok, so I wasn't comfortable running chown recursively on nomad's data folder.

Yeah, I understand! I was trying to validate (and understand) the issue. My understanding was that once nomad hands the task over to the docker driver to launch, it should have nothing more to do with docker, and docker should manage the namespace itself.

But it looks like nomad mounts its own directories into the container at container start, so they need to be accessible to the remapped (non-root) user too. Yeah, maybe nomad will support this use case better in the future. I don't work for HashiCorp so I cannot help you there :) Glad you have a workaround for now!