hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Host-Volumes + SELinux result in permission denied. #9123

Open · Tetha opened 4 years ago

Tetha commented 4 years ago


Nomad version

Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61)

Operating system and Environment details

These are CentOS 7.8 hosts with SELinux enabled.

Issue

We haven't given up on running as many hosts as possible with SELinux enabled, including the Nomad clients. One change we had to implement for this was configuring the default SELinux label for the Docker driver on the clients:

plugin "docker" {
        volumes {
            selinuxlabel = "z"
        }
}

As far as I understand these labels, the z option labels all Docker volumes as "shared across all containers".

Now I came across another requirement: a service needed some local storage. As a quick workaround, I figured I'd use a host volume to give the task group some local storage and worry about outages later.

Nomad-Client Configuration:

client {
    enabled = true
    cni_config_dir = "/etc/cni/net.d"
    cni_path = "/opt/cni/bin"

    host_volume "internal-service" {
        path = "/var/lib/service"
        read_only = false
    }
}

I also made sure to chown the directory to the UID used inside the container, and to change the SELinux label of /var/lib/service to system_u:object_r:container_file_t, matching other container files.
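
For reference, the preparation was roughly equivalent to the following (a sketch; UID 1000 is a placeholder for whatever UID the container actually runs as):

# hand the directory to the container's user (1000 is a placeholder UID)
chown -R 1000:1000 /var/lib/service

# relabel it so SELinux treats it as container-accessible content
chcon -R -t container_file_t /var/lib/service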

However, no matter what I do, SELinux denies the container access to the file.

As a workaround, I enabled arbitrary volume mounts on the client and mounted the volume using the Docker driver's volumes configuration (with the SELinux label in place), and it works properly.

Job file (if appropriate)

ob "internal-service" {
    type = "service"
    datacenters = [ "dc1" ]

    group "service" {
        count = "1"

        volume "service" {
            type = "host"
            source = "internal-service"
        }

        task "service" {
            driver = "docker"
            leader = true

            /*
            volume_mount {
                volume = "service"
                destination = "/service-data"
            }*/

            config {
                image = "..."
                volumes = [
                    "/var/lib/service:/service-data",
                ]
            }
        }
    }
}

Nomad Client logs (if appropriate)

Sadly, I can't really find the logs anymore, since this situation was two weeks ago and the logs have been rotated. Let me know if I need to retry this with a test job to get the logs of docker / selinux / nomad.

tgross commented 4 years ago

Thanks for opening this @Tetha! Glad to hear you have a workaround but there's definitely something we're missing if the host volume doesn't work but a Docker volume mount does. Just for my clarity: in the jobspec you've provided, the commented-out section is what's not working, which you've worked around with the config.volumes section, right?

Tetha commented 4 years ago

Exactly.

My first attempt at writing the job followed the documentation on host volumes and looked like this:

job "internal-service" {
    type = "service"
    datacenters = [ "dc1" ]

    group "service" {
        count = "1"

        volume "service" {
            type = "host"
            source = "internal-service"
        }

        task "service" {
            driver = "docker"
            leader = true

            volume_mount {
                volume = "service"
                destination = "/service-data"
            }

            config {
                image = "..."
            }
            // ...
        }
    }
}

My current workaround is the job spec from the initial post, combined with this client plugin configuration:

plugin "docker" {
    volumes {
        enabled      = true
        selinuxlabel = "z"
    }
}
tgross commented 3 years ago

Hi @Tetha I got a chance to dig into this a bit and it looks like we're running into a Docker limitation, but one that appears to be intentional.

Any volume we mount with the volume_mount block (host volumes or CSI volumes) gets passed as part of the Docker driver's MountConfig. This is the same as if you were using the mounts block in the Docker driver, as opposed to the volumes block like you're doing above.
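
For comparison, using that mounts path directly from a jobspec looks roughly like this (a sketch; exact field names are per the Docker driver docs for your Nomad version, and the paths are placeholders):

config {
    image = "..."

    # ends up in MountConfig, the same way volume_mount does
    mounts = [
        {
            type     = "bind"
            source   = "/srv/volumeSource1"
            target   = "/local/vagrant"
            readonly = true
        }
    ]
}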

The Docker container resulting from a job that has a volume_mount, a volumes block, and a mounts block looks like the following:

$ docker inspect a224
[
    {
        ...
        "HostConfig": {
            "Binds": [
                "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/alloc:/alloc",
                "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/redis/local:/local",
                "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/redis/secrets:/secrets",
                "/srv/volumeSource0:/local/srv"
            ],
            ...
            "Mounts": [
                {
                    "Type": "bind",
                    "Source": "/srv/volumeSource1",
                    "Target": "/local/vagrant",
                    "ReadOnly": true,
                    "BindOptions": {}
                },
                {
                    "Type": "bind",
                    "Source": "/srv/volumeSource2",
                    "Target": "/test",
                    "ReadOnly": true,
                    "BindOptions": {
                        "Propagation": "rprivate"
                    }
                }
            ],
            ...
        },
        ...
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/redis/local",
                "Destination": "/local",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/redis/secrets",
                "Destination": "/secrets",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/srv/volumeSource0",
                "Destination": "/local/srv",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/srv/volumeSource1",
                "Destination": "/local/vagrant",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/srv/volumeSource2",
                "Destination": "/test",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/alloc",
                "Destination": "/alloc",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        ....

So the mounts block maps to the Docker command line's --mount flag, about which the Docker docs say:

The --mount flag does not support z or Z options for modifying selinux labels.

It looks like their reasoning for this can be found in places like:

https://github.com/moby/moby/issues/36282
https://github.com/moby/moby/issues/30934
https://github.com/docker/cli/pull/832/files
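
In Docker CLI terms, the difference looks like this (illustrative only; /srv/data is a placeholder path):

# -v/--volume accepts the z suffix, so Docker relabels the content:
docker run --rm -v /srv/data:/data:z alpine ls /data

# --mount has no z/Z option; the host label is left untouched,
# so under SELinux the container is denied access:
docker run --rm --mount type=bind,source=/srv/data,target=/data alpine ls /data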

For Nomad, we define the relabelling in the client configuration, which is privileged, so the destructive possibilities here are lessened (although it could still be a nasty footgun for someone). I'm still trying to figure out the right way to handle this problem and what we can do about it in the Nomad driver. So I just wanted to check in and let you know it's at least been looked at, but it's probably not going to get fixed in Nomad 1.0.0.

Ramblurr commented 2 years ago

I believe I'm running into this issue when trying to use the Ceph CSI plugin. My Nomad clients are Fedora Server systems with SELinux in enforcing mode.

Despite enabling the container_use_cephfs SELinux boolean, my containers cannot access the mounted volume:

# running a `ls /srv` in the container results in this selinux denial
type=AVC msg=audit(1638965583.753:493): avc:  denied  { read } for  pid=2761 comm="ls" name="/" dev="rbd0" ino=2 scontext=system_u:system_r:container_t:s0:c583,c1011 tcontext=system_u:object_r:unlabeled_t:s0 tclass=dir permissive=0
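
For reference, the boolean was set and the denial dug out of the audit log roughly like this (commands are illustrative):

# enable the boolean persistently
setsebool -P container_use_cephfs on

# verify it took effect
getsebool container_use_cephfs

# show recent AVC denials
ausearch -m avc -ts recent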

docker inspect reveals:

...
 "HostConfig": {
            "Binds": [
                "/var/lib/nomad/alloc/32159918-57d9-1663-544b-fa5d415712c7/alloc:/alloc",
                "/var/lib/nomad/alloc/32159918-57d9-1663-544b-fa5d415712c7/mysql-server/local:/local",
                "/var/lib/nomad/alloc/32159918-57d9-1663-544b-fa5d415712c7/mysql-server/secrets:/secrets"
            ],
....
            "Mounts": [
                {
                    "Type": "bind",
                    "Source": "/var/lib/nomad/client/csi/node/ceph-csi/per-alloc/32159918-57d9-1663-544b-fa5d415712c7/ceph-mysql-test-tf2/rw-file-system-single-node-writer",
                    "Target": "/srv",
                    "BindOptions": {
                        "Propagation": "rprivate"
                    }
                }
            ],
...

Aside from disabling SELinux, is there a workaround? Or is the use of Nomad volumes with SELinux just not supported yet?

Edit: one "workaround" I've identified is to disable selinux on a per-container basis by passing security_opt = ["label=disable"] in the docker config of the job. This is better than disabling selinux entirely, but it is still not viable workaround for as it has to be applied to every stateful workload.

tgross commented 2 years ago

Edit: one "workaround" I've identified is to disable selinux on a per-container basis by passing security_opt = ["label=disable"] in the docker config of the job. This is better than disabling selinux entirely, but it is still not viable workaround for as it has to be applied to every stateful workload.

Yeah, that's currently the only reasonable workaround. We need to review how to handle the relabelling question safely and we haven't had a chance to do so yet.