hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Host-Volumes + SELinux result in permission denied. #9123

Open · Tetha opened 4 years ago

Tetha commented 4 years ago


Nomad version

Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61)

Operating system and Environment details

These are CentOS 7.8 hosts with SELinux enabled.

Issue

We haven't given up on running as many hosts as possible with SELinux enabled, including the Nomad clients. One change we had to implement for this was configuring the default SELinux label for the Docker driver on the clients:

plugin "docker" {
        volumes {
            selinuxlabel = "z"
        }
}

As far as I understand these labels, the z option labels all Docker volumes as "shared across all containers".

Now I came across another requirement: a service needed some local storage. As a quick workaround, I figured I'd use a host volume to give the task group some local storage and worry about outages later.

Nomad-Client Configuration:

client {
    enabled = true
    cni_config_dir = "/etc/cni/net.d"
    cni_path = "/opt/cni/bin"

    host_volume "internal-service" {
        path = "/var/lib/service"
        read_only = false
    }
}

I also made sure to chown the directory to the UID used inside the container, and to change the SELinux label of /var/lib/service to system_u:object_r:container_file_t, matching other container files.
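
For reference, the preparation was roughly equivalent to the following (a sketch; UID 1000 is a placeholder for whatever UID the container actually runs as):

# hand the directory to the container's user (1000 is a placeholder UID)
chown -R 1000:1000 /var/lib/service

# relabel it so SELinux treats it as container-accessible content
chcon -R -t container_file_t /var/lib/service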

However, no matter what I do, SELinux denies the container access to the file.

As a workaround, I enabled arbitrary volume mounts on the client and mounted the volume using the Docker driver's volumes configuration (with the SELinux label in place), and it works properly.

Job file (if appropriate)

ob "internal-service" {
    type = "service"
    datacenters = [ "dc1" ]

    group "service" {
        count = "1"

        volume "service" {
            type = "host"
            source = "internal-service"
        }

        task "service" {
            driver = "docker"
            leader = true

            /*
            volume_mount {
                volume = "service"
                destination = "/service-data"
            }*/

            config {
                image = "..."
                volumes = [
                    "/var/lib/service:/service-data",
                ]
            }
        }
    }
}

Nomad Client logs (if appropriate)

Sadly, I can't really find the logs anymore, since this situation was two weeks ago and the logs have been rotated. Let me know if I need to retry this with a test job to get the logs of docker / selinux / nomad.

tgross commented 4 years ago

Thanks for opening this @Tetha! Glad to hear you have a workaround but there's definitely something we're missing if the host volume doesn't work but a Docker volume mount does. Just for my clarity: in the jobspec you've provided, the commented-out section is what's not working, which you've worked around with the config.volumes section, right?

Tetha commented 4 years ago

Exactly.

My first attempt at writing the job followed the documentation on host volumes and looked like this:

job "internal-service" {
    type = "service"
    datacenters = [ "dc1" ]

    group "service" {
        count = "1"

        volume "service" {
            type = "host"
            source = "internal-service"
        }

        task "service" {
            driver = "docker"
            leader = true

            volume_mount {
                volume = "service"
                destination = "/service-data"
            }

            config {
                image = "..."
            }
            // ...
        }
    }
}

My current workaround is the job spec from the initial post, combined with this client plugin configuration:

plugin "docker" {
    volumes {
        enabled      = true
        selinuxlabel = "z"
    }
}
tgross commented 3 years ago

Hi @Tetha I got a chance to dig into this a bit and it looks like we're running into a Docker limitation, but one that appears to be intentional.

Any volume we mount with the volume_mount block (host volumes or CSI volumes) gets passed as part of the Docker driver's MountConfig. This is the same as if you were using the mounts block in the Docker driver, as opposed to the volumes block like you're doing above.
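
For comparison, using that mounts path directly from a jobspec looks roughly like this (a sketch; exact field names are per the Docker driver docs for your Nomad version, and the paths are placeholders):

config {
    image = "..."

    # ends up in MountConfig, the same way volume_mount does
    mounts = [
        {
            type     = "bind"
            source   = "/srv/volumeSource1"
            target   = "/local/vagrant"
            readonly = true
        }
    ]
}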

The Docker container resulting from a job that has a volume_mount, a volumes block, and a mounts block looks like the following:

$ docker inspect a224
[
    {
        ...
        "HostConfig": {
            "Binds": [
                "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/alloc:/alloc",
                "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/redis/local:/local",
                "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/redis/secrets:/secrets",
                "/srv/volumeSource0:/local/srv"
            ],
            ...
            "Mounts": [
                {
                    "Type": "bind",
                    "Source": "/srv/volumeSource1",
                    "Target": "/local/vagrant",
                    "ReadOnly": true,
                    "BindOptions": {}
                },
                {
                    "Type": "bind",
                    "Source": "/srv/volumeSource2",
                    "Target": "/test",
                    "ReadOnly": true,
                    "BindOptions": {
                        "Propagation": "rprivate"
                    }
                }
            ],
            ...
        },
        ...
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/redis/local",
                "Destination": "/local",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/redis/secrets",
                "Destination": "/secrets",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/srv/volumeSource0",
                "Destination": "/local/srv",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/srv/volumeSource1",
                "Destination": "/local/vagrant",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/srv/volumeSource2",
                "Destination": "/test",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/nomad/data/allocs/d730cdde-d062-ddc3-d33e-a0240e4e8ebc/alloc",
                "Destination": "/alloc",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        ....

So the mounts block maps to the Docker command line's --mount flag, about which the Docker docs say:

The --mount flag does not support z or Z options for modifying selinux labels.

It looks like their reasoning for this can be found in places like:

https://github.com/moby/moby/issues/36282
https://github.com/moby/moby/issues/30934
https://github.com/docker/cli/pull/832/files
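
In Docker CLI terms, the difference looks like this (illustrative only; /srv/data is a placeholder path):

# -v/--volume accepts the z suffix, so Docker relabels the content:
docker run --rm -v /srv/data:/data:z alpine ls /data

# --mount has no z/Z option; the host label is left untouched,
# so under SELinux the container is denied access:
docker run --rm --mount type=bind,source=/srv/data,target=/data alpine ls /data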

For Nomad, we define the relabelling in the client configuration, which is privileged, so the destructive possibilities here are lessened (although it could still be a nasty footgun for someone). I'm still trying to figure out the right way to handle this problem and what we can do about it in the Nomad driver. So I just wanted to check in and let you know it's at least been looked at, but it's probably not going to get fixed in Nomad 1.0.0.

Ramblurr commented 2 years ago

I believe I'm running into this issue when trying to use the Ceph CSI plugin. My Nomad clients are Fedora Server systems with SELinux in enforcing mode.

Despite enabling the container_use_cephfs SELinux boolean, my containers cannot access the mounted volume:

# running a `ls /srv` in the container results in this selinux denial
type=AVC msg=audit(1638965583.753:493): avc:  denied  { read } for  pid=2761 comm="ls" name="/" dev="rbd0" ino=2 scontext=system_u:system_r:container_t:s0:c583,c1011 tcontext=system_u:object_r:unlabeled_t:s0 tclass=dir permissive=0
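
For reference, the boolean was set and the denial dug out of the audit log roughly like this (commands are illustrative):

# enable the boolean persistently
setsebool -P container_use_cephfs on

# verify it took effect
getsebool container_use_cephfs

# show recent AVC denials
ausearch -m avc -ts recent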

docker inspect reveals:

...
 "HostConfig": {
            "Binds": [
                "/var/lib/nomad/alloc/32159918-57d9-1663-544b-fa5d415712c7/alloc:/alloc",
                "/var/lib/nomad/alloc/32159918-57d9-1663-544b-fa5d415712c7/mysql-server/local:/local",
                "/var/lib/nomad/alloc/32159918-57d9-1663-544b-fa5d415712c7/mysql-server/secrets:/secrets"
            ],
....
            "Mounts": [
                {
                    "Type": "bind",
                    "Source": "/var/lib/nomad/client/csi/node/ceph-csi/per-alloc/32159918-57d9-1663-544b-fa5d415712c7/ceph-mysql-test-tf2/rw-file-system-single-node-writer",
                    "Target": "/srv",
                    "BindOptions": {
                        "Propagation": "rprivate"
                    }
                }
            ],
...

Aside from disabling SELinux, is there a workaround? Or is the use of Nomad volumes with SELinux just not supported yet?

Edit: one "workaround" I've identified is to disable selinux on a per-container basis by passing security_opt = ["label=disable"] in the docker config of the job. This is better than disabling selinux entirely, but it is still not viable workaround for as it has to be applied to every stateful workload.

tgross commented 2 years ago

Edit: one "workaround" I've identified is to disable selinux on a per-container basis by passing security_opt = ["label=disable"] in the docker config of the job. This is better than disabling selinux entirely, but it is still not viable workaround for as it has to be applied to every stateful workload.

Yeah, that's currently the only reasonable workaround. We need to review how to handle the relabelling question safely and we haven't had a chance to do so yet.