Cannot start ceph_csi controller #15874

Closed acziryak closed 1 year ago

acziryak commented 1 year ago

Nomad version

Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)

Operating system and Environment details

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"
PRETTY_NAME="Ubuntu 22.04 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Issue

The Ceph CSI controller does not start when following the instructions here: https://docs.ceph.com/en/latest/rbd/rbd-nomad/

Reproduction steps

Worker node config:

data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/data/plugins"

region = "us"

plugin "raw_exec" {
  config {
    enabled = true
  }
}

plugin "docker" {
  config {
    allow_privileged = true
    allow_caps = ["all"]

    volumes {
      enabled = true
    }

    auth {
      config = "/opt/nomad/docker-auth.json"
    }
  }
}

tls {
  http = true
  rpc = true

  ca_file = "/opt/nomad/tls/ca.crt"
  cert_file = "/opt/nomad/tls/node.crt"
  key_file = "/opt/nomad/tls/node.key"

  verify_server_hostname = true
}

vault {
  enabled = true
  create_from_role = "nomad-cluster"
}

acl {
  enabled = true
}

Expected Result

The Ceph CSI plugin starts up correctly, without permission-denied errors on either /sys/fs/cgroup//pids.max or /csi.

Actual Result

While investigating the errors below, I found that /csi is mounted as nobody:nobody, which renders it inaccessible from the server.

I0125 19:31:55.496632       8 cephcsi.go:192] Driver version: v3.7.2 and Git version: 47b59ee5a430f66a88913bea1a6ac1961c8ff552
I0125 19:31:55.497015       8 cephcsi.go:210] Initial PID limit is set to 38397
E0125 19:31:55.497084       8 cephcsi.go:214] Failed to set new PID limit to -1: open /sys/fs/cgroup//pids.max: permission denied
I0125 19:31:55.497405       8 cephcsi.go:241] Starting driver type: rbd with name: rbd.csi.ceph.com
I0125 19:31:55.497688       8 driver.go:94] Enabling controller service capability: CREATE_DELETE_VOLUME
I0125 19:31:55.497709       8 driver.go:94] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0125 19:31:55.497717       8 driver.go:94] Enabling controller service capability: CLONE_VOLUME
I0125 19:31:55.497724       8 driver.go:94] Enabling controller service capability: EXPAND_VOLUME
I0125 19:31:55.497733       8 driver.go:107] Enabling volume access mode: SINGLE_NODE_WRITER
I0125 19:31:55.497741       8 driver.go:107] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0125 19:31:55.497757       8 driver.go:107] Enabling volume access mode: SINGLE_NODE_SINGLE_WRITER
I0125 19:31:55.497773       8 driver.go:107] Enabling volume access mode: SINGLE_NODE_MULTI_WRITER
W0125 19:31:55.497791       8 driver.go:162] replication service running on controller server is deprecated and replaced by CSI-Addons, see https://github.com/ceph/ceph-csi/issues/3314 for more details
F0125 19:31:55.497885       8 server.go:97] Failed to remove /csi/csi.sock, error: remove /csi/csi.sock: permission denied
I0125 19:31:55.497891       8 server.go:114] listening for CSI-Addons requests on address: &net.UnixAddr{Name:"/tmp/csi-addons.sock", Net:"unix"}
I0125 19:32:11.602401       7 cephcsi.go:192] Driver version: v3.7.2 and Git version: 47b59ee5a430f66a88913bea1a6ac1961c8ff552
I0125 19:32:11.602653       7 cephcsi.go:210] Initial PID limit is set to 38397
E0125 19:32:11.602700       7 cephcsi.go:214] Failed to set new PID limit to -1: open /sys/fs/cgroup//pids.max: permission denied
I0125 19:32:11.602930       7 cephcsi.go:241] Starting driver type: rbd with name: rbd.csi.ceph.com
I0125 19:32:11.603121       7 driver.go:94] Enabling controller service capability: CREATE_DELETE_VOLUME
I0125 19:32:11.603134       7 driver.go:94] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0125 19:32:11.603138       7 driver.go:94] Enabling controller service capability: CLONE_VOLUME
I0125 19:32:11.603142       7 driver.go:94] Enabling controller service capability: EXPAND_VOLUME
I0125 19:32:11.603148       7 driver.go:107] Enabling volume access mode: SINGLE_NODE_WRITER
I0125 19:32:11.603152       7 driver.go:107] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0125 19:32:11.603156       7 driver.go:107] Enabling volume access mode: SINGLE_NODE_SINGLE_WRITER
I0125 19:32:11.603160       7 driver.go:107] Enabling volume access mode: SINGLE_NODE_MULTI_WRITER
W0125 19:32:11.603170       7 driver.go:162] replication service running on controller server is deprecated and replaced by CSI-Addons, see https://github.com/ceph/ceph-csi/issues/3314 for more details
F0125 19:32:11.603244       7 server.go:97] Failed to remove /csi/csi.sock, error: remove /csi/csi.sock: permission denied

This results in a failed job. I'm not sure whether /csi should be mounted as nobody:nobody or populated from somewhere in the filesystem, but the only csi directories I can find under that allocation are owned by user 10000 and have no sockets inside them.

Job file (if appropriate)

{
  "Stop": false,
  "Region": "us",
  "Namespace": "default",
  "ID": "ceph-csi-controller-ceph_csi-us-ind-test",
  "ParentID": "",
  "Name": "ceph-csi-controller-ceph_csi-us-ind-test",
  "Type": "service",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "ind-nonprod2"
  ],
  "Constraints": null,
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "ceph-csi-controller-ceph_csi-us-ind-test",
      "Count": 1,
      "Update": {
        "Stagger": 30000000000,
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000,
        "ProgressDeadline": 600000000000,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Constraints": [
        {
          "LTarget": "${attr.consul.version}",
          "RTarget": ">= 1.7.0",
          "Operand": "semver"
        }
      ],
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 2,
        "Interval": 1800000000000,
        "Delay": 15000000000,
        "Mode": "fail"
      },
      "Tasks": [
        {
          "Name": "ceph-csi-controller-ceph_csi-us-ind-test",
          "Driver": "docker",
          "User": "",
          "Config": {
            "image": "quay.io/cephcsi/cephcsi:v3.7.2",
            "volumes": [
              "./local/config.json:/etc/ceph-csi-config/config.json"
            ],
            "mounts": [
              {
                "type": "tmpfs",
                "target": "/tmp/csi/keys",
                "readonly": false,
                "tmpfs_options": {
                  "size": 1000000
                }
              }
            ],
            "args": [
              "--type=rbd",
              "--controllerserver=true",
              "--drivername=rbd.csi.ceph.com",
              "--endpoint=unix://csi/csi.sock",
              "--nodeid=${node.unique.name}",
              "--instanceid=${node.unique.name}-controller",
              "--logtostderr=true",
              "--v=5",
              "--metricsport=${NOMAD_PORT_metrics}"
            ]
          },
          "Env": null,
          "Services": [
            {
              "Name": "ceph-csi-controller",
              "TaskName": "ceph-csi-controller-ceph_csi-us-ind-test",
              "PortLabel": "metrics",
              "AddressMode": "auto",
              "Address": "",
              "EnableTagOverride": false,
              "Tags": [
                "prometheus"
              ],
              "CanaryTags": null,
              "Checks": null,
              "Connect": null,
              "Meta": null,
              "CanaryMeta": null,
              "TaggedAddresses": null,
              "Namespace": "default",
              "OnUpdate": "require_healthy",
              "Provider": "consul"
            }
          ],
          "Vault": null,
          "Templates": [
            {
              "SourcePath": "",
              "DestPath": "local/config.json",
              "EmbeddedTmpl": "[{\n    \"clusterID\": \"******************************\",\n    \"monitors\": [\"*****,\"\"*****,\"\"*****,\"\"*****,\"\"*****\"    ]\n}]\n",
              "ChangeMode": "restart",
              "ChangeSignal": "",
              "ChangeScript": null,
              "Splay": 5000000000,
              "Perms": "0644",
              "Uid": null,
              "Gid": null,
              "LeftDelim": "{{",
              "RightDelim": "}}",
              "Envvars": false,
              "VaultGrace": 0,
              "Wait": null,
              "ErrMissingKey": false
            }
          ],
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 500,
            "Cores": 0,
            "MemoryMB": 256,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 2,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": {
            "ID": "ceph-csi",
            "Type": "controller",
            "MountDir": "/csi",
            "StagePublishBaseDir": "/local/csi",
            "HealthTimeout": 30000000000
          }
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": [
        {
          "Mode": "",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "Hostname": "",
          "MBits": 0,
          "DNS": null,
          "ReservedPorts": null,
          "DynamicPorts": [
            {
              "Label": "metrics",
              "Value": 0,
              "To": 0,
              "HostNetwork": "default"
            }
          ]
        }
      ],
      "Consul": {
        "Namespace": ""
      },
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 30000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "Meta": null,
  "ConsulToken": "",
  "ConsulNamespace": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "587d579f-8609-ed1d-927d-2d8ad54c2b1c",
  "Status": "running",
  "StatusDescription": "",
  "Stable": true,
  "Version": 0,
  "SubmitTime": 1674674304925691600,
  "CreateIndex": 2172,
  "ModifyIndex": 2247,
  "JobModifyIndex": 2172
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

acziryak commented 1 year ago

FWIW, I do see the directory being created on the worker node:

root@ind-test-nomad-worker13:/opt/nomad/data/alloc/da0eb8db-4ea4-bcb8-9b8e-92c161ac4652/ceph-csi-controller-ceph_csi-us-ind-test/local# ls -al
total 16
drwxrwxrwx 3 nobody nogroup 4096 Jan 25 14:20 .
drwxrwxrwx 5 nobody nogroup 4096 Jan 25 14:19 ..
-rw-r--r-- 1 root   root     155 Jan 25 14:19 config.json
drwxr-xr-x 2 100000  100000 4096 Jan 25 14:20 csi

lgfa29 commented 1 year ago

Hi @acziryak 👋

Just to double-check, are you running the Nomad agent as root?

acziryak commented 1 year ago

AFAICT, yes

# ps -ef | grep nomad
root      204168       1  0 13:04 ?        00:02:21 /usr/bin/nomad agent -config /etc/nomad.d

acziryak commented 1 year ago

FWIW, I also see that the /sys/fs/cgroup/pids.max file is owned by nobody:nobody as well when viewed from inside the container.
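
A possible reading of the nobody:nobody ownership, offered as an assumption rather than a confirmed diagnosis: when a container runs in a remapped user namespace, host IDs that have no mapping appear inside it as the kernel overflow ID (65534 by default, usually named nobody). The sketch below parses the three-column format of /proc/<pid>/uid_map and translates a host UID into the container's view. The map `0 100000 65536` is hypothetical, chosen only to match the 100000 owner seen in the directory listing above, and the function names are illustrative.

```python
OVERFLOW_ID = 65534  # default kernel overflow uid/gid, usually shown as "nobody"


def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(field) for field in fields))
    return entries


def host_to_container_id(host_id, id_map):
    """Translate a host ID to the ID seen inside the namespace."""
    for inside_start, outside_start, length in id_map:
        if outside_start <= host_id < outside_start + length:
            return inside_start + (host_id - outside_start)
    # Unmapped IDs are reported as the overflow ID inside the namespace.
    return OVERFLOW_ID


# Hypothetical mapping for a remapped container (an assumption, not taken from the logs).
remapped = parse_id_map("0 100000 65536\n")

print(host_to_container_id(0, remapped))       # prints 65534: host root is unmapped
print(host_to_container_id(100000, remapped))  # prints 0: this host uid is container root
```

Under this assumption, a directory created by the root-owned Nomad agent would show up as nobody inside the container, which would be consistent with the permission-denied errors on /csi/csi.sock.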

acziryak commented 1 year ago

So I was able to get it to work with an additional parameter:

config {
  userns_mode = "host"
}

For some reason, this fixed the permissions. However, I don't see how it would be possible without that. I'm not sure if the documentation here should reflect that, or if there are alternative agent configurations in which ACLs, permissions, or default user namespaces are set up differently from mine, which would remove the need for this parameter.

For future reference this param is documented here: https://developer.hashicorp.com/nomad/docs/drivers/docker#userns_mode

In there, it does say:

Set to host to use the host's user namespace (effectively disabling user namespacing) when user namespace remapping is enabled on the docker daemon. This field has no effect if the docker daemon does not have user namespace remapping enabled.

Which I was able to verify with:

# grep 'userns' /etc/docker/daemon.json
  "userns-remap": "default"

I would assume that there would be no harm in putting this param in the example, because it purportedly does nothing if userns-remap is not enabled, and is apparently required if userns-remap is enabled. But that's just my suggestion going forward.
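
The manual check above could be scripted. This is a minimal sketch, assuming the daemon config is the JSON file grepped earlier and that the task's Docker `config` block is available as a dictionary. The function names are hypothetical, and the rule it encodes (remapping enabled but no `userns_mode = "host"`) reflects only what was observed in this thread.

```python
import json


def userns_remap_enabled(daemon_json_text):
    """Return True if the Docker daemon config sets a non-empty "userns-remap"."""
    daemon = json.loads(daemon_json_text)
    return bool(daemon.get("userns-remap"))


def missing_host_userns(daemon_json_text, task_config):
    """Flag a task that runs under remapping without userns_mode = "host"."""
    if not userns_remap_enabled(daemon_json_text):
        return False  # per the driver docs, the field has no effect in this case
    return task_config.get("userns_mode") != "host"


# Values taken from this thread: daemon.json enables remapping, the task does not opt out.
daemon_json = '{"userns-remap": "default"}'
task_config = {"image": "quay.io/cephcsi/cephcsi:v3.7.2"}

print(missing_host_userns(daemon_json, task_config))  # prints True

task_config["userns_mode"] = "host"
print(missing_host_userns(daemon_json, task_config))  # prints False
```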

tgross commented 1 year ago

@acziryak the docs for csi_plugin note the following:

Note: Plugins running as node or monolith require root privileges (or CAP_SYS_ADMIN on Linux) to mount volumes on the host. With the Docker task driver, you can use the privileged = true configuration, but no other default task drivers currently have this option.

Mounting volumes is a privileged operation in Linux and can't be done "rootlessly". The allow_caps and allow_privileged settings you have in the plugin config are fine, but you don't have privileged = true (or an equivalent combination of caps) set anywhere in the task configuration block:

          "Config": {
            "image": "quay.io/cephcsi/cephcsi:v3.7.2",
            "volumes": [
              "./local/config.json:/etc/ceph-csi-config/config.json"
            ],
            "mounts": [
              {
                "type": "tmpfs",
                "target": "/tmp/csi/keys",
                "readonly": false,
                "tmpfs_options": {
                  "size": 1000000
                }
              }
            ],
            "args": [
              "--type=rbd",
              "--controllerserver=true",
              "--drivername=rbd.csi.ceph.com",
              "--endpoint=unix://csi/csi.sock",
              "--nodeid=${node.unique.name}",
              "--instanceid=${node.unique.name}-controller",
              "--logtostderr=true",
              "--v=5",
              "--metricsport=${NOMAD_PORT_metrics}"
            ]
          },
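
As a rough illustration of the point about `privileged = true`, the following sketch scans a job in the JSON shape posted above and lists Docker CSI plugin tasks whose `Config` does not set it. The field names (`TaskGroups`, `Tasks`, `Driver`, `Config`, `CSIPluginConfig`, `Type`) come from the job file in this issue; the function name and the trimmed sample job are hypothetical. By default it checks only node and monolith plugins, following the docs note quoted above.

```python
def unprivileged_csi_tasks(job, plugin_types=("node", "monolith")):
    """Return names of Docker CSI plugin tasks that lack privileged = true."""
    names = []
    for group in job.get("TaskGroups") or []:
        for task in group.get("Tasks") or []:
            plugin = task.get("CSIPluginConfig")
            if not plugin or plugin.get("Type") not in plugin_types:
                continue  # not a CSI plugin task of a type being checked
            config = task.get("Config") or {}
            if task.get("Driver") == "docker" and config.get("privileged") is not True:
                names.append(task["Name"])
    return names


# Trimmed copy of the job from this issue, keeping only the fields inspected here.
job = {
    "TaskGroups": [
        {
            "Tasks": [
                {
                    "Name": "ceph-csi-controller-ceph_csi-us-ind-test",
                    "Driver": "docker",
                    "Config": {"image": "quay.io/cephcsi/cephcsi:v3.7.2"},
                    "CSIPluginConfig": {"ID": "ceph-csi", "Type": "controller"},
                }
            ]
        }
    ]
}

print(unprivileged_csi_tasks(job))  # prints []: the task is a controller plugin
print(unprivileged_csi_tasks(job, plugin_types=("controller", "node", "monolith")))
# prints ['ceph-csi-controller-ceph_csi-us-ind-test']
```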