hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad NFS CSI Integration doesn't work with exec / containerd-driver #19165

Closed 116davinder closed 10 months ago

116davinder commented 11 months ago

Nomad version

Output from `nomad version`:

* Nomad server: 1.6.3
* Nomad client: 1.6.3

Operating system and Environment details

Ubuntu 20.04

Issue

NFS CSI Volume Mount Fails

failed to setup alloc: pre-run hook "csi_hook" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'

What I haven't understood so far is why Nomad is asking for '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer' instead of the path I specified in the job spec, /mnt/backups.

Any pointers will be much appreciated.

Reproduction steps

* **Step-1** Create CSI plugin job HCL
```hcl
variable "nfs_server_path" {
  type        = string
  default     = "/backup/nomad-dev-dynamic-volumes"
  description = "this path should exist in the nfs"
}

variable "controller_count" {
  type    = number
  default = 1
}

job "csi-nfs-controller" {

  # remove the constraint
  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "dev-kdc01"
  }

  group "nfs" {
    count = var.controller_count

    task "controller" {
      driver = "containerd-driver"

      csi_plugin {
        id                     = "rocketduck-nfs"
        type                   = "monolith"
        mount_dir              = "/csi"
        health_timeout         = "30s"
        stage_publish_base_dir = "/local/csi"
      }

      config {
        image = "registry.gitlab.com/rocketduck/csi-plugin-nfs:0.7.0"
        args = [
          "--type=monolith",
          "--endpoint=${CSI_ENDPOINT}", # provided by csi_plugin{}
          "--node-id=${attr.unique.hostname}",
          "--nfs-server=${var.nfs_server_address}:${var.nfs_server_path}",
          "--mount-options=rw,mountproto=tcp,nfsvers=3,rsize=1048576,wsize=1048576,namlen=255,soft,retrans=5,relatime,nolock",
          "--allow-nested-volumes",
          "--log-level=DEBUG",
        ]
        privileged   = true
        host_network = true
        cap_add = [
          "CAP_SYS_ADMIN",
          "CAP_CHOWN",
          "CAP_SYS_CHROOT"
        ]
      }
    }
  }
}
```


<img width="1309" alt="image" src="https://github.com/hashicorp/nomad/assets/9644409/c875bc58-97d5-4f76-9119-efa3f84b6cf2">

* **Step-2** Create Volume HCL ---- working
```hcl
id        = "kerberos-backup"
namespace = "default"
name      = "kerberos-backup"
type      = "csi"
plugin_id = "rocketduck-nfs"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

parameters {
  mode = "777"
}

mount_options {
  fs_type = "ext4"
}
```

```console
$ mount -l | grep tmpiumip_43
random-nfs-server:/backup on /tmp/tmpiumip_43 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.79.252.236,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=10.79.252.236)
$ ls -lh /tmp/tmpiumip_43/nomad-dev-dynamic-volumes/
total 2.5K
drwxr-xr-x 2 root root 0 Nov 23 22:11 kerberos-backup
```
Expected Result

The volume is mounted inside the exec driver chroot or the job folder.

Actual Result

failed to setup alloc: pre-run hook "csi_hook" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

{"@level":"trace","@message":"running pre-run hook","@module":"client.alloc_runner","@timestamp":"2023-11-23T22:34:11.986144Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","name":"csi_hook","start":"2023-11-23T22:34:11.986142679Z"}
{"@level":"debug","@message":"found CSI plugin","@module":"client.alloc_runner.runner_hook.csi_hook","@timestamp":"2023-11-23T22:34:11.992235Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","name":"rocketduck-nfs","type":"csi-node"}
{"@level":"info","@message":"finished client unary call","@module":"client.csi_manager.rocketduck-nfs","@timestamp":"2023-11-23T22:34:11.994634Z","duration":1867970,"grpc.code":2,"grpc.method":"NodePublishVolume","grpc.service":"csi.v1.Node"}
{"@level":"trace","@message":"finished pre-run hooks","@module":"client.alloc_runner","@timestamp":"2023-11-23T22:34:11.994945Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","duration":15129701,"end":"2023-11-23T22:34:11.994944943Z"}
{"@level":"error","@message":"prerun failed","@module":"client.alloc_runner","@timestamp":"2023-11-23T22:34:11.995221Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","error":"pre-run hook \"csi_hook\" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2023-11-23T22:34:11.995494Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","failed":true,"msg":"failed to setup alloc: pre-run hook \"csi_hook\" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'","task":"40-setup-kerberos-db","type":"Setup Failure"}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-11-23T22:34:11.996121Z","period":10791125118}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2023-11-23T22:34:11.997186Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","failed":true,"msg":"failed to setup alloc: pre-run hook \"csi_hook\" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'","task":"kdc","type":"Setup Failure"}
{"@level":"trace","@message":"handling task state update","@module":"client.alloc_runner","@timestamp":"2023-11-23T22:34:11.997742Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","done":false}

NFS Controller/Node Logs (if appropriate)

2023-11-23 22:56:11,062:DEBUG:csi:Executing method '/csi.v1.Node/NodeGetCapabilities', with request:

2023-11-23 22:56:11,062:DEBUG:csi:Finished execution of method '/csi.v1.Node/NodeGetCapabilities', with response:

2023-11-23 22:56:20,694:DEBUG:csi:Executing method '/csi.v1.Node/NodePublishVolume', with request:
  volume_id: "kerberos-backup"
  target_path: "/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer"
  volume_capability {
    mount {
      fs_type: "ext4"
    }
    access_mode {
      mode: MULTI_NODE_MULTI_WRITER
    }
  }
  volume_context {
    key: "mode"
    value: "777"
  }

2023-11-23 22:56:20,694:INFO:node:Received mount request for 'kerberos-backup' at '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer'
2023-11-23 22:56:20,696:DEBUG:csi:Finished execution of method '/csi.v1.Node/NodePublishVolume'
2023-11-23 22:56:20,696:ERROR:grpc._server:Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer'
Traceback (most recent call last):
  File "/opt/python/lib/python3.11/site-packages/grpc/_server.py", line 494, in _call_behavior
    response_or_iterator = behavior(argument, context)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.11/site-packages/grpc_interceptor/server.py", line 63, in invoke_intercept_method
    return self.intercept(
           ^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.11/site-packages/csi_plugin_nfs/interceptor.py", line 21, in intercept
    response = method(request, context)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.11/site-packages/csi_plugin_nfs/validators.py", line 45, in inner
    return func(self, request, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.11/site-packages/csi_plugin_nfs/node.py", line 32, in NodePublishVolume
    os.mkdir(request.target_path)
FileNotFoundError: [Errno 2] No such file or directory: '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer'
2023-11-23 22:56:20,704:DEBUG:csi:Executing method '/csi.v1.Node/NodeUnpublishVolume', with request:
  volume_id: "kerberos-backup"
  target_path: "/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer"

2023-11-23 22:56:20,705:INFO:node:Received unmount request for 'kerberos-backup' at '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer'
2023-11-23 22:56:20,705:WARNING:node:Target path '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer' does not exist for 'kerberos-backup'
2023-11-23 22:56:20,705:DEBUG:csi:Finished execution of method '/csi.v1.Node/NodeUnpublishVolume', with response:

Other Information / References

  1. https://github.com/hashicorp/nomad/tree/main/demo/csi/nfs
  2. https://gitlab.com/rocketduck/csi-plugin-nfs
116davinder commented 11 months ago

I have checked the code for rocketduck/csi-plugin-nfs and it is dead simple: it expects a system path in which to create the mount directory, but it receives target_path: "/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer". Since it doesn't create folders recursively, it fails with FileNotFoundError, which is expected from Python's standpoint:

> `os.mkdir(path, mode=0o777, *, dir_fd=None)`
>
> Create a directory named path with numeric mode mode. If the directory already exists, FileExistsError is raised. If a parent directory in the path does not exist, FileNotFoundError is raised.
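
The documented `os.mkdir` behaviour can be reproduced outside the plugin. This is a minimal sketch that uses a temporary directory in place of the real base directory; the path components (`alloc-id`, `rw-mode`) are illustrative placeholders, not values from the plugin. It shows `os.mkdir` failing when parent directories are missing, while `os.makedirs` creates the whole chain. Whether the plugin should use `os.makedirs` is a decision for its maintainers; the sketch only demonstrates the standard-library difference.

```python
import os
import tempfile


def try_mkdir(target_path: str) -> str:
    """Attempt a single-level mkdir, like the call in the traceback, and report the outcome."""
    try:
        os.mkdir(target_path)
        return "created"
    except FileNotFoundError:
        return "missing parent"


with tempfile.TemporaryDirectory() as base:
    # Hypothetical stand-in for <base_dir>/per-alloc/<alloc>/<volume>/<mode>
    target = os.path.join(base, "per-alloc", "alloc-id", "kerberos-backup", "rw-mode")

    first = try_mkdir(target)           # the parent directories do not exist yet
    os.makedirs(target, exist_ok=True)  # creates every missing parent directory
    created = os.path.isdir(target)

    print(first, created)
```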

116davinder commented 11 months ago

I managed to resolve this error by setting stage_publish_base_dir = "/tmp/csi". For some reason containerd driver doesn't allow creating folders at /local/csi and that's why csi-plugin fails.
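
To see how `stage_publish_base_dir` feeds into the path the plugin receives, here is a rough sketch that rebuilds the `target_path` layout seen in the NodePublishVolume logs. The layout is inferred only from the logs in this issue, not from Nomad's source. The function name `publish_target_path` is hypothetical, and the `ro` prefix for read-only mounts is an assumption, since only `rw` appears in the logs.

```python
import posixpath


def publish_target_path(base_dir: str, alloc_id: str, volume_id: str,
                        read_only: bool, attachment_mode: str,
                        access_mode: str) -> str:
    """Rebuild the target_path layout observed in the logs (inferred, not authoritative)."""
    # Only "rw" is visible in the logs; "ro" for read-only mounts is an assumption.
    usage = "-".join(["ro" if read_only else "rw", attachment_mode, access_mode])
    return posixpath.join(base_dir, "per-alloc", alloc_id, volume_id, usage)


alloc = "c675ac86-215a-7487-1522-a1416112f087"

# Base directory from the original job spec, which failed under containerd-driver
failing = publish_target_path("/local/csi", alloc, "kerberos-backup",
                              False, "file-system", "multi-node-multi-writer")

# Base directory after the change described above
working = publish_target_path("/tmp/csi", alloc, "kerberos-backup",
                              False, "file-system", "multi-node-multi-writer")

print(failing)
print(working)
```

The first printed path matches the `target_path` in the plugin logs. That is why the plugin is asked for a path under the base directory rather than the destination named in the job spec.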

Flow of mount process

  1. The volume is created (by the CSI controller only).
  2. When the job starts, csi-node mounts the NFS path inside its own container at stage_publish_base_dir.
  3. csi-node then exposes the mount to the task in a way that depends on the driver used by the job. Example for the docker driver:
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/opt/nomad/data/client/csi/monolith/rocketduck-nfs/per-alloc/77c4d1d8-0fb7-83e0-f35a-5c2cf35c35d7/kerberos-backup/rw-file-system-multi-node-multi-writer",
                "Destination": "/alloc/backups",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ]

Example for the exec driver: I don't know yet how it binds the NFS path from the container, because I can't get the NFS mount working yet.
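
For the docker driver, the bind mounts can also be checked programmatically. This is a small sketch that assumes the `Mounts` array from `docker inspect` has been saved as JSON. The embedded data is the example from this comment, and `csi_bind_mounts` is a hypothetical helper name. Matching on `/csi/` in the source path is a heuristic based on that single example.

```python
import json

# The "Mounts" array copied from the docker inspect output in this comment
INSPECT_MOUNTS = """
[
  {
    "Type": "bind",
    "Source": "/opt/nomad/data/client/csi/monolith/rocketduck-nfs/per-alloc/77c4d1d8-0fb7-83e0-f35a-5c2cf35c35d7/kerberos-backup/rw-file-system-multi-node-multi-writer",
    "Destination": "/alloc/backups",
    "Mode": "",
    "RW": true,
    "Propagation": "rprivate"
  }
]
"""


def csi_bind_mounts(mounts: list[dict]) -> list[tuple[str, str]]:
    """Return (source, destination) pairs for bind mounts whose source contains a /csi/ directory."""
    return [
        (m["Source"], m["Destination"])
        for m in mounts
        if m.get("Type") == "bind" and "/csi/" in m.get("Source", "")
    ]


pairs = csi_bind_mounts(json.loads(INSPECT_MOUNTS))
for source, destination in pairs:
    print(f"{destination} <- {source}")
```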

116davinder commented 11 months ago

As of now, when I run the CSI plugin with containerd-driver, it mounts the NFS path within the container but doesn't expose the NFS mount to the host system, whereas the docker driver is able to mount the NFS path both inside and outside the container.

lgfa29 commented 11 months ago

Hi @116davinder 👋

Thanks for the report and the detailed info. Just so I understand the status here, is this a fair summary of things?

  1. You were initially unable to run the rocketduck/csi-plugin-nfs CSI plugin as a containerd task. Changing the value for csi_plugin.stage_publish_base_dir to a path outside the local directory fixed the problem.
  2. A task running the exec driver is not able to mount a CSI volume.

> For some reason containerd driver doesn't allow creating folders at /local/csi and that's why csi-plugin fails.

Is https://github.com/Roblox/nomad-driver-containerd the plugin you're using? If so, that's a community plugin, and I don't know it well enough to provide guidance. Perhaps you could open an issue in that repo? Another thing to try would be to use the NOMAD_TASK_DIR environment variable instead of hardcoding /local.

116davinder commented 10 months ago

Since I use containerd and exec a lot in my stack, I am blocked by these missing features / issues.

Early Notes for Docker Driver with CSI NFS Plugin

  1. I can run csi plugins fine
  2. I can see NFS mounts being exposed to other drivers like exec/docker.
  3. The missing piece is https://github.com/hashicorp/nomad/issues/15540, which I saw but ignored since I was focused on getting containerd working. Now I will try the docker driver one more time.

Lastly, I do agree that the containerd-related issue should be moved to the Roblox/nomad-driver-containerd repo.

116davinder commented 10 months ago

I am closing this issue, since only the docker driver is supported and working with CSI NFS plugins.