hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

sysbatch/system type jobs fail to be scheduled when using a multi-node-multi-writer CSI volume #15094

Closed mmeier86 closed 1 year ago

mmeier86 commented 1 year ago

Nomad version

Nomad v1.4.1 (2aa7e66bdb526e25f59883952d74dad7ea9a014e)

Operating system and Environment details

Underlying systems are either Debian 11 or Ubuntu 22.04 Linux, with up-to-date patches.

The Ceph CSI plugin is used in version 3.6.2.

Issue

When adding a CSI volume stanza to a sysbatch job, Nomad consistently says "missing CSI Volume" when trying to run the job.

The same behavior appears when setting the type to system, so the system scheduling seems to be the problem: otherwise identical batch or service jobs, with the type being the only difference, work without a problem.

Reproduction steps

First, create a Ceph cluster and a Nomad cluster, and configure Ceph CSI for the Nomad cluster. Then define a volume as follows:

id = "vol-backup-s3"
name = "vol-backup-s3"
type = "csi"
plugin_id = "ceph-csi-cephfs"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

secrets {
  adminID  = "<ceph client ID here>"
  adminKey = "<ceph client key here>"
}

parameters {
  clusterID = "<ceph cluster id here>"
  fsName = "<name of the Ceph FS here>"
  pool = "<name of the pool here>"
}

Create the volume with nomad volume create.
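
For example, assuming the spec above is saved as vol-backup-s3.hcl (the filename is illustrative):

$ nomad volume create vol-backup-s3.hcl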

Then create a simple job like this:

job "csi-test" {
  datacenters = ["homenet"]

  type = "sysbatch"
  priority = 60

  constraint {
    attribute = "${node.class}"
    value     = "internal"
  }

  group "csi-test" {

    volume "vol-backup-s3" {
      type            = "csi"
      source          = "vol-backup-s3"
      attachment_mode = "file-system"
      access_mode     = "multi-node-multi-writer"
    }

    task "csi-test" {
      driver = "docker"

      config {
        image   = "alpine:3.16.0"
        command = "sleep"
        args    = ["3600"]
      }

      volume_mount {
        volume      = "vol-backup-s3"
        destination = "/hn-data/buckets"
      }

      resources {
        cpu = 200
        memory = 200
        memory_max = 1024
      }
    }
  }
}

Plan the job with nomad job plan and observe the error message:

Job: "csi-test"
Task Group: "csi-test" (18446744073709551615 create, 1 create/destroy update)
  Task: "csi-test"

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "csi-test" (failed to place 1 allocation):
    * Constraint "missing CSI Volume vol-backup-s3": 3 nodes excluded by filter

Next, purge the job with nomad job stop -purge csi-test

Then, switch the type to batch (the only change to the job file), rerun nomad job plan, and observe that it works without a problem:

+ Job: "csi-test"
+ Task Group: "csi-test" (1 create/destroy update)
  + Task: "csi-test" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

The same works with the service type, and I have verified that the volume really is mounted in those cases. Interestingly, the same error as with sysbatch also appears when setting the system type.

Expected Result

The multi-node-multi-writer CSI volume gets mounted into all allocations on all nodes the job runs on. But I might just be misunderstanding the meaning of the "multi-node" part of the access definition here?

Actual Result

Nomad does not schedule the job, claiming a missing CSI volume.

Job file (if appropriate)

See above.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

tgross commented 1 year ago

Hi @mmeier86! The error you're getting is bubbling up from this check in nomad/feasible.go, which is just querying for the volume by namespace and ID. I dug a little and found that the "generic stack" used for checking node feasibility for service and batch jobs has these lines at stack.go#L110-L111:

    s.taskGroupCSIVolumes.SetNamespace(job.Namespace)
    s.taskGroupCSIVolumes.SetJobID(job.ID)

And then it calls SetVolumes at stack.go#L148

Whereas the "system stack" calls SetVolumes at stack.go#L323 but never calls taskGroupCSIVolumes.SetNamespace! That should be happening in the SetJob method.

So I suspect what's happening here doesn't have anything to do with multi-node (or you'd be having a different problem like https://github.com/hashicorp/nomad/issues/15197) but is just a bug that shows up when the volume is in a non-default namespace.
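
To illustrate the mechanism with a self-contained toy (illustrative Go, not Nomad's actual code): volumes are stored per namespace, so a checker whose namespace is never set looks the volume up under the zero-value namespace and concludes it is missing:

    package main

    import "fmt"

    // Toy model of the bug: CSI volumes are keyed by (namespace, id), so a
    // feasibility checker that never had SetNamespace called queries with
    // the empty string and misses the volume.
    type volKey struct{ namespace, id string }

    func main() {
        volumes := map[volKey]bool{
            {namespace: "prod", id: "csi-volume-nfs"}: true,
        }

        // Generic stack: the namespace was set from the job, so the lookup hits.
        fmt.Println(volumes[volKey{namespace: "prod", id: "csi-volume-nfs"}]) // true

        // System stack before the fix: the namespace was never set, so the
        // lookup uses "" and the volume appears to be missing.
        fmt.Println(volumes[volKey{namespace: "", id: "csi-volume-nfs"}]) // false
    }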

I just stood up my local test rig with democratic-csi and was able to confirm this. If I switch my service job in the "prod" namespace to be a system job, I get the following when I plan the job:

$ nomad job plan ./jobs/csi/httpd.nomad
+ Job: "httpd"
+ Task Group: "web"
  + Task: "http" (forces create)

Scheduler dry-run:
- WARNING: Failed to place allocations on all nodes.
  Task Group "web" (failed to place 1 allocation):
    * Class "vagrant": 3 nodes excluded by filter
    * Constraint "missing CSI Volume csi-volume-nfs": 3 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 ./jobs/csi/httpd.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

This is a fairly straightforward fix. I'll have a PR up shortly.
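
For reference, a rough sketch of the shape of that fix, mirroring the generic stack's calls quoted above (illustrative only, not the actual diff in the PR):

    // Sketch only: SystemStack.SetJob should configure the CSI volume
    // checker with the job's namespace as well as its ID, as the generic
    // stack already does. Names are taken from the lines quoted above;
    // the surrounding setup code is elided.
    func (s *SystemStack) SetJob(job *structs.Job) {
        // ... existing checker setup ...
        s.taskGroupCSIVolumes.SetNamespace(job.Namespace)
        s.taskGroupCSIVolumes.SetJobID(job.ID)
    }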

tgross commented 1 year ago

The fix in https://github.com/hashicorp/nomad/pull/15262 has been merged and will ship in the next regular version of Nomad (likely 1.4.4), with backports to 1.3.x and 1.2.x. Thanks for opening this issue @mmeier86!

mmeier86 commented 1 year ago

Hello @tgross, thanks a bunch for the quick reaction and fix. Looking forward to testing it out once it's released, as the intended sysbatch job was meant to be part of my backup setup. :-)

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.