hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Not showing warning when using a taken GPU #18364

Open ruspaul013 opened 1 year ago

ruspaul013 commented 1 year ago

Nomad version

nomad server: Nomad v1.6.1 + patch with #18141
nomad client: Nomad v1.6.1 + patch with #18141

Operating system and Environment details

Plugin "nomad-driver-podman" v0.5.1 Plugin "nomad-device-nvidia" v1.0.0

Issue

I created a patch with the solution provided in #18141 to test on our cluster. While testing, I discovered that if one or more GPUs are already used by other jobs, Nomad will not give a warning about this and will place the job without using those GPUs.

Reproduction steps

  1. Create a job file that uses multiple GPUs and sets a constraint based on their UUIDs.
  2. Create a second job file that uses one or more of the GPUs already used by the first job.

Expected Result

Nomad throws a warning like WARNING: Failed to place all allocations.

Actual Result

Nomad places the job on the client.

Job file (if appropriate)

Job file 1:

job "test-2070-2" {
  datacenters = ["dc1"]
  group "test-2070-2" {

    restart {
        attempts=0
    }
    count=1
    task "test-2070-2" {
        driver = "podman"
        config {
            image = "image_with_gpu"
        }
        resources {
            cpu = 2650
            memory = 8192
            device "nvidia/gpu" {
                count = 2

                constraint {
                    attribute = "${device.model}"
                    value     = "NVIDIA GeForce RTX 2070 SUPER"
                }

                constraint {
                    attribute = "${device.ids}"
                    operator  = "set_contains"
                    value     = "GPU-9b5df054-6f08-f35c-9c4c-5709b19efea5,GPU-1846fc5f-8c71-bfab-00e1-9c190dd88ed7"
                }

            }
        }
    }
  }
}

Job file 2:

job "test-2070-2" {
  datacenters = ["dc1"]
  group "test-2070-2" {

    restart {
        attempts=0
    }
    count=1
    task "test-2070-2" {
        driver = "podman"
        config {
            image = "image_with_gpu"
        }
        resources {
            cpu = 2650
            memory = 8192
            device "nvidia/gpu" {
                count = 1

                constraint {
                    attribute = "${device.model}"
                    value     = "NVIDIA GeForce RTX 2070 SUPER"
                }

                constraint {
                    attribute = "${device.ids}"
                    operator  = "set_contains"
                    value     = "GPU-9b5df054-6f08-f35c-9c4c-5709b19efea5"
                }

            }
        }
    }
  }
}
tgross commented 1 year ago

@ruspaul013 I'm surprised that job spec doesn't simply return a validation error, but the constraint block belongs under job, group, or task, not under resources.device. Can you verify whether this still happens after that's corrected?

ruspaul013 commented 12 months ago

Hello @tgross, thank you for the suggestion. I tried it, but now I get an error even if I do not use a constraint block for the IDs. I cannot place a simple job using a GPU.

resources {
  cpu    = 3200*4
  memory = 8192

  device "nvidia/gpu" {
    count = 1
  }
}

constraint {
  attribute = "${device.model}"
  value     = "NVIDIA GeForce RTX 4090"
}

and I get this warning:

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "paulr_gpu_test" (failed to place 1 allocation):
    * Constraint "${device.model} = NVIDIA GeForce RTX 4090": 2 nodes excluded by filter

but the constraint block belongs under job, group, or task, not under resources.device

I know that the constraint block is not supposed to belong there, but the example in the device block documentation says otherwise.

lgfa29 commented 11 months ago

Yeah, that's something I often forget, but you can set constraint and affinity at the device level 😅 https://developer.hashicorp.com/nomad/docs/job-specification/device#constraint
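For reference, a minimal sketch of a device-level constraint and affinity, adapted from that docs page (the ${device.attr.memory} attribute and the 2 GiB / 4 GiB / weight values come from the documentation example, not from this issue's job files):

device "nvidia/gpu" {
  count = 1

  # Device-level constraint: only consider GPUs with at least 2 GiB of memory.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "2 GiB"
  }

  # Device-level affinity: prefer GPUs with at least 4 GiB of memory.
  affinity {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "4 GiB"
    weight    = 75
  }
}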

But setting the constraint only filters out nodes during scheduling. Taking a look at the code, I think the problem is that AllocsFit doesn't take current device usage into account.

Devices are not considered comparable resources, so they're not filtered here: https://github.com/hashicorp/nomad/blob/e3c8700ded891702b7f94636033109efd8f71c3a/nomad/structs/funcs.go#L172-L178

And this part of the code only looks for device oversubscription among the allocs being scheduled, so it ignores the ones already running on the node: https://github.com/hashicorp/nomad/blob/e3c8700ded891702b7f94636033109efd8f71c3a/nomad/structs/funcs.go#L201-L207