hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Restrict GPUs access for env NVIDIA_VISIBLE_DEVICES #16090

Open ruspaul013 opened 1 year ago

ruspaul013 commented 1 year ago

Nomad version

Nomad v1.4.3

Operating system and Environment details

Plugin "nomad-device-nvidia" v1.0.0 Plugin "nomad-driver-podman" v0.4.1

Issue

A task can use GPUs if it requests them with a device block inside resources (see the snippet below). But if a user doesn't specify a device block and instead sets NVIDIA_VISIBLE_DEVICES in the env block, the task will still have access to the GPUs. Is there any way to prevent this from happening?
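
For reference, this is how a task requests GPUs through the device block; a minimal sketch, using the "nvidia/gpu" device name fingerprinted by the nomad-device-nvidia plugin:

resources {
  device "nvidia/gpu" {
    # Nomad reserves this device for the task when scheduling.
    count = 1
  }
}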

Reproduction steps

Run a job without a device block but with the env var NVIDIA_VISIBLE_DEVICES set.

job "job_name" {
  datacenters = ["dc1"]
  type = "batch"
  group "group_name" {
    restart {
        attempts=0
    }
    count=1
    task "task_name" {
        driver = "podman"
        config {
            image = "custom_image"
            command= "nvidia-smi"
        }
        env {
          NVIDIA_VISIBLE_DEVICES="all"
        }
    }
  }
}

Expected Result

A user that doesn't specify a device block should not have access to GPUs.

Actual Result

The user has access to GPUs.

Thank you!

jrasell commented 1 year ago

Hi @ruspaul013 and thanks for raising this issue.

I understand the problem here, but I am not sure it is within the remit of Nomad or the device drivers to perform conditional logic such as blocking a job from running because it includes a particular env var while another job specification block is not present or configured. Scheduling this job on a cluster with heterogeneous clients is likely to result in placement on a client that doesn't have GPUs available, which is part of the rationale for the device drivers.

The main question that comes to mind is why this env block can't be removed if the job should not have access to GPUs.

ruspaul013 commented 1 year ago

Hello @jrasell , thanks for your reply.

Scheduling this job on a cluster with heterogeneous clients is likely to result in placement on a client that doesn't have GPUs available, which is part of the rationale for the device drivers.

Unfortunately we don't have heterogeneous clients. All of our clients have GPUs.

The main question that comes to mind is why this env block can't be removed if the job should not have access to GPUs.

The env block can be removed, but we thought there might be a way to restrict access to some env variables.

The problem we encountered is that Nomad reserves GPUs only if the job has a device block, but a job that sets NVIDIA_VISIBLE_DEVICES gains access to the same GPUs that Nomad has reserved for another job, if these two jobs run at the same time.
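
To make the conflict concrete, here is a minimal sketch of the reserving side (task and image names are placeholders): Nomad accounts for this device when scheduling, while a concurrent job that only sets NVIDIA_VISIBLE_DEVICES bypasses that accounting entirely.

# Sketch: this task's GPU is reserved and tracked by the Nomad scheduler.
task "task_with_gpu" {
  driver = "podman"

  config {
    image = "custom_image"
  }

  resources {
    # Reserved via the nomad-device-nvidia plugin.
    device "nvidia/gpu" {
      count = 1
    }
  }
}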

jrasell commented 1 year ago

Hi @ruspaul013, that all makes sense. I am not sure what we can exactly do, but I'll keep this issue open.

tgross commented 1 year ago

we thought there might be a way to restrict access to some env variables.

Typically this kind of jobspec policy enforcement is handled by Sentinel, in the Nomad Enterprise product.
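
For illustration, a minimal Sentinel policy sketch that would deny any task setting NVIDIA_VISIBLE_DEVICES. It assumes the job object shape Nomad exposes to Sentinel policies (job.task_groups and tg.tasks, as in the Nomad Sentinel docs); the task.env map lookup is an assumption, not a confirmed field name.

# Sketch only: deny any task whose env block sets NVIDIA_VISIBLE_DEVICES.
# Assumes task.env is exposed to the policy as a map of strings.
no_nvidia_env = rule {
    all job.task_groups as tg {
        all tg.tasks as task {
            "NVIDIA_VISIBLE_DEVICES" not in task.env
        }
    }
}

main = rule { no_nvidia_env }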