nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Ensure that we have adequate quotas and safeguards in place to control access to the gpus #526

Open dystewart opened 3 months ago

dystewart commented 3 months ago

We need to make sure that only PIs can grant GPU access to select project users.

For instance, an arbitrary user in a namespace should not be able to allocate workloads to GPUs (purposefully or accidentally) unless explicit permission to do so has been granted.

Moreover, if a project namespace does not require GPU access, we need a ResourceQuota or some other mechanism to prevent its workloads from being scheduled onto GPU nodes.
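
A minimal sketch of what such a quota might look like, assuming GPUs are exposed as the nvidia.com/gpu extended resource (the quota and namespace names below are placeholders). Setting the request quota to 0 would reject, at admission time, any pod in that namespace that explicitly requests a GPU:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota          # placeholder name
  namespace: example-proj  # placeholder; applied per project namespace
spec:
  hard:
    # Extended resources can only be quota'd via the requests. prefix
    requests.nvidia.com/gpu: "0"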

Related https://github.com/nerc-project/operations/issues/306

dystewart commented 2 months ago

Using an arbitrary project in prod that was allocated via ColdFront (tajproj-0f4ad1), of which I am a member, I deployed the following workload:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-nogpu
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
  affinity:
    nodeAffinity:
      # Soft constraint: prefer nodes that are NOT labeled as GPU nodes
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: nvidia.com/gpu.present
            operator: NotIn
            values:
            - "true"
        weight: 1

This deploys the workload with a preference to avoid nodes that carry the nvidia.com/gpu.present label. IMPORTANT: this deployment is used to test GPU access/functionality, so it should fail if we aren't able to hit a GPU node. The workload does fail:

Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

This is the case even though we are only declaring preferredDuringSchedulingIgnoredDuringExecution, as opposed to requiring that we schedule on a non-GPU node with requiredDuringSchedulingIgnoredDuringExecution.
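
For comparison, a sketch of the hard-requirement form (the required variant takes nodeSelectorTerms rather than a weighted preference; this fragment would replace the affinity block above):

affinity:
  nodeAffinity:
    # Hard constraint: the pod will not schedule onto a labeled GPU node at all
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.present
          operator: NotIn
          values:
          - "true"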

Alternatively, running the workload with the opposite preference (preferring GPU-labeled nodes):

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
  affinity:
    nodeAffinity:
      # Soft constraint: prefer nodes that ARE labeled as GPU nodes
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: nvidia.com/gpu.present
            operator: In
            values:
            - "true"
        weight: 1

This run succeeds, which shows we can access the underlying GPU resource just by being scheduled onto a GPU node:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

dystewart commented 2 months ago

I think the easiest solution here is to set a ResourceQuota per namespace that grants or limits access to GPUs. Testing this now.
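
One quick way to exercise such a quota (a sketch; the pod name is a placeholder) is to submit a pod that explicitly requests the nvidia.com/gpu extended resource. In a namespace where requests.nvidia.com/gpu is capped at 0, the quota admission check should reject the pod at creation time:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-gpu-request  # placeholder name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        # For extended resources, requests default to (and must equal) limits
        nvidia.com/gpu: 1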