nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

stopped workbenches still allocate/block GPU? #665

Open schwesig opened 3 months ago

schwesig commented 3 months ago

coldfront - stopped workbenches still allocate/block GPU?

or: why 4 GPUs in use, 3 on workbenches (A100), 1 on a server (V100)?

after stopping all workbenches, there's still one GPU (V100) used by a "server"


- human-eval-benchmark - 0 GPU - 0
- trainingexample - 1 GPU - 1
- traininggpt - 1 GPU - 0 (stopped)
- trainingtest - 2 GPU - 2
- --server-- - ? GPU - 1

schwesig commented 3 months ago

after stopping all workbenches, the server GPU is still running

GPU = 0 in the table is the GPU ID on the 4x board, not the number of GPUs

schwesig commented 3 months ago

the project has a quota of 4 GPUs:

- 4 GPU in kruizeoptimization-143c8e
- 0 GPU in kruizeoptimization-5ad5bc

the 4 workbenches have the following Deployment Sizes:

- human-eval-benchmark - GPU 0
- trainingexample - GPU 1
- traininggpt - GPU 1
- trainingtest - GPU 2

This means that all four workbenches should be able to run at the same time (0+1+1+2 = 4 GPUs). But it seems that the server (??) is also requesting a GPU (V100), and that this request counts against the project quota. When trying to start the third workbench with a GPU, it shows a "Failed to create pod" error due to an exceeded quota, no matter which workbench I start last.
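The arithmetic above can be sketched as a minimal check (the GPU counts are taken from the Deployment Sizes listed in this issue; the variable names are just illustrative):

```python
# Per-workbench GPU requests, as listed in the project.
workbench_gpus = {
    "human-eval-benchmark": 0,
    "trainingexample": 1,
    "traininggpt": 1,
    "trainingtest": 2,
}
server_gpus = 1  # the leftover "server" holding the V100
quota = 4        # project GPU quota

workbench_total = sum(workbench_gpus.values())
print(workbench_total)                        # 4: the workbenches alone fit the quota
print(workbench_total + server_gpus > quota)  # True: with the server, the quota is exceeded
```

So the workbenches by themselves are within budget; it is only the extra server request that pushes the project over the limit.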

example: starting traininggpt as last workbench

Failed to create pod

```
FailedCreate
2024-08-01T16:42:05Z [Warning] create Pod traininggpt-0 in StatefulSet traininggpt failed error: pods "traininggpt-0" is forbidden: exceeded quota: kruizeoptimization-143c8e-project, requested: limits.cpu=6100m,limits.memory=24640Mi,requests.nvidia.com/gpu=1, used: limits.cpu=31200m,limits.memory=109184Mi,requests.nvidia.com/gpu=4, limited: limits.cpu=32,limits.memory=128Gi,requests.nvidia.com/gpu=4
2024-08-01T16:33:14Z [Normal] delete Pod traininggpt-0 in StatefulSet traininggpt successful
2024-08-01T16:32:56Z [Normal] create Pod traininggpt-0 in StatefulSet traininggpt successful
```
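The numbers in that error can be checked mechanically. A sketch that pulls the GPU figures out of a quota-forbidden message shaped like the one above (the parsing is an assumption about the message format, not an official client API):

```python
import re

# The quota section of the error message quoted above.
message = (
    'exceeded quota: kruizeoptimization-143c8e-project, '
    'requested: limits.cpu=6100m,limits.memory=24640Mi,requests.nvidia.com/gpu=1, '
    'used: limits.cpu=31200m,limits.memory=109184Mi,requests.nvidia.com/gpu=4, '
    'limited: limits.cpu=32,limits.memory=128Gi,requests.nvidia.com/gpu=4'
)

# Grab the nvidia.com/gpu value from each of the three sections.
m = re.search(
    r'requested:.*?requests\.nvidia\.com/gpu=(\d+).*?'
    r'used:.*?requests\.nvidia\.com/gpu=(\d+).*?'
    r'limited:.*?requests\.nvidia\.com/gpu=(\d+)',
    message,
)
requested, used, limited = map(int, m.groups())
print(requested, used, limited)    # 1 4 4
print(used + requested > limited)  # True: the scheduler must reject the pod
```

With `used` already at the limit of 4, any new pod requesting a GPU is forbidden, which matches the behavior that it fails regardless of which workbench is started last.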


schwesig commented 1 month ago

notes: the server seems to have been a container started from within the project that was then not stopped from within the project. coldfront is not aware of such user-created resources. following the manual deletion tutorial normally resolves this; if it is not done fully, or at all, leftovers like this remain.

  1. still needs verification
  2. how to enforce the deletion process
  3. how to automate this? (labels?)