After stopping all workbenches, the server GPU is still in use.
Note: "GPU 0" in the table is the GPU ID on the 4x board, not the amount of GPUs.
The project has a quota of 4 GPUs:

- 4 GPUs in kruizeoptimization-143c8e
- 0 GPUs in kruizeoptimization-5ad5bc

The 4 workbenches have the following Deployment Sizes:

- human-eval-benchmark - GPU 0
- trainingexample - GPU 1
- traininggpt - GPU 1
- trainingtest - GPU 2
This means that all four workbenches should be able to run at the same time (1+1+2 = 4 GPUs).
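A quick way to double-check the configured quota and current usage per project (a sketch using standard `oc` commands; the project names are taken from above):

```
# Show each project's ResourceQuota objects with their hard limits and current usage
oc describe resourcequota -n kruizeoptimization-143c8e
oc describe resourcequota -n kruizeoptimization-5ad5bc
```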
But it seems that the server (??) is also requesting a GPU (a V100), and that request seems to count against the quota on top of the workbenches' Deployment Sizes.
When trying to start the third workbench with a GPU, it shows a "Failed to create pod" error due to an exceeded quota, no matter which workbench I start last. This fits the server's GPU counting against the quota: the error below shows requests.nvidia.com/gpu already used at 4, even though the two running workbenches only account for 3.
Example: starting traininggpt as the last workbench:
```
Failed to create pod
FailedCreate
create Pod traininggpt-0 in StatefulSet traininggpt failed error: pods "traininggpt-0" is forbidden:
exceeded quota: kruizeoptimization-143c8e-project,
requested: limits.cpu=6100m,limits.memory=24640Mi,requests.nvidia.com/gpu=1,
used: limits.cpu=31200m,limits.memory=109184Mi,requests.nvidia.com/gpu=4,
limited: limits.cpu=32,limits.memory=128Gi,requests.nvidia.com/gpu=4

2024-08-01T16:42:05Z [Warning] create Pod traininggpt-0 in StatefulSet traininggpt failed error: pods "traininggpt-0" is forbidden: exceeded quota: kruizeoptimization-143c8e-project, requested: limits.cpu=6100m,limits.memory=24640Mi,requests.nvidia.com/gpu=1, used: limits.cpu=31200m,limits.memory=109184Mi,requests.nvidia.com/gpu=4, limited: limits.cpu=32,limits.memory=128Gi,requests.nvidia.com/gpu=4
2024-08-01T16:33:14Z [Normal] delete Pod traininggpt-0 in StatefulSet traininggpt successful
2024-08-01T16:32:56Z [Normal] create Pod traininggpt-0 in StatefulSet traininggpt successful
```
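For reference, the `limited:` values in the events above come from the project's ResourceQuota; it can be inspected directly (a sketch; the quota name is taken from the error message):

```
# The hard limits should match the error:
# limits.cpu=32, limits.memory=128Gi, requests.nvidia.com/gpu=4
oc get resourcequota kruizeoptimization-143c8e-project -n kruizeoptimization-143c8e -o yaml
```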
Notes:

- The "server" seemed to be a container started from within the project, and it was not stopped from within the project.
- coldfront is not aware of these created resources.
- The manual deletion tutorial normally solves this, but if it is not done fully (or at all), leftovers like this seem to remain (see the sketch after the table below).
- coldfront: do stopped workbenches still allocate/block a GPU?
- Or why 4 GPUs: 3 on workbenches (A100), 1 on the server (V100)?
- After stopping all workbenches, there is still one GPU (V100) used by a "server":
| Name | Deployment Size | GPUs in use |
| --- | --- | --- |
| human-eval-benchmark | 0 GPU | 0 |
| trainingexample | 1 GPU | 1 |
| traininggpt | 1 GPU | 0 (stopped) |
| trainingtest | 2 GPU | 2 |
| --server-- | ? GPU | 1 |
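To spot which pod is still holding the V100, listing per-pod GPU requests should work (a sketch; `<pod-name>` is a placeholder, and the `\.` escaping is the usual jsonpath convention for the `nvidia.com/gpu` resource key):

```
# List every pod in the project with its GPU request, to find the leftover "server"
oc get pods -n kruizeoptimization-143c8e \
  -o custom-columns='NAME:.metadata.name,GPU:.spec.containers[*].resources.requests.nvidia\.com/gpu'

# Before deleting a leftover pod, check its owner, so a controller
# (e.g. a StatefulSet) does not simply recreate it
oc get pod <pod-name> -n kruizeoptimization-143c8e -o jsonpath='{.metadata.ownerReferences[*].kind}'
```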