Closed by dgrove-oss 3 months ago
At first glance at the logs, it looks like the AppWrapper+RayCluster workload is pending because a GPU isn't available, but I haven't analyzed it fully enough to tell whether that is just a transitory problem that would resolve once the concurrently running test that grabbed the GPU first finishes.
10T14:54:39.771940512Z","logger":"scheduler","caller":"scheduler/scheduler.go:263","msg":"Workload requires preemption, but there are no candidate workloads allowed for preemption","workload":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"clusterQueue":{"name":"e2e-cluster-queue"},"preemption":{"reclaimWithinCohort":"Never","borrowWithinCohort":{"policy":"Never"},"withinClusterQueue":"Never"}}
{"level":"Level(-2)","ts":"2024-07-10T14:54:39.771986932Z","logger":"scheduler","caller":"scheduler/scheduler.go:619","msg":"Workload re-queued","workload":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"clusterQueue":{"name":"e2e-cluster-queue"},"queue":{"name":"lq-vbsxg","namespace":"test-ns-vpd2b"},"requeueReason":"","added":true}
{"level":"Level(-2)","ts":"2024-07-10T14:54:39.772046413Z","logger":"cluster-queue-reconciler","caller":"core/clusterqueue_controller.go:330","msg":"Got generic event","obj":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"kind":"/, Kind="}
{"level":"debug","ts":"2024-07-10T14:54:39.780801696Z","logger":"events","caller":"recorder/recorder.go:104","msg":"couldn't assign flavors to pod set raycluster-0-1: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 1 more needed","type":"Normal","object":{"kind":"Workload","namespace":"test-ns-vpd2b","name":"appwrapper-raycluster-af4b6","uid":"fd964893-f987-47b1-9858-7001fa2c6a89","apiVersion":"kueue.x-k8s.io/v1beta1","resourceVersion":"3691"},"reason":"Pending"}
I have already seen this intermittently; it seems to be caused by the AppWrapper workload not being admitted:
2024-07-10 14:54:39 | 2024-07-10 14:54:39 | appwrapper-raycluster-af4b6.17e0e1c1b799975a | - | Normal | Pending | couldn't assign flavors to pod set raycluster-0-1: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 1 more needed |
2024-07-10 14:54:39 | 2024-07-10 14:54:39 | raycluster.17e0e1c1b7086150 | - | Normal | CreatedWorkload | Created Workload: test-ns-vpd2b/appwrapper-raycluster-af4b6 |
I haven't been able to reproduce it yet, though. I will try to gather some more context on that failure (LocalQueue status, Workloads, and such).
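For instance, a hedged sketch of what that could look like in the e2e suite: a helper (dumpKueueState is a hypothetical name, not existing code in the suite) that uses the controller-runtime client to print LocalQueue counters and Workload conditions for the test namespace whenever a run fails.

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// dumpKueueState lists LocalQueues and Workloads in the given namespace and prints
// their status so the admission state ends up in the test logs on failure.
func dumpKueueState(ctx context.Context, c client.Client, namespace string) {
	var lqs kueue.LocalQueueList
	if err := c.List(ctx, &lqs, client.InNamespace(namespace)); err == nil {
		for _, lq := range lqs.Items {
			fmt.Printf("LocalQueue %s: pending=%d admitted=%d\n",
				lq.Name, lq.Status.PendingWorkloads, lq.Status.AdmittedWorkloads)
		}
	}

	var wls kueue.WorkloadList
	if err := c.List(ctx, &wls, client.InNamespace(namespace)); err == nil {
		for _, wl := range wls.Items {
			for _, cond := range wl.Status.Conditions {
				fmt.Printf("Workload %s: %s=%s: %s\n", wl.Name, cond.Type, cond.Status, cond.Message)
			}
		}
	}
}

func main() {
	scheme := runtime.NewScheme()
	_ = kueue.AddToScheme(scheme)
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}
	// "test-ns-vpd2b" is just the transient namespace from this particular run.
	dumpKueueState(context.Background(), c, "test-ns-vpd2b")
}
```

Hooked into something like an AfterEach that fires on failure, that would capture why a Workload such as appwrapper-raycluster-af4b6 was left pending.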
Going to hold because back-leveling Go to 1.22.2 is more trouble than it is worth.
e2e test instability is addressed in https://github.com/project-codeflare/codeflare-operator/pull/591
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sutaakar
Full Changelog: https://github.com/project-codeflare/appwrapper/compare/v0.20.2...v0.21.0