project-codeflare / codeflare-operator

Operator for installation and lifecycle management of CodeFlare distributed workload stack
Apache License 2.0

Update AppWrappers to v0.21.1 #588

Closed dgrove-oss closed 3 months ago

dgrove-oss commented 3 months ago

Highlights:

Full Changelog: https://github.com/project-codeflare/appwrapper/compare/v0.20.2...v0.21.0

dgrove-oss commented 3 months ago

At first glance, the logs suggest that the AppWrapper+RayCluster workload is pending because a GPU isn't available, but I haven't fully analyzed whether that is just a transitory problem that might resolve once the concurrently running test that is currently using the GPU finishes.

10T14:54:39.771940512Z","logger":"scheduler","caller":"scheduler/scheduler.go:263","msg":"Workload requires preemption, but there are no candidate workloads allowed for preemption","workload":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"clusterQueue":{"name":"e2e-cluster-queue"},"preemption":{"reclaimWithinCohort":"Never","borrowWithinCohort":{"policy":"Never"},"withinClusterQueue":"Never"}}
{"level":"Level(-2)","ts":"2024-07-10T14:54:39.771986932Z","logger":"scheduler","caller":"scheduler/scheduler.go:619","msg":"Workload re-queued","workload":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"clusterQueue":{"name":"e2e-cluster-queue"},"queue":{"name":"lq-vbsxg","namespace":"test-ns-vpd2b"},"requeueReason":"","added":true}
{"level":"Level(-2)","ts":"2024-07-10T14:54:39.772046413Z","logger":"cluster-queue-reconciler","caller":"core/clusterqueue_controller.go:330","msg":"Got generic event","obj":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"kind":"/, Kind="}
{"level":"debug","ts":"2024-07-10T14:54:39.780801696Z","logger":"events","caller":"recorder/recorder.go:104","msg":"couldn't assign flavors to pod set raycluster-0-1: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 1 more needed","type":"Normal","object":{"kind":"Workload","namespace":"test-ns-vpd2b","name":"appwrapper-raycluster-af4b6","uid":"fd964893-f987-47b1-9858-7001fa2c6a89","apiVersion":"kueue.x-k8s.io/v1beta1","resourceVersion":"3691"},"reason":"Pending"}

sutaakar commented 3 months ago

I have already seen this intermittently; it seems to be caused by the AppWrapper workload not being admitted:

2024-07-10 14:54:39  | 2024-07-10 14:54:39  | appwrapper-raycluster-af4b6.17e0e1c1b799975a  | -          | Normal  | Pending          | couldn't assign flavors to pod set raycluster-0-1: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 1 more needed  | 
2024-07-10 14:54:39  | 2024-07-10 14:54:39  | raycluster.17e0e1c1b7086150                   | -          | Normal  | CreatedWorkload  | Created Workload: test-ns-vpd2b/appwrapper-raycluster-af4b6      

I haven't been able to reproduce it yet, though. I will try to get some more context on that failure (LocalQueue status, Workloads and such).
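
For anyone gathering that context, a minimal sketch of how the JSON log lines quoted earlier in this thread could be filtered down to a single workload. It assumes one JSON record per line and that the workload reference sits under a `workload`, `object`, or `obj` key, as in the quoted lines; the function name `workload_messages` and the abridged sample are illustrative only and not part of Kueue or the operator.

```python
import json

# Abridged copies of log lines quoted in this thread (long fields dropped).
# The first line is truncated in the thread, so it is not valid JSON.
SAMPLE = """\
10T14:54:39.771940512Z","logger":"scheduler","msg":"Workload requires preemption, but there are no candidate workloads allowed for preemption"}
{"level":"Level(-2)","ts":"2024-07-10T14:54:39.771986932Z","logger":"scheduler","msg":"Workload re-queued","workload":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"}}
{"level":"debug","ts":"2024-07-10T14:54:39.780801696Z","logger":"events","msg":"couldn't assign flavors to pod set raycluster-0-1: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 1 more needed","object":{"kind":"Workload","namespace":"test-ns-vpd2b","name":"appwrapper-raycluster-af4b6"},"reason":"Pending"}
"""


def workload_messages(lines, name):
    """Yield (timestamp, logger, message) for records that mention the workload."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(record, dict):
            continue
        # Assumption: the workload reference is under one of these keys,
        # as seen in the log lines quoted in this thread.
        ref = record.get("workload") or record.get("object") or record.get("obj") or {}
        if isinstance(ref, dict) and ref.get("name") == name:
            yield record.get("ts"), record.get("logger"), record.get("msg")


if __name__ == "__main__":
    for ts, logger, msg in workload_messages(
        SAMPLE.splitlines(), "appwrapper-raycluster-af4b6"
    ):
        print(ts, logger, msg, sep=" | ")
```

Skipping lines that fail to parse keeps the filter usable on partially captured logs, such as the truncated first line above.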

dgrove-oss commented 3 months ago

Going to hold because back-leveling Go to 1.22.2 is more trouble than it is worth.

sutaakar commented 3 months ago

The e2e test instability is addressed in https://github.com/project-codeflare/codeflare-operator/pull/591

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sutaakar

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/project-codeflare/codeflare-operator/blob/main/OWNERS)~~ [sutaakar]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.