Closed by asaiacai 2 months ago
I believe this behavior (tearing down a node after 10min) is driven by CA rather than Kueue.
@asaiacai what is your CA (and Kube) version?
Maybe @yaroslava-serdiuk or @mwielgus would have some suggestions.
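For context, a roughly 10-minute teardown of an underutilized node matches cluster-autoscaler's default scale-down settings. On a self-managed CA these show up as flags on the Deployment (sketch below; the image tag and values are illustrative, and on GKE these flags are managed by Google rather than user-configurable):

```yaml
# Excerpt of a self-managed cluster-autoscaler Deployment (illustrative only).
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
  command:
  - ./cluster-autoscaler
  # A node is eligible for removal after being unneeded this long (default 10m),
  # which lines up with the ~10min teardown described in this issue.
  - --scale-down-unneeded-time=10m
  # A node counts as unneeded when its utilization drops below this fraction.
  - --scale-down-utilization-threshold=0.5
```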
@asaiacai another thing you might be hitting is maxRunDuration, you can try to update similarly as in this example.
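The run duration is typically set through the ProvisioningRequestConfig that Kueue uses for the DWS integration. A minimal sketch is below; the parameter key (`maxRunDurationSeconds`), its value, and the GPU resource name are assumptions to verify against the GKE/DWS docs for your version:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config            # hypothetical name
spec:
  # GKE's queued (DWS) provisioning class.
  provisioningClassName: queued-provisioning.gke.io
  managedResources:
  - nvidia.com/gpu
  parameters:
    # How long the provisioned VMs may run before GKE reclaims them;
    # the exact key name and format may differ by GKE version.
    maxRunDurationSeconds: "86400"
```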
Let the community know if this helps. If it doesn't, I would recommend closing the issue here and opening a support ticket for GKE/DWS.
I'm on 1.30.3-gke.1225000.
Will try changing the run duration.
Just an update: bumping the maxRunDuration extends the node lifetime. It'd be nice if a log or event were emitted in the future explaining the pod/node scale-down. Closing this issue. Thanks @mimowo!
I've gotten node provisioning to work via the Kueue-DWS integration, but if my pods happen to have low CPU usage for 10 minutes, the node gets deprovisioned and my job fails. I have a separate lifecycle manager for the pods, so it'd be nice to have the nodes/pods stay alive until I explicitly terminate them. I already tried including the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation, but with no success.
These were my Kueue resources:
and my pod definitions:
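For reference, the annotation the reporter mentions is set in pod metadata as sketched below (names and image are hypothetical). Note that it only tells cluster-autoscaler not to evict the pod during utilization-based scale-down; it does not prevent the VM from being reclaimed when the DWS run duration expires:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-workload           # hypothetical name
  annotations:
    # Ask CA not to evict this pod when consolidating underutilized nodes.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
  - name: main
    image: busybox            # placeholder image
    command: ["sleep", "infinity"]
```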