kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0
1.32k stars 230 forks source link

Workload is not admitted after upgrading from kueue 0.3 to 0.4 #827

Closed alculquicondor closed 1 year ago

alculquicondor commented 1 year ago

What happened:

manager {"level":"error","ts":"2023-06-01T14:50:36.169498131Z","logger":"scheduler","caller":"scheduler/scheduler.go:306","msg":"Could not admit Workload and assign flavors in apiserver","workload":{"name":"job-jabba-job-y3tjgtaosbfc3aoctuqxtisu7bhw65edjy7eaoz5dvktyqo3qrhq-d9a73","namespace":"amp-batch"},"clusterQueue":{"name":"cluster-queue-part-mlops"},"error":"Workload.kueue.x-k8s.io \"job-jabba-job-y3tjgtaosbfc3aoctuqxtisu7bhw65edjy7eaoz5dvktyqo3qrhq-d9a73\" is invalid: status.admission.podSetAssignments.count: Required value","stacktrace":"sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).admit.func1\n\t/workspace/pkg/scheduler/scheduler.go:306\nsigs.k8s.io/kueue/pkg/util/routine.(*wrapper).Run.func1\n\t/workspace/pkg/util/routine/wrapper.go:44"}

What you expected to happen:

The admission to succeed.

How to reproduce it (as minimally and precisely as possible):

This was reproduced by using the helm charts in the main branch, while using a 0.3.1 binary.

Anything else we need to know?:

We probably need to make count optional.

Environment:

alculquicondor commented 1 year ago

/assign @trasc

trasc commented 1 year ago

@alculquicondor , this could, theoretically, only happen when the CRDs are v0.4 but the running manager is still 0.3 , do we want to cover this scenario?

Anyway I'll have another look on Tuesday.

alculquicondor commented 1 year ago

In the case of admission, you are right. And no, we don't need to support that.

But I would be worried about cases where a 0.4 controller doesn't populate the field for an old workload, and gets rejected by the API. Although I'm not sure if this is possible. Maybe the controller would still send count: 0.

Anyway I'll have another look on Tuesday.

Thanks!

trasc commented 1 year ago

It looks like there is another way we can get to that error. While SSA even if only an unrelated condition is changed, the entire sub-resource is re validated. Well heed to make the count optional from an API point point view.

alculquicondor commented 1 year ago

Ah... yes, SSA does checks that other methods don't

Thanks for investigating!