Open alculquicondor opened 7 months ago
@tenzen-y @kerthcet have you seen any common user errors?
If the Workload is not admitted, check the Workload status.
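A minimal sketch of that check, assuming `kubectl` access to the cluster and that the Kueue CRDs are installed. The function names, the namespace, and the Workload name are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: inspect why a Kueue Workload has not been admitted.
# Function names here are illustrative, not part of Kueue.

# List all Workloads in a namespace.
list_workloads() {
  local ns="$1"
  kubectl get workloads -n "$ns"
}

# Print one line per status condition: type, status, reason, message.
workload_conditions() {
  local ns="$1" name="$2"
  kubectl get workload "$name" -n "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Example (hypothetical names):
#   list_workloads team-a
#   workload_conditions team-a job-sample-abc12
```

The condition messages usually say which quota or flavor could not be satisfied, so they are a reasonable first place to look.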
Also, I sometimes need to check the feature gates, as Kubernetes allows with `kubectl get --raw /metrics | grep kubernetes_feature_enabled`. Maybe we should do the same in Kueue. This is not an error, though.
The version of the integrated component is also something we should consider. I have met users complaining that Kueue did not work with Kubeflow: one had installed Kubeflow 1.7, but the training-operator was 1.6, and we need 1.7 specifically. Maybe we can treat this as a special case.
Q1. The desired flavor isn't assigned to the Job. A1. The flavors in a ClusterQueue are evaluated from top to bottom and assigned to jobs. The highest-priority flavor needs to be placed at the top.
Q2. Although a job has been admitted, its pods are pending. A2. Kueue considers only the quotas defined in ClusterQueues, not actual cluster usage. Please check whether the cluster has free capacity.
Q3. Although sequential admission is enabled, not all pods can be started, and only some of the pods are started. A3. Kueue isn't a pod scheduler. Kueue doesn't guarantee that all pods are started at the same time.
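To make Q1 and Q2 concrete, here is a hedged sketch. The ClusterQueue name, the flavor names `reserved` and `spot`, the quota values, and all function names are hypothetical. The manifest assumes the `kueue.x-k8s.io/v1beta1` API and that matching ResourceFlavor objects already exist.

```shell
#!/usr/bin/env bash
# Q1: write a ClusterQueue manifest whose flavors are listed in preference order.
write_cluster_queue() {
  local out="$1"
  cat > "$out" <<'EOF'
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq            # hypothetical name
spec:
  namespaceSelector: {}      # match all namespaces
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: reserved         # listed first, so it is evaluated first
      resources:
      - name: cpu
        nominalQuota: 10
    - name: spot             # evaluated after "reserved"
      resources:
      - name: cpu
        nominalQuota: 40
EOF
}

# Q2: pods in a namespace that have not been scheduled yet.
pending_pods() {
  local ns="$1"
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending
}

# Q2: allocatable CPU and memory per node, to compare against the quotas.
node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}

# Example:
#   write_cluster_queue cq.yaml && kubectl apply -f cq.yaml
#   pending_pods team-a
#   node_allocatable
```

If the quotas in the ClusterQueue add up to more than the nodes can actually provide, admitted jobs can still end up with pending pods.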
/remove-kind feature
/assign
A state diagram of Workload conditions would be useful. Anecdotally, I just got a question from a developer about what QuotaReserved is.
Another common user error: installing the integration (for example JobSet or KubeRay) after installing Kueue. Kueue will not monitor these jobs.
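A rough sketch of how one might check for this. It assumes the default install location (namespace `kueue-system`, deployment `kueue-controller-manager`) and that restarting the controller lets it pick up CRDs installed afterwards. Both are assumptions to verify against your installation, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds if the given CRD exists in the cluster.
crd_installed() {
  kubectl get crd "$1" >/dev/null 2>&1
}

# Restart the Kueue controller (assumed default namespace and deployment name).
restart_kueue() {
  kubectl -n kueue-system rollout restart deployment/kueue-controller-manager
}

# If the integration's CRD is present, restart Kueue; otherwise report and fail.
ensure_integration() {
  local crd="$1"
  if crd_installed "$crd"; then
    restart_kueue
  else
    echo "CRD $crd not found; install the integration first" >&2
    return 1
  fi
}

# Example:
#   ensure_integration jobsets.jobset.x-k8s.io
#   ensure_integration rayjobs.ray.io
```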
/lifecycle stale
/remove-lifecycle stale
What would you like to be added:
We can start documenting common user errors. For example:
Why is this needed:
I think one of these scenarios was reported in #1407.
/kind documentation