kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Add a troubleshooting page #1410

Open alculquicondor opened 7 months ago

alculquicondor commented 7 months ago

What would you like to be added:

We can start documenting common user errors. For example:

Why is this needed:

I think one of these scenarios was reported in #1407.

/kind documentation

alculquicondor commented 7 months ago

@tenzen-y @kerthcet have you seen any common user errors?

kerthcet commented 7 months ago

If a Workload is not admitted, check the Workload status.
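A minimal sketch of what "check the status" could look like when scripted. It assumes the Workload has already been dumped to JSON (for example with kubectl get workload <name> -n <namespace> -o json) and only reads the standard status.conditions list. The function name and the sample message are illustrative, not taken from Kueue.

```python
import json


def explain_workload(workload: dict) -> list[str]:
    """Return one human-readable line per status condition of a Workload."""
    conditions = workload.get("status", {}).get("conditions", [])
    if not conditions:
        return ["no conditions reported yet"]
    lines = []
    for cond in conditions:
        lines.append(
            f'{cond.get("type")}={cond.get("status")} '
            f'({cond.get("reason", "")}): {cond.get("message", "")}'
        )
    return lines


# Illustrative input; the reason and message are made up for this example.
SAMPLE_WORKLOAD = json.loads("""
{
  "status": {
    "conditions": [
      {
        "type": "QuotaReserved",
        "status": "False",
        "reason": "Pending",
        "message": "example: insufficient quota"
      }
    ]
  }
}
""")

if __name__ == "__main__":
    for line in explain_workload(SAMPLE_WORKLOAD):
        print(line)
```

The reason and message fields are usually the quickest pointer to why quota could not be reserved.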

Also, I sometimes need to check the feature gates, the way Kubernetes exposes them with kubectl get --raw /metrics | grep kubernetes_feature_enabled. Maybe we should do the same in Kueue. This is not an error, though.
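For reference, a rough sketch of turning that metrics output into a gate-to-boolean map. It parses the Prometheus text format of the kubernetes_feature_enabled gauge as Kubernetes components emit it. The gate names in the sample are placeholders, and whether Kueue exposes an equivalent metric is exactly the open question here.

```python
import re

# Matches lines such as: kubernetes_feature_enabled{name="X",stage="BETA"} 1
_METRIC_LINE = re.compile(
    r'^kubernetes_feature_enabled\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def enabled_feature_gates(metrics_text: str) -> dict[str, bool]:
    """Map feature gate name -> enabled, from Prometheus text-format metrics."""
    gates = {}
    for raw in metrics_text.splitlines():
        match = _METRIC_LINE.match(raw.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(_LABEL.findall(match.group("labels")))
        name = labels.get("name")
        if name:
            gates[name] = float(match.group("value")) == 1.0
    return gates


# Placeholder gate names, for illustration only.
SAMPLE_METRICS = """\
# TYPE kubernetes_feature_enabled gauge
kubernetes_feature_enabled{name="ExampleGateA",stage="BETA"} 1
kubernetes_feature_enabled{name="ExampleGateB",stage="ALPHA"} 0
some_other_metric 42
"""

if __name__ == "__main__":
    print(enabled_feature_gates(SAMPLE_METRICS))
```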

The version of the integrated component is also something we should consider. I have had users complain that Kueue did not work with Kubeflow: one had installed Kubeflow 1.7, but the training-operator was 1.6, and we need 1.7 specifically. Maybe we can treat this as a special case.

tenzen-y commented 7 months ago

Q1. The desired flavor isn't assigned to the Job. A1. The flavors in a ClusterQueue are evaluated from top to bottom and assigned to jobs in that order, so the highest-priority flavor needs to be listed first.
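To illustrate Q1, here is a deliberately simplified model of top-to-bottom flavor evaluation: the first flavor in the list with enough free quota wins. This is a sketch, not Kueue's actual assignment code; it ignores borrowing, preemption, and node affinity matching, and the flavor names and quota numbers are hypothetical.

```python
def pick_flavor(flavors, request):
    """Return the first flavor, in list order, whose free quota covers the request.

    flavors: ordered list of (flavor_name, {resource: free_quota}) pairs
    request: {resource: requested_amount}
    """
    for name, free in flavors:
        if all(free.get(res, 0) >= amount for res, amount in request.items()):
            return name
    return None  # nothing fits, so the workload would stay queued


# Hypothetical flavors: both have room for a 4-CPU job.
ON_DEMAND_FIRST = [("on-demand", {"cpu": 8}), ("spot", {"cpu": 32})]
SPOT_FIRST = [("spot", {"cpu": 32}), ("on-demand", {"cpu": 8})]

if __name__ == "__main__":
    print(pick_flavor(ON_DEMAND_FIRST, {"cpu": 4}))  # on-demand
    print(pick_flavor(SPOT_FIRST, {"cpu": 4}))       # spot
```

Both lists contain the same flavors; only the order changes which one the job gets.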

Q2. Even though a job is admitted, its pods are pending. A2. Kueue only considers the quotas defined in ClusterQueues; it does not consider actual cluster usage. Please check whether the cluster has free capacity.
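A toy model of the gap described in Q2, under the assumption that admission is a pure quota check while pod placement depends on free capacity per node. All numbers and names are hypothetical, CPU is expressed in millicores, and the real scheduler considers far more than CPU.

```python
def admission_vs_scheduling(pods, cpu_per_pod_m, quota_free_m, node_free_m):
    """Compare a quota-only admission decision with what the nodes can hold.

    pods: number of pods in the job
    cpu_per_pod_m: CPU request per pod, in millicores
    quota_free_m: unused CPU quota in the ClusterQueue, in millicores
    node_free_m: list of free CPU per node, in millicores
    """
    admitted = pods * cpu_per_pod_m <= quota_free_m
    # A pod must fit entirely on one node, so count whole pods per node.
    node_slots = sum(free // cpu_per_pod_m for free in node_free_m)
    can_start = min(pods, node_slots) if admitted else 0
    stuck_pending = pods - can_start if admitted else 0
    return {"admitted": admitted, "can_start": can_start, "stuck_pending": stuck_pending}


if __name__ == "__main__":
    # Quota allows 10 CPUs, but the two nodes only have 3 CPUs free each.
    print(admission_vs_scheduling(4, 2000, 10000, [3000, 3000]))
```

In this example the job fits the quota and is admitted, yet only two of its four pods can be placed, so the other two stay pending.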

Q3. Even with sequential admission enabled, only some of the pods start instead of all of them. A3. Kueue isn't the pod scheduler. Kueue doesn't guarantee that all pods start at the same time.

tenzen-y commented 7 months ago

/remove-kind feature

PBundyra commented 4 months ago

/assign

alculquicondor commented 4 months ago

A state diagram of Workload conditions would be useful. Anecdotally, I just got a question from a developer about what QuotaReserved is.
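Until there is a diagram, a coarse reading of the conditions could be sketched as below. It assumes the common progression Pending, QuotaReserved, Admitted, Finished, where QuotaReserved means quota has been reserved in a ClusterQueue and Admitted additionally requires any admission checks to have passed. Evicted and other conditions are intentionally left out, and the state names and precedence are my own simplification rather than an official state machine.

```python
def coarse_state(conditions):
    """Collapse a Workload's conditions into one coarse state name.

    conditions: list of {"type": str, "status": "True" | "False" | "Unknown"}
    """
    is_true = {c["type"]: c.get("status") == "True" for c in conditions}
    if is_true.get("Finished"):
        return "Finished"
    if is_true.get("Admitted"):
        return "Admitted"
    if is_true.get("QuotaReserved"):
        # Quota is held, but admission is not complete yet.
        return "QuotaReserved"
    return "Pending"


if __name__ == "__main__":
    print(coarse_state([]))
    print(coarse_state([{"type": "QuotaReserved", "status": "True"}]))
    print(coarse_state([
        {"type": "QuotaReserved", "status": "True"},
        {"type": "Admitted", "status": "True"},
    ]))
```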

alculquicondor commented 4 months ago

Another common user error: installing the integration (for example jobset or kuberay) after installing kueue. Kueue will not monitor these jobs.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

alculquicondor commented 1 month ago

/remove-lifecycle stale