kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0
1.47k stars 262 forks source link

TAS: fix dropping of reconcile requests for non-leading replica #3612

Closed mimowo closed 3 days ago

mimowo commented 3 days ago

What type of PR is this?

/kind bug

What this PR does / why we need it:

To fix the bug where Kueue drops reconcile requests for the non-leading replica, due to the handling here. The reconciled type for the object was passed wrong (ClusterQueue instead of ResourceFlavor), and thus this was returning NotFound, and thus be ignored here.

As a consequence, after a rolling restart, the previously non-leading replica would not perform Reconcile when becoming the leading replica (and the events where lost). Users could observe this as workloads stuck with the following message: couldn''t assign flavors to pod set main: Flavor "tas-flavor" information missing in TAS cache, Workload requires Topology, but there is no TAS cache information

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix dropping of reconcile requests for non-leading replica, which was resulting in workloads
getting stuck pending after the rolling restart of Kueue.
k8s-ci-robot commented 3 days ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[pkg/controller/OWNERS](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/OWNERS)~~ [mimowo] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
netlify[bot] commented 3 days ago

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
Latest commit 52a05e33dfffbb8e88482dab148c4c02ce2da5b4
Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6740312146780900085f78f6
Deploy Preview https://deploy-preview-3612--kubernetes-sigs-kueue.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

mimowo commented 3 days ago

cc @mwysokin - thank you for testing and nailing down the tricky scenario! cc @PBundyra @mbobrovskyi

mbobrovskyi commented 3 days ago

/lgtm Thanks!

k8s-ci-robot commented 3 days ago

LGTM label has been added.

Git tree hash: b3ecd4af9145ad7a36d2a49625f05d0775312834

mbobrovskyi commented 3 days ago

/cherry-pick release-0.9

k8s-infra-cherrypick-robot commented 3 days ago

@mbobrovskyi: once the present PR merges, I will cherry-pick it on top of release-0.9 in a new PR and assign it to you.

In response to [this](https://github.com/kubernetes-sigs/kueue/pull/3612#issuecomment-2493052724): >/cherry-pick release-0.9 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
k8s-infra-cherrypick-robot commented 3 days ago

@mbobrovskyi: new pull request created: #3613

In response to [this](https://github.com/kubernetes-sigs/kueue/pull/3612#issuecomment-2493052724): >/cherry-pick release-0.9 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.