project-codeflare / codeflare-operator

Operator for installation and lifecycle management of CodeFlare distributed workload stack
Apache License 2.0

RBAC fix to enable slack cluster queue lending limit adjustment #613

Closed: dgrove-oss closed this 2 months ago

dgrove-oss commented 2 months ago

The codeflare operator needs permission to read and write clusterqueues to enable the AppWrapper controller to adjust the lending limit of a designated slack cluster queue to reflect cordoned nodes.
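As a rough illustration of the permission described above, the sketch below models an RBAC rule for ClusterQueues and a simplified check against it. `PolicyRule` here is a local stand-in that mirrors three fields of `rbacv1.PolicyRule` so the example needs no external modules. The API group `kueue.x-k8s.io` and resource `clusterqueues` are the standard Kueue names, but the verb list is an assumption about what "read and write" covers, not a copy of the change in this PR. The helper names `clusterQueueRule` and `Allows` are hypothetical.

```go
package main

// PolicyRule is a local stand-in mirroring three fields of
// rbacv1.PolicyRule (k8s.io/api/rbac/v1). It is not the real type.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// clusterQueueRule returns the assumed shape of the added permission.
// The verbs are a guess at "read and write"; the PR's actual list may differ.
func clusterQueueRule() PolicyRule {
	return PolicyRule{
		APIGroups: []string{"kueue.x-k8s.io"},
		Resources: []string{"clusterqueues"},
		Verbs:     []string{"get", "list", "watch", "update", "patch"},
	}
}

// contains reports whether want appears in items.
func contains(items []string, want string) bool {
	for _, item := range items {
		if item == want {
			return true
		}
	}
	return false
}

// Allows is a simplified check: it matches exact names only and does
// not handle the "*" wildcard that real RBAC evaluation supports.
func (r PolicyRule) Allows(group, resource, verb string) bool {
	return contains(r.APIGroups, group) &&
		contains(r.Resources, resource) &&
		contains(r.Verbs, verb)
}
```

With this rule, `clusterQueueRule().Allows("kueue.x-k8s.io", "clusterqueues", "update")` is true, while the same call with `"delete"` is false, which matches the read-and-write scope the description asks for.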

dgrove-oss commented 2 months ago

Although we can work around this in MLBatch, it would be nice if this fix could be merged in time for the next release, so that MLBatch only needs to carry the patch to the codeflare operator's role in our configuration for RHOAI 2.12.

dgrove-oss commented 2 months ago

Summarizing an offline discussion: the kueue controller does not update the lendingLimit field of the ClusterQueue. Cluster admins in MLBatch are expected to modify the quota information in the slack ClusterQueue, but they expect the AppWrapper controller to manage the lendingLimit. The AppWrapper controller recomputes the lendingLimit from the quota and the node status on every reconcile. So even if a cluster admin mistakenly modified the value, the AppWrapper controller would correct it on the next reconcile by writing an updated value (after handling any reconcile conflict in the usual manner).
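The recompute-from-scratch behavior can be sketched as follows. This is a minimal illustration, not the AppWrapper controller's actual code: the `Node` type, the function names, and the formula (nominal quota minus capacity on cordoned nodes, clamped at zero) are all assumptions made for the example. The point it shows is that the desired value depends only on the quota and the node status, never on the currently stored lendingLimit, so a mistaken manual edit is overwritten on the next reconcile.

```go
package main

// Node is a hypothetical, simplified view of a cluster node.
type Node struct {
	Name          string
	Unschedulable bool  // true when the node is cordoned
	Capacity      int64 // units of the quota'd resource, e.g. GPUs
}

// slackLendingLimit recomputes the lending limit from the nominal quota
// and node status alone. Assumed formula: quota minus the capacity of
// cordoned nodes, never below zero.
func slackLendingLimit(nominalQuota int64, nodes []Node) int64 {
	var cordoned int64
	for _, n := range nodes {
		if n.Unschedulable {
			cordoned += n.Capacity
		}
	}
	limit := nominalQuota - cordoned
	if limit < 0 {
		return 0
	}
	return limit
}

// reconcileLendingLimit returns the value to store and whether a write
// is needed. The current value is used only to skip redundant writes.
func reconcileLendingLimit(current, nominalQuota int64, nodes []Node) (int64, bool) {
	desired := slackLendingLimit(nominalQuota, nodes)
	return desired, desired != current
}
```

For example, with a quota of 16 and two 8-unit nodes of which one is cordoned, the desired limit is 8. If an admin had set the stored value to 16, `reconcileLendingLimit(16, 16, nodes)` reports that a write is needed. In a real controller that write could fail with a conflict and would be retried after re-reading the object.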

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: varshaprasad96

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/project-codeflare/codeflare-operator/blob/main/OWNERS)~~ [varshaprasad96]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.