dgrove-oss closed this 2 months ago.
Although we can work around this in MLBatch, it would be nice if this fix could be merged in time for the next release, so that MLBatch only needs to carry the patch to the codeflare operator's role in our configuration for RHOAI 2.12.
Summarizing an offline discussion: the kueue controller does not update the lendingLimit field of the ClusterQueue. Cluster admins in MLBatch are expected to modify the quota information in the slack ClusterQueue, but should expect the AppWrapper controller to modify the lendingLimit. The AppWrapper controller recomputes the lendingLimit from scratch each time, using the quota and the node status. So even if a cluster admin mistakenly modifies the value, the AppWrapper controller corrects it on the next reconcile by writing the updated value (after handling the resulting update conflict in the usual manner).
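The recompute-from-scratch behavior described above can be sketched in Go. This is a minimal illustration under stated assumptions, not the AppWrapper controller's code: the `Node` type, the `computeLendingLimit` helper, the use of a single GPU count as the only resource, and the rule "lending limit = nominal quota minus the capacity of cordoned nodes, floored at zero" are all hypothetical.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of node status. A real controller
// would read corev1.Node objects and use resource.Quantity values.
type Node struct {
	Name          string
	Unschedulable bool  // true when the node is cordoned
	GPUs          int64 // capacity of the single resource modeled here
}

// computeLendingLimit (hypothetical) derives the lending limit only from the
// nominal quota and the current node status. It never reads the value that is
// currently stored in the ClusterQueue, so a manual edit has no influence on
// the result.
func computeLendingLimit(nominalQuota int64, nodes []Node) int64 {
	var unavailable int64
	for _, n := range nodes {
		if n.Unschedulable {
			unavailable += n.GPUs
		}
	}
	limit := nominalQuota - unavailable
	if limit < 0 {
		limit = 0 // assumed floor: never lend a negative amount
	}
	return limit
}

func main() {
	nodes := []Node{
		{Name: "node-a", GPUs: 8},
		{Name: "node-b", GPUs: 8, Unschedulable: true},
	}
	stored := int64(3) // value mistakenly written by a cluster admin
	desired := computeLendingLimit(16, nodes)
	if stored != desired {
		// In a real reconcile this would be an update of the ClusterQueue,
		// retried on conflict.
		fmt.Printf("updating lendingLimit %d -> %d\n", stored, desired)
		stored = desired
	}
	fmt.Println("lendingLimit:", stored)
}
```

Running it prints `updating lendingLimit 3 -> 8` followed by `lendingLimit: 8`. Because `computeLendingLimit` ignores the stored value, repeating the computation after any manual edit yields the same result, which is what makes overwriting the admin's change on the next reconcile safe.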
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: varshaprasad96
The codeflare operator needs permission to read and write ClusterQueues so that the AppWrapper controller can adjust the lending limit of a designated slack ClusterQueue to reflect cordoned nodes.
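One plausible mapping of "read and write clusterqueues" onto RBAC verbs is sketched below. The verb list (`get`, `list`, `watch` for reading; `update`, `patch` for writing) is an assumption, not taken from the PR's diff. The `PolicyRule` type is a local, hypothetical stand-in that mirrors the fields of `rbacv1.PolicyRule`, and `allows` is a simplified checker, so the sketch needs only the standard library.

```go
package main

import "fmt"

// PolicyRule is a local stand-in mirroring the APIGroups/Resources/Verbs
// fields of rbacv1.PolicyRule (k8s.io/api/rbac/v1).
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether list holds s or the RBAC wildcard "*".
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s || v == "*" {
			return true
		}
	}
	return false
}

// allows (hypothetical, simplified) reports whether any rule grants verb on
// the given API group and resource.
func allows(rules []PolicyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	// Assumed rule for the codeflare operator's role: read and write
	// access to Kueue ClusterQueues, but no create or delete.
	rules := []PolicyRule{{
		APIGroups: []string{"kueue.x-k8s.io"},
		Resources: []string{"clusterqueues"},
		Verbs:     []string{"get", "list", "watch", "update", "patch"},
	}}
	for _, verb := range []string{"get", "list", "watch", "update", "patch", "delete"} {
		fmt.Printf("%-6s %v\n", verb, allows(rules, "kueue.x-k8s.io", "clusterqueues", verb))
	}
}
```

Keeping the rule to read and update verbs lets the controller rewrite the lendingLimit of an existing slack ClusterQueue without being able to create or delete queues.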