Open ChristianZaccaria opened 2 weeks ago
Adding a watch just for this looks too much for me since having the CRDs installed before enabling an integration in the Kueue's config it's a reasonable expectation and even if it's not the case rollout restart
-ing the Kueue's controller manager can easily solve the issue.
@alculquicondor WDYT?
I also think that it's a reasonable expectation that the administrator needs to restart Kueue.
But I'm willing to accept a PR if it doesn't add significant complexity. It's probably going to be similar to the watch we have for k8s version.
@alculquicondor here is a draft PR we have on our fork: https://github.com/opendatahub-io/kueue/pull/33/files
Planning on changing it a bit to account for Training Operator OR Ray CRDs, then restart the JobFramework controller and webhook once CRDs are available. The approach in the PR separates concerns, providing a more readable and robust code. Please let me know if it looks good, and if should continue and make a PR. Thanks a lot for your help!
so overall you are adding a watch... Maybe it isn't too bad.
wdyt @trasc?
I guess it could work, but it should work for any CRD (maybe some kind of wrapper like https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/core/leader_aware_reconciler.go).
Given the limited use-case and available alternatives, the work might outweigh the added benefits, however we can try it.
(controller-runtime might a better place to implement this kind of feature)
What happened: When the Kueue component is installed before the Ray OR Training Operator, Kueue doesn't monitor RayCluster or PyTorchJob resources + any other CRDs from the Training Operator.
When the user creates a RayCluster or PyTorchJob resources, Kueue doesn't control admission of those resources.
Logs:
What you expected to happen: JobFramework Controller and Webhook should start/restart once required CRDs from Ray OR Training-Operator are available.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: Proposed Solution: Add a watcher to controller and webhook, to start/restart in the event of availability of Ray OR Training-Operator CRDs.
I have a draft PR ready - still needs some work.
Environment:
kubectl version
):git describe --tags --dirty --always
):cat /etc/os-release
):uname -a
):