kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0
1.25k stars 222 forks source link

Add a watch for Ray and Training Operator CRDs #2414

Open ChristianZaccaria opened 2 weeks ago

ChristianZaccaria commented 2 weeks ago

What happened: When the Kueue component is installed before the Ray OR Training Operator, Kueue doesn't monitor RayCluster or PyTorchJob resources + any other CRDs from the Training Operator.

When the user creates a RayCluster or PyTorchJob resources, Kueue doesn't control admission of those resources.

Logs:

{"level":"info","ts":"2024-05-30T11:07:08.761740832Z","logger":"setup","caller":"jobframework/setup.go:69","msg":"No matching API in the server for job framework, skipped setup of controller and webhook","jobFrameworkName":"kubeflow.org/pytorchjob"}

What you expected to happen: JobFramework Controller and Webhook should start/restart once required CRDs from Ray OR Training-Operator are available.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy Kueue without the Ray or Training-Operator CRDs.
  2. Check Kueue logs.

Anything else we need to know?: Proposed Solution: Add a watcher to controller and webhook, to start/restart in the event of availability of Ray OR Training-Operator CRDs.

I have a draft PR ready - still needs some work.

Environment:

trasc commented 2 weeks ago

Adding a watch just for this looks too much for me since having the CRDs installed before enabling an integration in the Kueue's config it's a reasonable expectation and even if it's not the case rollout restart-ing the Kueue's controller manager can easily solve the issue.

@alculquicondor WDYT?

alculquicondor commented 2 weeks ago

I also think that it's a reasonable expectation that the administrator needs to restart Kueue.

But I'm willing to accept a PR if it doesn't add significant complexity. It's probably going to be similar to the watch we have for k8s version.

ChristianZaccaria commented 2 weeks ago

@alculquicondor here is a draft PR we have on our fork: https://github.com/opendatahub-io/kueue/pull/33/files

Planning on changing it a bit to account for Training Operator OR Ray CRDs, then restart the JobFramework controller and webhook once CRDs are available. The approach in the PR separates concerns, providing a more readable and robust code. Please let me know if it looks good, and if should continue and make a PR. Thanks a lot for your help!

alculquicondor commented 2 weeks ago

so overall you are adding a watch... Maybe it isn't too bad.

wdyt @trasc?

trasc commented 1 week ago

I guess it could work, but it should work for any CRD (maybe some kind of wrapper like https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/core/leader_aware_reconciler.go).

Given the limited use-case and available alternatives, the work might outweigh the added benefits, however we can try it.

(controller-runtime might a better place to implement this kind of feature)