kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

[RayJob] Add support for partial admission. #783

Closed trasc closed 1 month ago

trasc commented 1 year ago

What would you like to be added:

Add support for partial admission for RayJobs. See #420 and https://github.com/kubernetes-sigs/kueue/pull/667/files#r1198519116 for details.

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

alculquicondor commented 1 year ago

FYI, we don't need a new KEP, but you can add the details to the existing one.

kerthcet commented 1 year ago

Hi all, what's left to do here? Our team is interested in the RayJob integration.

tenzen-y commented 1 year ago

IIRC, we don't currently support partial admission for RayJob. So we need to implement minPodsCount and then modify the RayJob functions based on minPodsCount, like this:

https://github.com/kubernetes-sigs/kueue/blob/f215a43a7be9b3c2788e00ec03c130b8fbc053b5/pkg/controller/jobs/job/job_controller.go#L279-L286

https://github.com/kubernetes-sigs/kueue/blob/f215a43a7be9b3c2788e00ec03c130b8fbc053b5/pkg/controller/jobs/job/job_controller.go#L199-L208
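To make the idea concrete, here is a minimal sketch of the two steps involved: reporting a minimum count for each pod set, and resizing the job to whatever count was admitted. It uses simplified, hypothetical types (`WorkerGroup`, `PodSet`, `MinReplicas`, `podSets`, `applyAdmittedCounts`), not the real Kueue or KubeRay APIs, so treat it as an illustration of the shape of the change rather than the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkerGroup is a hypothetical, simplified stand-in for a RayJob worker group.
type WorkerGroup struct {
	Name        string
	Replicas    int32
	MinReplicas *int32 // hypothetical field: nil means the group cannot shrink
}

// PodSet is a hypothetical, simplified stand-in for what a job reports to the queue.
type PodSet struct {
	Name     string
	Count    int32
	MinCount *int32 // set only when a smaller count is acceptable
}

// lowestCount returns the smallest replica count the group accepts.
func lowestCount(g WorkerGroup) int32 {
	if g.MinReplicas != nil && *g.MinReplicas < g.Replicas {
		return *g.MinReplicas
	}
	return g.Replicas
}

// podSets reports one pod set per worker group, including its minimum count.
func podSets(groups []WorkerGroup) []PodSet {
	out := make([]PodSet, 0, len(groups))
	for _, g := range groups {
		ps := PodSet{Name: g.Name, Count: g.Replicas}
		if low := lowestCount(g); low < g.Replicas {
			ps.MinCount = &low
		}
		out = append(out, ps)
	}
	return out
}

// applyAdmittedCounts resizes each group to the admitted count.
// All counts are validated first, so a failure leaves the groups unchanged.
func applyAdmittedCounts(groups []WorkerGroup, admitted []int32) error {
	if len(admitted) != len(groups) {
		return errors.New("admitted counts do not match worker groups")
	}
	for i, g := range groups {
		if admitted[i] < lowestCount(g) || admitted[i] > g.Replicas {
			return fmt.Errorf("group %q: admitted count %d is out of range", g.Name, admitted[i])
		}
	}
	for i := range groups {
		groups[i].Replicas = admitted[i]
	}
	return nil
}

func main() {
	four := int32(4)
	groups := []WorkerGroup{{Name: "workers", Replicas: 8, MinReplicas: &four}}
	for _, ps := range podSets(groups) {
		fmt.Printf("pod set %s: count=%d min=%d\n", ps.Name, ps.Count, *ps.MinCount)
	}
	if err := applyAdmittedCounts(groups, []int32{5}); err != nil {
		fmt.Println("admission rejected:", err)
		return
	}
	fmt.Println("running with replicas:", groups[0].Replicas)
}
```

Validating every count before mutating anything keeps the job spec consistent if one group's admitted count turns out to be invalid.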

kerthcet commented 1 year ago

Thanks @tenzen-y for the feedback. cc @BinL233

alculquicondor commented 1 year ago

@kerthcet can you share how heterogeneous your Ray jobs are?

I wonder if we can simplify support for partial admission by restricting it to one podset. Otherwise it's an NP-hard problem.
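As a rough illustration of why a single podset keeps things simple: with only one count to choose, the largest size that fits can be computed directly from the per-pod requests and the available quota. The sketch below assumes hypothetical names (`largestFittingCount`, plain resource maps) and is not Kueue's scheduler code. With several podsets the scheduler would instead have to pick a combination of counts, which is where the search space grows.

```go
package main

import "fmt"

// largestFittingCount returns the largest pod count in [minCount, requested]
// whose total requests fit in the available quota, or false if even
// minCount does not fit. Resource names and maps are illustrative only.
func largestFittingCount(requested, minCount int32, perPod, available map[string]int64) (int32, bool) {
	count := requested
	for resource, need := range perPod {
		if need <= 0 {
			continue // a resource that is not requested never limits the count
		}
		// A resource missing from available has zero quota, so fit is 0.
		fit := available[resource] / need
		if fit < int64(count) {
			count = int32(fit)
		}
	}
	if count < minCount {
		return 0, false
	}
	return count, true
}

func main() {
	perPod := map[string]int64{"cpu": 2, "memory": 4}
	available := map[string]int64{"cpu": 13, "memory": 40}
	if n, ok := largestFittingCount(10, 4, perPod, available); ok {
		fmt.Println("admit with pods:", n)
	} else {
		fmt.Println("cannot admit, even at the minimum size")
	}
}
```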

kerthcet commented 1 year ago

We're still exploring this, but we've found that RayCluster autoscaling is complex, and it may be out of scope for Kueue and more related to cluster-autoscaler. The Ray community recommends a 1 pod (Ray node) : 1 node ratio.

For example, when there aren't enough resources for autoscaling, the RayJob hangs forever; even though some of its tasks have finished, the resources are not reclaimed. So I think there is little Kueue can do here.

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tenzen-y commented 6 months ago

/remove-lifecycle stale

alculquicondor commented 6 months ago

cc @astefanutti @vicentefb @andrewsykim in case you are interested in this.

Partial admission differs from elastic jobs in that Kueue decides during admission to give the RayJob a smaller size, and the job runs at that size until it completes.

k8s-triage-robot commented 3 months ago

/lifecycle stale

k8s-triage-robot commented 2 months ago

/lifecycle rotten

k8s-triage-robot commented 1 month ago

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/kueue/issues/783#issuecomment-2185214820):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.