volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
[CNCF LFX 2024 01-Mar-May]Volcano support multi-clusters AI workload scheduling. #3310

Open Monokaix opened 8 months ago

Monokaix commented 8 months ago

What would you like to be added:

Volcano should support multi-cluster AI workload scheduling and provide rich scheduling strategies to choose an appropriate cluster for each job.

Why is this needed:

Volcano already provides rich scheduling capabilities for AI workloads within a single cluster. As multi-cluster management matures, more and more users run and manage their AI workloads uniformly across multiple clusters. Volcano needs to support multi-cluster AI job scheduling and provide a series of scheduling capabilities, such as job management, gang scheduling, and queue management, in order to select an appropriate cluster for each job. This is first-level scheduling. The scheduler of each cluster then selects appropriate nodes for the job; this is second-level scheduling. Here we only need first-level scheduling.
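To make the first-level step concrete, here is a minimal sketch of choosing one cluster for a whole gang-scheduled job. It is an illustration only and not Volcano's API: the `Cluster` and `Job` types, the `free_gpus` capacity summary, and the "most GPUs left over" scoring rule are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cluster:
    name: str
    free_gpus: int  # hypothetical capacity summary reported by a member cluster


@dataclass(frozen=True)
class Job:
    name: str
    min_available: int     # gang size: replicas that must start together
    gpus_per_replica: int  # hypothetical per-replica resource request


def select_cluster(job: Job, clusters: Sequence[Cluster]) -> Optional[Cluster]:
    """First-level scheduling: pick one cluster for the whole job.

    Filter: keep only clusters that can host the entire gang.
    Score: prefer the cluster with the most free GPUs left afterwards
    (an assumed policy; other strategies could be plugged in here).
    Returns None when no cluster fits, so the job would stay pending.
    """
    demand = job.min_available * job.gpus_per_replica
    feasible = [c for c in clusters if c.free_gpus >= demand]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.free_gpus - demand)


clusters = [Cluster("cluster-a", 4), Cluster("cluster-b", 16), Cluster("cluster-c", 8)]
job = Job("train", min_available=4, gpus_per_replica=2)
chosen = select_cluster(job, clusters)
print(chosen.name if chosen else "pending")
```

Node placement inside the chosen cluster (second-level scheduling) is deliberately left out, since it would remain the job of each cluster's own scheduler.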

lowang-bh commented 8 months ago

Repo is here: https://github.com/volcano-sh/federation

Monokaix commented 8 months ago

We should keep working on this :)

RohanMishra315 commented 8 months ago

Hey @Monokaix, I would love to work on this! I have previous experience working with Karmada. I would love to take this on as a challenge and am looking forward to it.

Monokaix commented 8 months ago

Hi, thanks for your enthusiasm! Sorry that I didn't mention that this is a CNCF LFX project; you can apply for this project here :)

SpringWiz11 commented 8 months ago

Hey @Monokaix,

I just noticed that this project is a CNCF LFX project, and I am thrilled to work on this.

Having worked extensively on multi-cluster scheduling and AI, I bring valuable industry experience to the table. I have experience building scalable cloud-native and AI applications, ranging from traditional deep learning models to cutting-edge federated learning models deployed in production environments using frameworks like Flower, FedML, and PySyft.

I also have hands-on experience with Karmada and would love to explore more and make valuable contributions.

Given this opportunity, I would like to apply my multi-cloud, multi-cluster, and AI skill set under the guidance of the established engineers at Volcano.

Monokaix commented 8 months ago

Welcome! And you can apply here.

TrungBui59 commented 8 months ago

Hi @Monokaix,

I just applied to the CNCF LFX Mentorship program for this project. I am very interested in this project and would love to contribute to it. Is there any advice you have for me to get to understand the codebase and start with the good-first-issue issues?

Vacant2333 commented 7 months ago

Hi! I'm very interested in this issue and have just applied through LFX. I'm currently a Karmada reviewer; feel free to take a look at my GitHub page~