open-cluster-management-io / ocm

Core components in the OCM project. Report here if you found any issues in OCM.
https://open-cluster-management.io
Apache License 2.0

[GSoC 2024] Scheduling AI workload among multiple clusters #369

Open haoqing0110 opened 7 months ago

haoqing0110 commented 7 months ago

This is one of the GSoC 2024 projects.

Announcement https://github.com/cncf/mentoring/discussions/1221

Google Summer of Code 2024 Timeline https://developers.google.com/open-source/gsoc/timeline

Description

Open Cluster Management (OCM) focuses on multicluster and multicloud management scenarios for Kubernetes applications. Open APIs are evolving within this project for cluster registration, workload distribution, dynamic placement of policies and workloads, and much more. The placement concept is used to dynamically select a set of clusters so that higher level users can either replicate Kubernetes resources to the member clusters or run their advanced workload. For example: as an application developer, I can deploy my workload to clusters with the most allocatable memory and CPU.

Now, with the rise of AI technology, there's a growing need to schedule AI workloads based on GPU/TPU resources. In this project we want you to use the placement extensible scheduling mechanism to implement a GPU/TPU resource collector addon, built with the addon template, and provide an AddonPlacementScore so that placement decisions can be made based on GPU/TPU resources. We also want you to propose a customized external Kueue Admission Check controller that consumes the placement decision to schedule AI workloads among multiple clusters based on GPU/TPU resources.

Expected Outcome

Recommended Skills

Golang, Kubernetes, Scheduling

Mentor(s)

Qing Hao (@haoqing0110, qhao@redhat.com) - primary
Jian Qiu (@qiujian16, jqiu@redhat.com)

References

- Open Cluster Management
- Placement concept
- AddOn concept
- Placement extensible scheduling mechanism
- Build an addon with addon template
- GPU on *KS, for example GPUs in GKE
- Kueue Admission Check

Discussion

Feel free to raise your questions here. You can also reach out to us in the Slack channel. Failed to join by the link? See solutions at https://github.com/open-cluster-management-io/ocm/issues/369#issuecomment-1988798011 .

Sayanjones commented 7 months ago

Hi @haoqing0110, I am interested in working on this project. Can we discuss this further?

Sayanjones commented 7 months ago

I have gone through the project and understand that it requires two things: an addon that collects GPU/TPU resource data and scores clusters on it (to be contributed to addon-contrib), and a proposal for an external Kueue Admission Check controller that uses OCM's placement decisions for scheduling (community review needed).

z1ens commented 7 months ago

Hello @haoqing0110 :) I am really interested in this GSoC project and looking forward to contributing useful code to OCM.



About me: My name is Zhe Shen, and I am a third-year undergraduate computer science student in Germany. I am familiar with Go and Kubernetes. I recently completed a project building a FaaS platform from scratch that integrates with a Kubernetes environment and can deploy, manage, and scale functions easily.


I went through the official OCM website and tried some of the features, including installing OCM, deploying Kubernetes resources to a specific cluster with ManifestWork, and creating a Placement to manage a set of clusters (distributing the deployments across both clusters). All of them worked.



After researching addon templates, I have a few questions:


  1. How will the AddonPlacementScore algorithm evaluate clusters based on their GPU/TPU resources? What factors will it consider (utilization rates? custom metrics)?
  2. How will the AddonPlacementScore integrate with existing OCM scheduling mechanisms?
  3. How can we ensure the addon and controller are compatible with different Kubernetes distributions and versions?

All in all, I am aware that this project is more challenging than building a FaaS, and I am ready to learn and work on it! Thank you for taking the time to read this; I look forward to your reply. P.S. I noticed that the OCM website supports documentation in Chinese. I can help maintain it as well, since Chinese is my native language.

haoqing0110 commented 7 months ago

Hello @Sayanjones @z1ens, thanks for your interest in this project. Feel free to join our community Slack channel if you want to discuss further.


@z1ens Thank you for your questions. Below are some of my thoughts:

  1. The most basic factors are the allocatable resources and their usage. Metrics are a good idea; it is worth investigating whether they are feasible.
  2. Internally, scheduling is logically divided into two phases, Predicate and Prioritize. Using AddonPlacementScore to select clusters is one part of that process. I hope the placement concept page makes this clear. At the code level, you can refer to https://github.com/open-cluster-management-io/ocm/blob/main/pkg/placement/plugins/addon/addon.go to see how it works.
  3. In most cases I think a Kubernetes upgrade should preserve backward compatibility, but we also need to pay attention to any breaking changes.
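
For readers following the thread, the two phases mentioned in point 2 can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the linked `addon.go`: the `Cluster` type, the `predicate` and `prioritize` functions, the `accelerator` label, and the `gpuAvailable` score name are all hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is a hypothetical, simplified stand-in for a managed cluster.
type Cluster struct {
	Name   string
	Labels map[string]string
	// Scores maps a score name to a value, in the spirit of the items an
	// AddonPlacementScore would carry. The names are made up for this sketch.
	Scores map[string]int32
}

// predicate is phase 1: keep only clusters carrying the required label.
func predicate(clusters []Cluster, key, value string) []Cluster {
	var kept []Cluster
	for _, c := range clusters {
		if c.Labels[key] == value {
			kept = append(kept, c)
		}
	}
	return kept
}

// prioritize is phase 2: order clusters by the named score, highest first,
// and return the names of at most n of them. A missing score counts as 0.
func prioritize(clusters []Cluster, scoreName string, n int) []string {
	sorted := append([]Cluster(nil), clusters...) // copy, do not reorder the input
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Scores[scoreName] > sorted[j].Scores[scoreName]
	})
	if n < 0 {
		n = 0
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	names := make([]string, 0, n)
	for _, c := range sorted[:n] {
		names = append(names, c.Name)
	}
	return names
}

func main() {
	clusters := []Cluster{
		{Name: "cluster1", Labels: map[string]string{"accelerator": "gpu"}, Scores: map[string]int32{"gpuAvailable": 10}},
		{Name: "cluster2", Labels: map[string]string{"accelerator": "gpu"}, Scores: map[string]int32{"gpuAvailable": 80}},
		{Name: "cluster3", Labels: map[string]string{"accelerator": "none"}, Scores: map[string]int32{"gpuAvailable": 100}},
	}
	candidates := predicate(clusters, "accelerator", "gpu")
	fmt.Println(prioritize(candidates, "gpuAvailable", 1))
}
```

Note that `cluster3` has the highest score but is dropped in the filtering phase, so the score only ranks clusters that already passed the predicates.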

haoqing0110 commented 7 months ago

cc @qiujian16

k2nt commented 6 months ago

Hi @haoqing0110, my name is Khai. I came across this project in GSoC 2024, and I would love to be a contributor. I tried to join the Slack page but I ran into the error "It looks like there isn’t an account on Kubernetes tied to this email address.". I look forward to discussing more with you!

mikeshng commented 6 months ago

https://communityinviter.com/apps/kubernetes/community

@k2nt you can get an invite here for the Slack channel.

k2nt commented 6 months ago

Hi @mikeshng. Thank you for your email (and post)! I hope you can point me to the correct channel for this project (I assume that it is open-cluster-mgmt). I am posting here instead of replying via email so that other contributors can see this also.

mikeshng commented 6 months ago

Thanks @k2nt yes, the channel is #open-cluster-mgmt

z1ens commented 6 months ago

Hello @haoqing0110, thank you for patiently answering my questions. Your ideas sound inspiring. I will take a look at the code, and I have just joined the Slack channel. Have a nice day!

mikeshng commented 6 months ago

Hi all, @haoqing0110 is going to talk more about this topic in this week's community meeting.

Please feel free to ask any questions here or during the meeting.

You can find the community meeting schedule here: https://calendar.google.com/calendar/u/0/embed?src=openclustermanagement@gmail.com

haoqing0110 commented 5 months ago

This has been selected to participate in this year's Google Summer of Code! 🎉 https://github.com/cncf/mentoring/discussions/1221

haoqing0110 commented 5 months ago

/assign @z1ens

ivan-cai commented 1 month ago

@qiujian16 @haoqing0110 The resource-usage-collect agent needs to consider the available resources of each node. Sometimes the cluster resources are sufficient, but the node resources are insufficient.

haoqing0110 commented 1 month ago

@ivan-cai Yes, I believe @z1ens's PR https://github.com/open-cluster-management-io/addon-contrib/pull/20 has changed the calculation so that the score is based on the max node resource. We also discussed whether we need both a cluster resource score and a node resource score; the node resource score seems more useful.

z1ens commented 1 month ago

@ivan-cai Exactly as @haoqing0110 mentioned, I’ve implemented a scoring strategy in the resource-usage-collect-addon that includes both node scope and cluster scope scores. In Kubernetes, a job can only be scheduled if a single node in the cluster has resources >= the job's request. Therefore, linking the scoring mechanism to the node with the maximum available resources is logical. I also developed a cluster scope score that assesses the total available resources in the cluster, as sometimes cluster admins want to spread workloads across multiple clusters or nodes to enhance resource utilization.
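
To make the difference between the two scopes concrete, here is a minimal sketch. It is not the code from the resource-usage-collect-addon: the `Node` type, the function names, and the GPU counts are hypothetical, and it works on raw available GPU counts rather than on whatever normalized score the addon actually reports.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of one node's GPU capacity.
type Node struct {
	Name        string
	Allocatable int64 // GPUs the node can offer to pods
	Requested   int64 // GPUs already requested by running pods
}

// available returns the free GPUs on a node, never below zero.
func available(n Node) int64 {
	free := n.Allocatable - n.Requested
	if free < 0 {
		return 0
	}
	return free
}

// nodeScopeAvailable returns the free GPUs on the single best node.
func nodeScopeAvailable(nodes []Node) int64 {
	var best int64
	for _, n := range nodes {
		if a := available(n); a > best {
			best = a
		}
	}
	return best
}

// clusterScopeAvailable returns the free GPUs summed over all nodes.
func clusterScopeAvailable(nodes []Node) int64 {
	var total int64
	for _, n := range nodes {
		total += available(n)
	}
	return total
}

// fitsOnSingleNode reports whether one node alone can satisfy the request.
func fitsOnSingleNode(nodes []Node, request int64) bool {
	return nodeScopeAvailable(nodes) >= request
}

func main() {
	nodes := []Node{
		{Name: "node1", Allocatable: 4, Requested: 2},
		{Name: "node2", Allocatable: 4, Requested: 2},
		{Name: "node3", Allocatable: 4, Requested: 3},
	}
	fmt.Println("node scope:", nodeScopeAvailable(nodes))
	fmt.Println("cluster scope:", clusterScopeAvailable(nodes))
	fmt.Println("job needing 4 GPUs fits:", fitsOnSingleNode(nodes, 4))
}
```

With these sample numbers the cluster as a whole has 5 free GPUs, yet no single node has more than 2, which is the situation @ivan-cai describes: a job requesting 4 GPUs looks schedulable by the cluster scope value but cannot actually be placed on any node.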

haoqing0110 commented 1 month ago

Congratulations to @z1ens for completing the Google Summer of Code 2024 and contributing to the Open Cluster Management community.

The following PRs have been merged to our repos:

- GPU/TPU-resource-usage-collect-addon
- OCM Kueue Admission Check Controller

These contributions are also an important part of two KubeCon topics:

- Connecting the Dots: Towards a Unified Multi-Cluster AI/ML Experience
- Boundaryless Computing: Optimizing LLM Performance, Cost and Efficiency in Multi-Cloud Architecture

Thanks again for your contributions!