Open haoqing0110 opened 7 months ago
Hi @haoqing0110, I am interested in working on this project. Can we discuss this further?
I have gone through the project and understand that it requires an addon to collect and score clusters based on GPU/TPU resources (to be contributed to addon-contrib), plus a proposal for an external Kueue Admission Check controller that uses OCM's placement decisions for scheduling (community review needed).
Hello @haoqing0110 :) I am really interested in this GSoC project and looking forward to contributing useful code to OCM.
All in all, I am aware that this project is more challenging than building a FaaS, and I am ready to learn and work on it! Thank you for taking the time to read through this; I look forward to your reply. P.S. I have noticed that the OCM website supports documentation in Chinese. I can try to maintain it as well, since Chinese is my mother tongue.
Hello @Sayanjones @z1ens, thanks for being interested in this project. Feel free to join our community slack channel if you want to have further discussion.
@z1ens Thank you for your question, below are some of my thoughts:
cc @qiujian16
Hi @haoqing0110, my name is Khai. I came across this project in GSoC 2024, and I would love to be a contributor. I tried to join the Slack workspace but ran into the error "It looks like there isn’t an account on Kubernetes tied to this email address.". I look forward to discussing more with you!
https://communityinviter.com/apps/kubernetes/community
@k2nt you can get an invite here for the Slack channel.
Hi @mikeshng. Thank you for your email (and post)! I hope you can point me to the correct channel for this project (I assume that it is open-cluster-mgmt). I am posting here instead of replying via email so that other contributors can see this also.
Thanks @k2nt yes, the channel is #open-cluster-mgmt
Hello @haoqing0110, thank you for patiently answering my questions. Your ideas sound inspiring. I will take a look at the code, and I have just joined the Slack channel. Have a nice day!
Hi all, @haoqing0110 is going to talk more about this topic in this week's community meeting.
Please feel free to ask any questions here or during the meeting.
You can find the community meeting schedule here: https://calendar.google.com/calendar/u/0/embed?src=openclustermanagement@gmail.com
This has been selected to participate in this year's Google Summer of Code! 🎉 https://github.com/cncf/mentoring/discussions/1221
/assign @z1ens
@qiujian16 @haoqing0110 The resource-usage-collect agent needs to consider the available resources of each node. Sometimes the cluster resources are sufficient, but the node resources are insufficient.
@ivan-cai yes, I believe @z1ens's PR https://github.com/open-cluster-management-io/addon-contrib/pull/20 has been changed to calculate the score based on the max node resource. We also discussed whether we need both a cluster resource score and a node resource score; it seems the node resource score is more useful.
@ivan-cai Exactly as @haoqing0110 mentioned, I’ve implemented a scoring strategy in the resource-usage-collect-addon that includes both node scope and cluster scope scores. In Kubernetes, a job can only be scheduled if a single node in the cluster has resources >= the job's request. Therefore, linking the scoring mechanism to the node with the maximum available resources is logical. I also developed a cluster scope score that assesses the total available resources in the cluster, as sometimes cluster admins want to spread workloads across multiple clusters or nodes to enhance resource utilization.
Congratulations to @z1ens for completing the Google Summer of Code 2024 and contributing to the Open Cluster Management community.
The following PRs have been merged to our repos: GPU/TPU-resource-usage-collect-addon and OCM Kueue Admission Check Controller.
These contributions are also an important part of two KubeCon topics: "Connecting the Dots: Towards a Unified Multi-Cluster AI/ML Experience" and "Boundaryless Computing: Optimizing LLM Performance, Cost and Efficiency in Multi-Cloud Architecture".
Thanks again for your contributions!
This is one of the GSoC 2024 projects.
Announcement https://github.com/cncf/mentoring/discussions/1221
Google Summer of Code 2024 Timeline https://developers.google.com/open-source/gsoc/timeline
Description
Open Cluster Management (OCM) focuses on multicluster and multicloud management scenarios for Kubernetes applications. Open APIs are evolving within this project for cluster registration, workload distribution, dynamic placement of policies and workloads, and much more. The placement concept is used to dynamically select a set of clusters so that higher level users can either replicate Kubernetes resources to the member clusters or run their advanced workload. For example: as an application developer, I can deploy my workload to clusters with the most allocatable memory and CPU.
Now, with the rise of AI technology, there’s a growing need to schedule AI workloads based on GPU/TPU resources. In this project we want you to use the placement extensible scheduling mechanism to implement a GPU/TPU resource collector addon using the addon template and provide an AddonPlacementScore to make placement decisions based on GPU/TPU resources. We also want you to propose a customized external Kueue Admission Check controller that consumes the placement decision to schedule AI workloads among multiple clusters based on GPU/TPU resources.
Expected Outcome
Develop the GPU/TPU resource collector addon, including documentation of the addon architecture and a description of the AddonPlacementScore usage. Also, implement the addon using the addon template and contribute the code to the addon-contrib repository.
Deliver a proposal for the external Kueue Admission Check controller. The proposal should outline the API design and explain how the controller uses the OCM scheduling result and interacts with Kueue. The proposal needs to be reviewed in an OCM community meeting before it is final. You also need to deliver a prototype based on the proposal.
Recommended Skills
Golang, Kubernetes, Scheduling
Mentor(s)
Qing Hao (@haoqing0110, qhao@redhat.com) - primary
Jian Qiu (@qiujian16, jqiu@redhat.com)
References
Open Cluster Management
Placement concept
AddOn concept
Placement extensible scheduling mechanism
Build an addon with addon template
GPU on *KS, for example GPUs in GKE
Kueue Admission Check
Discussion
Feel free to raise your questions here. You can also reach out to us in the Slack channel. Failed to join by the link? See solutions at https://github.com/open-cluster-management-io/ocm/issues/369#issuecomment-1988798011 .