volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

[feature] Integrate with kwok to simulate mock GPU/NPU nodes #3829

Open JesseStutler opened 3 days ago

JesseStutler commented 3 days ago

What is the problem you're trying to solve

In production we may hit GPU/NPU-related bugs. Once we have confirmed that a bug comes from the scheduler, we need to reproduce it to locate the cause, but we often don't have a GPU or NPU environment available, so mock GPU/NPU nodes would help. Besides, extended resources such as GPUs/NPUs are currently just entries in the capacity/allocatable fields of a node. If we only need to verify scheduler behavior, it doesn't matter whether real GPU/NPU hardware sits behind those nodes.
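To illustrate the point about capacity/allocatable, here is a rough sketch in Python of what a mock node manifest could look like. The function name `make_fake_node`, the node name, the CPU/memory figures, and the resource name `nvidia.com/gpu` are assumptions chosen for illustration. The `kwok.x-k8s.io/node` annotation, the `type: kwok` label, and the taint follow the conventions I recall from the kwok docs, so please double-check them against the kwok version in use.

```python
import json


def make_fake_node(name, extended_resources):
    """Build a Node manifest (as a dict) carrying mock extended resources.

    extended_resources maps a resource name to a quantity string,
    e.g. {"nvidia.com/gpu": "8"}. Nothing here talks to a cluster.
    """
    # Baseline resources are arbitrary placeholder values (assumption).
    resources = {"cpu": "32", "memory": "256Gi", "pods": "110"}
    # The GPU/NPU count is just one more key next to cpu and memory.
    resources.update(extended_resources)
    return {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {
            "name": name,
            # Annotation/label as used in the kwok examples (verify for your version).
            "annotations": {"kwok.x-k8s.io/node": "fake"},
            "labels": {"kubernetes.io/hostname": name, "type": "kwok"},
        },
        "spec": {
            # Keeps ordinary workloads off the mock node unless they tolerate it.
            "taints": [
                {"key": "kwok.x-k8s.io/node", "value": "fake", "effect": "NoSchedule"}
            ]
        },
        "status": {
            "capacity": dict(resources),
            "allocatable": dict(resources),
        },
    }


if __name__ == "__main__":
    # Hypothetical node name; swap the resource name for an NPU one if needed.
    node = make_fake_node("mock-gpu-node-0", {"nvidia.com/gpu": "8"})
    print(json.dumps(node, indent=2))
```

Since kubectl accepts JSON as well as YAML, the output could be piped straight into `kubectl apply -f -` on a cluster where kwok is running.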

Describe the solution you'd like

We could follow this guide and write a shell script that integrates kwok into Volcano. Users who need mock NPU/GPU nodes could then simply create a node YAML and test scheduling against it: https://kwok.sigs.k8s.io/docs/user/kwok-manage-nodes-and-pods/
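For the "test scheduling" step, a test pod might look roughly like the sketch below. It assumes the mock node advertises a resource named `nvidia.com/gpu` and carries the `kwok.x-k8s.io/node` taint used in the kwok examples. The helper name `make_test_pod`, the pod name, and the `busybox` image are hypothetical. `schedulerName: volcano` is what routes the pod to the Volcano scheduler.

```python
import json


def make_test_pod(name, resource_name, count):
    """Build a Pod manifest (as a dict) that requests a mock extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Ask Volcano, not the default scheduler, to place this pod.
            "schedulerName": "volcano",
            # Tolerate the taint that kwok-style mock nodes usually carry (assumption).
            "tolerations": [
                {
                    "key": "kwok.x-k8s.io/node",
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": "main",
                    # The image is never pulled on a simulated node; placeholder only.
                    "image": "busybox",
                    # Extended resources are requested through limits.
                    "resources": {"limits": {resource_name: str(count)}},
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = make_test_pod("mock-gpu-pod", "nvidia.com/gpu", 1)
    print(json.dumps(pod, indent=2))
```

After applying the output with `kubectl apply -f -`, `kubectl get pod mock-gpu-pod -o wide` should show which mock node the scheduler picked, or leave the pod Pending if no node has enough of the requested resource.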

Additional context

No response