koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0

[proposal] DeviceShare supports allocating GPUs of different gpu memory sizes when gpu-memory is not requested explicitly #2191

Closed: saintube closed this issue 2 months ago

saintube commented 2 months ago

What is your proposal:

Currently, when a pod requests a GPU card without specifying a GPU memory request, DeviceShare scheduling assumes that all GPUs on a node have the same memory capacity and randomly picks one GPU's memory size as the pod's gpu-memory request. This assumption breaks when a node has GPUs with different memory sizes, and it may prevent the pod from being allocated the remaining GPUs. Since a pod sometimes requests just a GPU card, regardless of the size of its GPU memory, DeviceShare should allow this allocation.
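To make the failure mode concrete, here is a minimal Go sketch of the situation described above. It is a simplified model and not Koordinator's actual code: the `GPU` type, `allocateAssumingUniform`, and `allocateWholeCards` are hypothetical names introduced for illustration, and the exact-match check on memory size is an assumption about how a derived gpu-memory request could exclude cards of a different size.

```go
package main

import "fmt"

// GPU is a hypothetical, simplified view of one device on a node.
type GPU struct {
	Minor       int   // device index on the node
	MemoryBytes int64 // total memory of the card
	Free        bool  // true if the whole card is unallocated
}

// allocateAssumingUniform models the behavior described in the proposal:
// the memory size of one GPU (index ref) is taken as the per-card
// gpu-memory request, and only free GPUs of exactly that size are accepted.
// The exact-match rule is an assumption made for this sketch.
func allocateAssumingUniform(gpus []GPU, count, ref int) []int {
	if count <= 0 || ref < 0 || ref >= len(gpus) {
		return nil
	}
	want := gpus[ref].MemoryBytes
	var picked []int
	for _, g := range gpus {
		if g.Free && g.MemoryBytes == want {
			picked = append(picked, g.Minor)
			if len(picked) == count {
				return picked
			}
		}
	}
	return nil // not enough GPUs match the derived request
}

// allocateWholeCards models the proposed behavior: when gpu-memory is not
// requested explicitly, any free card qualifies regardless of its size.
func allocateWholeCards(gpus []GPU, count int) []int {
	if count <= 0 {
		return nil
	}
	var picked []int
	for _, g := range gpus {
		if g.Free {
			picked = append(picked, g.Minor)
			if len(picked) == count {
				return picked
			}
		}
	}
	return nil // not enough free GPUs
}

func main() {
	// A node with two cards of different sizes; the 16 GiB card is in use.
	gpus := []GPU{
		{Minor: 0, MemoryBytes: 16 << 30, Free: false},
		{Minor: 1, MemoryBytes: 32 << 30, Free: true},
	}
	// If the 16 GiB card happens to be chosen as the reference size,
	// the free 32 GiB card is rejected and the pod cannot be placed.
	fmt.Println("uniform assumption:", allocateAssumingUniform(gpus, 1, 0))
	// Ignoring memory size lets the pod take the remaining card.
	fmt.Println("whole-card request:", allocateWholeCards(gpus, 1))
}
```

In this model the outcome under the uniform-size assumption depends on which card is picked as the reference, which mirrors the randomness mentioned in the proposal. Matching on free cards alone removes that dependency.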

Why is this needed:

Is there a suggested solution, if so, please add it: