volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

GPU being allocated repeatedly #3824

Open linuxfhy opened 1 week ago

linuxfhy commented 1 week ago

Description

When multiple pods each request a portion of the GPUs on a node, the same GPU can be allocated more than once. As shown below, three pods are scheduled to the same node, which has 8 GPUs (ids 0,1,2,3,4,5,6,7). The three pods request 2, 2, and 4 GPUs respectively. A further condition is that the pods must not be launched at the same time, i.e. they must not be processed in the same Volcano scheduling loop.

The result is that GPUs 1 and 2 are allocated to pod1 and also to pod3, which I believe is incorrect.

```
kubectl -n ai-vc get po notebook-pod1 -o yaml
    volcano.sh/gpu-index: 1,2
    creationTimestamp: "2024-11-04T02:03:48Z"

kubectl -n ai-vc get po notebook-pod2 -o yaml
    volcano.sh/gpu-index: 7,0
    creationTimestamp: "2024-11-06T06:43:32Z"

kubectl -n ai-vc get po notebook-pod3 -o yaml
    volcano.sh/gpu-index: 0,1,2,3
    creationTimestamp: "2024-11-07T10:23:46Z"
```
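To check for overlapping assignments on your own cluster, one convenient way (not part of the original report, and assuming jq is available) is to dump the gpu-index annotation of every pod in the namespace and look for GPU ids that appear under more than one Running pod on the same node:

```
kubectl -n ai-vc get pods -o json \
  | jq -r '.items[] | [.metadata.name, (.metadata.annotations["volcano.sh/gpu-index"] // "-")] | @tsv'
```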

Steps to reproduce the issue

1. Deploy Volcano (v1.7.0) and volcano-device-plugin correctly. Describe the GPU node and confirm that it reports 4 "volcano.sh/gpu-number" resources.
2. Launch 4 pods one by one at 3-second intervals, each requesting one GPU resource:

```
kubectl apply -f gpu-test-gpu3-1.yaml; sleep 3
kubectl apply -f gpu-test-gpu3-2.yaml; sleep 3
kubectl apply -f gpu-test-gpu3-3.yaml; sleep 3
kubectl apply -f gpu-test-gpu3-4.yaml; sleep 3
```

The test YAML is shown below (it is truncated after `tolerations:`; a sketched completion follows it):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-gpu3-1   # or 2, 3, 4
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: gpu3
  schedulerName: volcano
  tolerations:
```
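The rest of the spec is cut off above. A minimal completion from `tolerations:` onward is sketched below, assuming each pod requests a single GPU via the `volcano.sh/gpu-number` extended resource as described in the steps; the toleration, container name, and image are placeholders, not taken from the original report:

```yaml
  # Assumed completion, not from the original report:
  tolerations:
    - operator: Exists                              # placeholder toleration
  containers:
    - name: cuda-container                          # hypothetical container name
      image: nvidia/cuda:12.2.0-base-ubuntu22.04    # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          volcano.sh/gpu-number: 1                  # one GPU per pod
```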

Describe the results you received and expected

Result received: all pods are Running, but some pods have the same gpu-index.

```
[root@master1 fhy]# kubectl get po | grep gpu3
gpu-test-gpu3-1   1/1   Running   0   13s
gpu-test-gpu3-2   1/1   Running   0   10s
gpu-test-gpu3-3   1/1   Running   0   6s
gpu-test-gpu3-4   1/1   Running   0   3s
[root@master1 fhy]# kubectl describe po gpu-test-gpu3-1 | grep gpu-index
volcano.sh/gpu-index: 3
[root@master1 fhy]# kubectl describe po gpu-test-gpu3-2 | grep gpu-index
volcano.sh/gpu-index: 0
[root@master1 fhy]# kubectl describe po gpu-test-gpu3-3 | grep gpu-index
volcano.sh/gpu-index: 3
[root@master1 fhy]# kubectl describe po gpu-test-gpu3-4 | grep gpu-index
volcano.sh/gpu-index: 0
```

Result expected: all pods Running, with each pod's gpu-index different from the others.

```
[root@master1 fhy]# kubectl get po | grep gpu3
gpu-test-gpu3-1   1/1   Running   0   12s
gpu-test-gpu3-2   1/1   Running   0   9s
gpu-test-gpu3-3   1/1   Running   0   6s
gpu-test-gpu3-4   1/1   Running   0   3s
[root@master1 fhy]# kubectl describe po gpu-test-gpu3-1 | grep gpu-index
volcano.sh/gpu-index: 1
[root@master1 fhy]# kubectl describe po gpu-test-gpu3-2 | grep gpu-index
volcano.sh/gpu-index: 0
[root@master1 fhy]# kubectl describe po gpu-test-gpu3-3 | grep gpu-index
volcano.sh/gpu-index: 2
[root@master1 fhy]# kubectl describe po gpu-test-gpu3-4 | grep gpu-index
volcano.sh/gpu-index: 3
```

What version of Volcano are you using?

v1.7.0

Any other relevant information

I think the reason is:
ssn.Node[nodename].GPUDevices[dev-id] records the GPUs occupied by existing pods as well as the GPUs allocated in the current scheduling loop. However, pods that request GPUs via "volcano.sh/gpu-number" are not recorded there, so a GPU that is already allocated is mistakenly considered idle and can be handed out again.
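To illustrate the effect, here is a simplified, self-contained sketch (these are not Volcano's actual types or functions): when the per-device pod map is only populated for gpu-memory pods, an "idle GPU" query still returns devices that gpu-number pods are already using.

```go
package main

import "fmt"

// GPUDevice is an illustrative stand-in for the scheduler's per-node
// GPU bookkeeping; the real structure lives in Volcano's scheduler API.
type GPUDevice struct {
	ID     int
	PodMap map[string]bool // pods recorded as using this device
}

// unusedGPUs returns the ids of devices with no recorded pods.
// This mirrors the "is this GPU idle?" decision that goes wrong.
func unusedGPUs(devices map[int]*GPUDevice) []int {
	var ids []int
	for id, dev := range devices {
		if len(dev.PodMap) == 0 {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	devices := map[int]*GPUDevice{
		0: {ID: 0, PodMap: map[string]bool{}},
		1: {ID: 1, PodMap: map[string]bool{}},
	}

	// A gpu-memory pod is recorded against device 0 ...
	devices[0].PodMap["pod-a"] = true
	// ... but a gpu-number pod that was actually bound to device 1
	// is never added to the map, so device 1 still looks idle.

	fmt.Println("devices reported as idle:", unusedGPUs(devices))
	// Prints: devices reported as idle: [1]
	// Device 1 can now be allocated a second time.
}
```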

My solution is shown below:

```go
// Before the modification:
func (ni *NodeInfo) AddGPUResource(pod *v1.Pod) {
	gpuRes := GetGPUMemoryOfPod(pod)
	if gpuRes > 0 { // only considers pods using gpu-memory
		ids := GetGPUIndex(pod)
		for _, id := range ids {
			if dev := ni.GPUDevices[id]; dev != nil {
				dev.PodMap[string(pod.UID)] = pod
			}
		}
	}
}
```

```go
// After the modification:
func (ni *NodeInfo) AddGPUResource(pod *v1.Pod) {
	gpuRes := GetGPUMemoryOfPod(pod)
	gpuNumRes := GetGPUNumberOfPod(pod)
	// consider both pods using gpu-memory and pods using gpu-number
	if gpuRes > 0 || gpuNumRes > 0 {
		ids := GetGPUIndex(pod)
		for _, id := range ids {
			if dev := ni.GPUDevices[id]; dev != nil {
				dev.PodMap[string(pod.UID)] = pod
			}
		}
	}
}
```
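For reference, the gpu-number lookup used above only needs to sum the `volcano.sh/gpu-number` limits over the pod's containers. A minimal sketch of such a helper is below; the name `getGPUNumberOfPod` and its exact behavior are illustrative only, and Volcano's own `GetGPUNumberOfPod` may differ in detail.

```go
import (
	v1 "k8s.io/api/core/v1"
)

// getGPUNumberOfPod is an illustrative sketch, not Volcano's implementation:
// it sums the "volcano.sh/gpu-number" limits over all containers of a pod.
func getGPUNumberOfPod(pod *v1.Pod) int {
	var total int64
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Limits[v1.ResourceName("volcano.sh/gpu-number")]; ok {
			total += q.Value()
		}
	}
	return int(total)
}
```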

linuxfhy commented 3 days ago

If you can read Chinese, please refer to the problem analysis in the attached document: GPU重复分配问题.docx ("GPU repeated allocation problem").