Closed WingkaiHo closed 1 year ago
In my test environment, the machine has six GPUs.
Three of the GPUs are damaged and hidden by the NVIDIA driver.
We use this unit test to list the GPUs:
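func TestNvmlLiteDev(t *testing.T) {
	// Initialize NVML before querying any device.
	if err := nvml.Init(); err != nil {
		t.Fatalf("nvml.Init failed: %v", err)
	}
	n, err := nvml.GetDeviceCount()
	if err != nil {
		t.Fatalf("nvml.GetDeviceCount failed: %v", err)
	}
	// Enumerate devices by NVML index, 0 to n-1.
	for i := uint(0); i < n; i++ {
		d, err := nvml.NewDeviceLite(i)
		if err != nil {
			t.Fatalf("nvml.NewDeviceLite(%d) failed: %v", i, err)
		}
		fmt.Printf("%d %+v\n", i, d)
	}
}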
Run result:
0 &{handle:{dev:0x7efee1c60e58} UUID:GPU-0a63b8a0-8f98-1919-cb8f-d921d663ceeb Path:/dev/nvidia1 Model:<nil> Power:<nil> Memory:<nil> CPUAffinity:0xc00036e490 PCI:{BusID:00000000:05:00.0 BAR1:<nil> Bandwidth:<nil>} Clocks:{Cores:<nil> Memory:<nil>} Topology:[] CudaComputeCapability:{Major:<nil> Minor:<nil>}}
1 &{handle:{dev:0x7efee1c77938} UUID:GPU-13b320da-04ca-0b45-d9ad-7247b93f4cc7 Path:/dev/nvidia4 Model:<nil> Power:<nil> Memory:<nil> CPUAffinity:0xc00036e580 PCI:{BusID:00000000:86:00.0 BAR1:<nil> Bandwidth:<nil>} Clocks:{Cores:<nil> Memory:<nil>} Topology:[] CudaComputeCapability:{Major:<nil> Minor:<nil>}}
2 &{handle:{dev:0x7efee1c8e418} UUID:GPU-5968ca31-fa62-241b-2d42-e6cb2edc4ba8 Path:/dev/nvidia5 Model:<nil> Power:<nil> Memory:<nil> CPUAffinity:0xc00036e680 PCI:{BusID:00000000:8A:00.0 BAR1:<nil> Bandwidth:<nil>} Clocks:{Cores:<nil> Memory:<nil>} Topology:[] CudaComputeCapability:{Major:<nil> Minor:<nil>}}
The order of the GPU list is the same as in nvidia-smi. If the device plugin uses the ID from the path /dev/nvidiaX as the device index, the IDs are 1, 4, 5. But the scheduler assigns GPU IDs from 0 to 2, so the plugin fails to find the GPU ID, which is unexpected. For example:
W0826 02:57:02.676741 1 server.go:326] Failed to find the dev for pod default/nginx-deployment-f5f484f7d-r9zcc because it's not able to find dev with index 0
Using 0 to DeviceCount - 1 as the GPU index for NVIDIA_VISIBLE_DEVICES is more compatible.
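The mismatch can be sketched with a small standalone program. This is only an illustration, not the plugin's actual code: the `gpu` type and the helpers `sampleDevices`, `pathIndex`, `findByPathIndex`, and `findByEnumIndex` are hypothetical names, and the device list is hard-coded from the test output above instead of being read from NVML.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpu is a hypothetical, simplified device record (not the plugin's real type).
type gpu struct {
	UUID string
	Path string // device node, e.g. "/dev/nvidia4"
}

// sampleDevices returns the three visible GPUs in NVML enumeration order,
// copied from the test output in this PR.
func sampleDevices() []gpu {
	return []gpu{
		{UUID: "GPU-0a63b8a0-8f98-1919-cb8f-d921d663ceeb", Path: "/dev/nvidia1"},
		{UUID: "GPU-13b320da-04ca-0b45-d9ad-7247b93f4cc7", Path: "/dev/nvidia4"},
		{UUID: "GPU-5968ca31-fa62-241b-2d42-e6cb2edc4ba8", Path: "/dev/nvidia5"},
	}
}

// pathIndex extracts the number from a device path: "/dev/nvidia4" -> 4.
func pathIndex(path string) (int, error) {
	return strconv.Atoi(strings.TrimPrefix(path, "/dev/nvidia"))
}

// findByPathIndex looks a device up by the number in its /dev/nvidiaX path.
func findByPathIndex(devs []gpu, idx int) (gpu, bool) {
	for _, d := range devs {
		if n, err := pathIndex(d.Path); err == nil && n == idx {
			return d, true
		}
	}
	return gpu{}, false
}

// findByEnumIndex looks a device up by its position in the enumeration
// order, which always covers 0 to len(devs)-1 without gaps.
func findByEnumIndex(devs []gpu, idx int) (gpu, bool) {
	if idx < 0 || idx >= len(devs) {
		return gpu{}, false
	}
	return devs[idx], true
}

func main() {
	devs := sampleDevices()
	// The scheduler hands out indexes 0 to DeviceCount-1.
	for idx := 0; idx < len(devs); idx++ {
		_, okPath := findByPathIndex(devs, idx)
		d, okEnum := findByEnumIndex(devs, idx)
		fmt.Printf("index %d: found by path id=%v, found by enumeration=%v (%s)\n",
			idx, okPath, okEnum, d.Path)
	}
}
```

With this device list, the path-based lookup finds nothing for indexes 0 and 2, and for index 1 it returns /dev/nvidia1, which is a different device from the one at enumeration position 1 (/dev/nvidia4). The enumeration-based lookup resolves every index the scheduler can assign.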
@Thor-wl @william-wang
/assign @Thor-wl
/cc @william-wang
I have used this PR in production for 3 months. It fixes the UnexpectedAdmissionError that occurs on machines where GPUs are disabled by the NVIDIA driver. @Thor-wl
Are there any other questions before merging the code? @william-wang @shinytang6
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Thor-wl, william-wang
Signed-off-by: yougjiahe yongjiahe@tuputech.com
Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected
If you have used nvidia-smi to hide GPUs, the indexes change and the number of devices decreases.
Or, if some GPUs are damaged, they are disabled by the driver. In both cases the IDs of /dev/nvidiaX are not contiguous. Using the NVML GPU sequence index instead of the device path ID is more reasonable.