volcano-sh / devices

Device plugins for Volcano, e.g. GPU
Apache License 2.0

Fix #17 Warning UnexpectedAdmissionError #29

Closed WingkaiHo closed 1 year ago

WingkaiHo commented 2 years ago

Signed-off-by: yougjiahe yongjiahe@tuputech.com

Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected

If you have used nvidia-smi to hide a GPU, the device indexes change and the device count decreases. For example:

nvidia-smi drain -p <id> -m 0

Alternatively, a damaged GPU may be disabled by the driver. In either case the IDs in the /dev/nvidiaX paths are no longer contiguous, so it is more reasonable to use the NVML GPU sequence index instead of the device path ID.

WingkaiHo commented 2 years ago

In my test environment, the machine has six GPUs: [screenshot: WeCom screenshot 16618255582517]

Three of the GPUs are damaged and hidden by the NVIDIA driver: [screenshot: WeCom screenshot 16618257872691]

We list the GPUs with the following unit test:

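// Assumed import path for the NVML Go bindings; adjust to the vendored copy.
import (
    "fmt"
    "testing"

    "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

// TestNvmlLiteDev prints every GPU that NVML enumerates, keyed by its index.
func TestNvmlLiteDev(t *testing.T) {
    if err := nvml.Init(); err != nil {
        t.Fatalf("nvml.Init: %v", err)
    }
    defer nvml.Shutdown()

    n, err := nvml.GetDeviceCount()
    if err != nil {
        t.Fatalf("nvml.GetDeviceCount: %v", err)
    }
    for i := uint(0); i < n; i++ {
        d, err := nvml.NewDeviceLite(i)
        if err != nil {
            t.Fatalf("nvml.NewDeviceLite(%d): %v", i, err)
        }
        fmt.Printf("%d %+v\n", i, d)
    }
}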

Run result:

0 &{handle:{dev:0x7efee1c60e58} UUID:GPU-0a63b8a0-8f98-1919-cb8f-d921d663ceeb Path:/dev/nvidia1 Model:<nil> Power:<nil> Memory:<nil> CPUAffinity:0xc00036e490 PCI:{BusID:00000000:05:00.0 BAR1:<nil> Bandwidth:<nil>} Clocks:{Cores:<nil> Memory:<nil>} Topology:[] CudaComputeCapability:{Major:<nil> Minor:<nil>}}
1 &{handle:{dev:0x7efee1c77938} UUID:GPU-13b320da-04ca-0b45-d9ad-7247b93f4cc7 Path:/dev/nvidia4 Model:<nil> Power:<nil> Memory:<nil> CPUAffinity:0xc00036e580 PCI:{BusID:00000000:86:00.0 BAR1:<nil> Bandwidth:<nil>} Clocks:{Cores:<nil> Memory:<nil>} Topology:[] CudaComputeCapability:{Major:<nil> Minor:<nil>}}
2 &{handle:{dev:0x7efee1c8e418} UUID:GPU-5968ca31-fa62-241b-2d42-e6cb2edc4ba8 Path:/dev/nvidia5 Model:<nil> Power:<nil> Memory:<nil> CPUAffinity:0xc00036e680 PCI:{BusID:00000000:8A:00.0 BAR1:<nil> Bandwidth:<nil>} Clocks:{Cores:<nil> Memory:<nil>} Topology:[] CudaComputeCapability:{Major:<nil> Minor:<nil>}}

The order of this GPU list matches nvidia-smi. If the device plugin uses the path ID from /dev/nvidiaX as the device index, the IDs are 1, 4, and 5. The scheduler, however, assigns GPU IDs from 0 to 2, so allocation fails with "failed to find gpu id, which is unexpected". For example:

W0826 02:57:02.676741       1 server.go:326] Failed to find the dev for pod default/nginx-deployment-f5f484f7d-r9zcc because it's not able to find dev with index 0

WingkaiHo commented 2 years ago

Using 0 to DeviceCount - 1 as the GPU index for NVIDIA_VISIBLE_DEVICES is more compatible.

WingkaiHo commented 2 years ago

@Thor-wl @william-wang

WingkaiHo commented 2 years ago

/assign @Thor-wl

WingkaiHo commented 1 year ago

/cc @william-wang

WingkaiHo commented 1 year ago

I have used this PR in production for 3 months. It fixes the UnexpectedAdmissionError on machines where a GPU has been disabled by the NVIDIA driver. @Thor-wl

WingkaiHo commented 1 year ago

Are there any remaining questions before this can be merged? @william-wang @shinytang6

volcano-sh-bot commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Thor-wl, william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/volcano-sh/devices/blob/master/OWNERS)~~ [william-wang]

Approvers can indicate their approval by writing `/approve` in a comment

Approvers can cancel approval by writing `/approve cancel` in a comment