volcano-sh / devices

Device plugins for Volcano, e.g. GPU
Apache License 2.0

Warning UnexpectedAdmissionError #17

Closed GLL550C closed 1 year ago

GLL550C commented 2 years ago

Running a pod fails with: Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected

GLL550C commented 2 years ago

@Jeffwan

Thor-wl commented 2 years ago

/bug This bug has also been reported in the Volcano WeChat group.

Thor-wl commented 2 years ago

/cc @peiniliu @william-wang Can you help with that?

Thor-wl commented 2 years ago

Can you share the reproduction steps and your environment? I will try to reproduce it.

WingkaiHo commented 2 years ago

If you have used nvidia-smi to hide a GPU, the device indices change and the number of devices decreases. For example:

nvidia-smi drain -p <id> -m 0

The scheduler never deletes GPU device IDs when the number of GPUs decreases: https://github.com/volcano-sh/volcano/pull/2215

WingkaiHo commented 2 years ago

If you have used nvidia-smi to hide a GPU, the device indices change and the number of devices decreases. For example:

nvidia-smi drain -p <id> -m 0

Or a GPU is damaged and has been disabled by the driver, such as device2, but device1 is still present in the /dev path.

When nvidia-smi is used to hide a GPU, for example /dev/nvidia0, the device plugin initializes its index map like this:

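    // Keys are the numbers parsed from the device paths, not the loop
    // index i, so a hidden /dev/nvidia0 leaves no entry for key 0.
    deviceByIndex := map[uint]string{}
    for i := uint(0); i < n; i++ {
        d, err := nvml.NewDevice(i)
        check(err)
        var id uint
        _, err = fmt.Sscanf(d.Path, "/dev/nvidia%d", &id)
        check(err)
        deviceByIndex[id] = d.UUID
        // TODO: Do we assume all cards are of same capacity
    }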

So the device plugin has no GPU with index 0, but the scheduler starts GPU indices at 0, so allocation fails with: Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected error
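The mismatch described above can be reproduced without a GPU. The following is only a minimal sketch, not the plugin's actual code: `buildDeviceByIndex` and `findUUID` are hypothetical helpers, the UUIDs are made up, and it assumes the scheduler hands out indices 0..n-1 based on the reported device count.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoGPU mirrors the message seen in the kubelet event.
var errNoGPU = errors.New("failed to find gpu id")

// buildDeviceByIndex (hypothetical) mimics the init loop quoted above:
// the map key is the number parsed from the device path, not the
// position of the device in the enumeration.
func buildDeviceByIndex(paths, uuids []string) (map[uint]string, error) {
	deviceByIndex := map[uint]string{}
	for i, p := range paths {
		var id uint
		if _, err := fmt.Sscanf(p, "/dev/nvidia%d", &id); err != nil {
			return nil, fmt.Errorf("parse %q: %w", p, err)
		}
		deviceByIndex[id] = uuids[i]
	}
	return deviceByIndex, nil
}

// findUUID (hypothetical) stands in for the lookup done at Allocate time.
func findUUID(m map[uint]string, schedulerIndex uint) (string, error) {
	uuid, ok := m[schedulerIndex]
	if !ok {
		return "", errNoGPU
	}
	return uuid, nil
}

func main() {
	// /dev/nvidia0 is hidden, so only two devices are enumerated.
	m, err := buildDeviceByIndex(
		[]string{"/dev/nvidia1", "/dev/nvidia2"},
		[]string{"GPU-aaaa", "GPU-bbbb"}, // made-up UUIDs
	)
	if err != nil {
		panic(err)
	}
	// Assumption: the scheduler sees 2 GPUs and assigns indices 0 and 1.
	for idx := uint(0); idx < 2; idx++ {
		uuid, err := findUUID(m, idx)
		fmt.Println(idx, uuid, err)
	}
}
```

In this sketch the lookup for index 0 fails while index 1 resolves, which matches the symptom in the report: the pod is scheduled onto a GPU index that the plugin's map does not contain.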

631068264 commented 1 year ago

Same issue: https://github.com/volcano-sh/volcano/issues/2701. I am using Volcano 1.7 and https://github.com/volcano-sh/devices/blob/release-1.0/volcano-device-plugin.yml

I did not use

nvidia-smi drain -p <id> -m 0