Closed GLL550C closed 1 year ago
@Jeffwan
/bug This bug has also been mentioned in the Volcano WeChat group.
/cc @peiniliu @william-wang Can you help with that?
Can you share the reproduction steps and your environment? Let me give it a try.
If you have used nvidia-smi to hide a GPU, the device indices change and the number of devices decreases. For example:
nvidia-smi drain -p <id> -m 0
IDs of GPU devices are never deleted in the scheduler when the number of GPUs decreases: https://github.com/volcano-sh/volcano/pull/2215
If you have used nvidia-smi to hide a GPU, the device indices change and the number of devices decreases. For example:
nvidia-smi drain -p <id> -m 0
Or a GPU is damaged and has been disabled by the driver, but its device node is still present under the /dev path.
When nvidia-smi is used to hide a GPU, for example /dev/nvidia0, the device plugin initializes its index map like this:
deviceByIndex := map[uint]string{}
for i := uint(0); i < n; i++ {
    d, err := nvml.NewDevice(i)
    check(err)
    var id uint
    _, err = fmt.Sscanf(d.Path, "/dev/nvidia%d", &id)
    check(err)
    deviceByIndex[id] = d.UUID
    // TODO: Do we assume all cards are of same capacity
}
So the device plugin has no GPU at index 0, but the scheduler starts GPU indices at 0, so allocation fails with:
Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected
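To make the mismatch concrete, here is a minimal sketch of what I think is happening. It is not the plugin's actual code. `buildDeviceByIndex` and `lookupGPU` are hypothetical helpers, the UUIDs are made up, and the path/UUID slices stand in for what `nvml.NewDevice(i)` would return. The assumption is that the scheduler numbers GPUs 0..n-1 while the plugin keys its map by the number parsed from the device path.

```go
package main

import (
	"errors"
	"fmt"
)

// buildDeviceByIndex mirrors the plugin loop quoted above: the map key is the
// number parsed from the device path, not the enumeration position.
// paths and uuids are hypothetical stand-ins for nvml.NewDevice(i) results.
func buildDeviceByIndex(paths, uuids []string) (map[uint]string, error) {
	if len(paths) != len(uuids) {
		return nil, errors.New("paths and uuids must have the same length")
	}
	deviceByIndex := map[uint]string{}
	for i, p := range paths {
		var id uint
		if _, err := fmt.Sscanf(p, "/dev/nvidia%d", &id); err != nil {
			return nil, fmt.Errorf("parse %q: %w", p, err)
		}
		deviceByIndex[id] = uuids[i]
	}
	return deviceByIndex, nil
}

// lookupGPU is a hypothetical stand-in for the lookup done at Allocate time.
func lookupGPU(deviceByIndex map[uint]string, idx uint) (string, error) {
	uuid, ok := deviceByIndex[idx]
	if !ok {
		return "", errors.New("failed to find gpu id")
	}
	return uuid, nil
}

func main() {
	// /dev/nvidia0 has been hidden, so only two devices are enumerated.
	m, err := buildDeviceByIndex(
		[]string{"/dev/nvidia1", "/dev/nvidia2"},
		[]string{"GPU-aaaa", "GPU-bbbb"},
	)
	if err != nil {
		panic(err)
	}
	// A scheduler that numbers GPUs from 0 asks for index 0 first.
	if _, err := lookupGPU(m, 0); err != nil {
		fmt.Println("index 0:", err)
	}
	// Index 1 exists because /dev/nvidia1 is still visible.
	if uuid, err := lookupGPU(m, 1); err == nil {
		fmt.Println("index 1:", uuid)
	}
}
```

With two visible devices the scheduler would presumably hand out indices 0 and 1, but the map only has keys 1 and 2, so the first allocation on index 0 fails the lookup.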
Same issue here: https://github.com/volcano-sh/volcano/issues/2701. I am using Volcano 1.7 and https://github.com/volcano-sh/devices/blob/release-1.0/volcano-device-plugin.yml
I did not run
nvidia-smi drain -p <id> -m 0
Running a pod fails with this error: Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected