lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Adding an Nvidia GPU works sporadically #946

Open C0rn3j opened 1 week ago

C0rn3j commented 1 week ago

Required information

Issue description

c0rn3j@Luxuria : ~
[0] % incus config show ai            
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Archlinux current amd64 (20240425_04:43)
  image.os: Archlinux
  image.release: current
  image.requirements.secureboot: "false"
  image.serial: "20240425_04:43"
  image.type: squashfs
  image.variant: default
  nvidia.runtime: "true"
  volatile.base_image: 4f39fcabe30ee9c3a36da0f317ebd1d43a83d405edcad3c0d2be0ef868079e39
  volatile.cloud-init.instance-id: a44a0ce2-118a-4e05-a2fe-8f7c1f45b8fe
  volatile.eth0.host_name: veth190c1d08
  volatile.eth0.hwaddr: 00:16:3e:06:2c:96
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: bce2b402-db8a-4808-8aeb-27cb3457621c
  volatile.uuid.generation: bce2b402-db8a-4808-8aeb-27cb3457621c
devices:
  gpu:
    type: gpu
ephemeral: false
profiles:
- default
stateful: false
description: ""

I have added a GPU device to the container, but it only works sporadically. I notice this especially after a driver update followed by a host reboot: the GPU devices do not seem to be added back into the container properly until I restart the container.
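For reference, given the bare `gpu: type: gpu` entry in the config above, the device was presumably added with a plain device add (no PCI address or vendor/product ID pinned), roughly:

```shell
# Add a generic GPU device named "gpu" to the container "ai";
# with no id/pci options, incus passes through all host GPUs
incus config device add ai gpu gpu

# Enable the NVIDIA runtime so driver libraries are mapped in
incus config set ai nvidia.runtime=true
```

This is a sketch reconstructed from the `incus config show ai` output, not the exact commands that were run.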

Unsure yet how to actually reproduce.

Here's a demo of the broken container springing back to life after a container restart:

c0rn3j@Luxuria : ~
[0] % incus exec ai -- zsh -c 'ls -lah /dev/nvi*'   
crw-rw-rw- 1 nobody nobody 195, 255 Jun 14 18:46 /dev/nvidiactl

c0rn3j@Luxuria : ~
[0] % incus restart ai    

c0rn3j@Luxuria : ~
[0] % incus exec ai -- zsh -c 'ls -lah /dev/nvi*'
crw-rw-rw- 1 nobody nobody 236,   0 Jun 14 18:46 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nobody 236,   1 Jun 14 18:46 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root   root   195,   0 Jun 18 12:45 /dev/nvidia0
crw-rw-rw- 1 nobody nobody 195, 255 Jun 14 18:46 /dev/nvidiactl

Information to attach

stgraber commented 1 week ago

Could be some kind of race condition between the NVIDIA driver stuff loading and the container starting?

Can you maybe try boot.autostart=false on the container so it doesn't start when the system boots up, and see if things then behave properly when you first incus start it manually?
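The suggested workaround would look something like this (assuming the container name "ai" from the report above):

```shell
# Keep the container from racing the NVIDIA driver at host boot
incus config set ai boot.autostart=false

# Once the host is fully up and the nvidia modules are loaded,
# start the container manually and check the device nodes
incus start ai
incus exec ai -- zsh -c 'ls -lah /dev/nvi*'
```

If the devices then appear with correct ownership every time, that would point at a startup race between the driver loading and the container start.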