lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

GPU not released on VM stop #30

Open itzsimpl opened 1 year ago

itzsimpl commented 1 year ago

Following https://github.com/lxc/lxc/issues/4332#issuecomment-1669029013 I'm opening the issue here.

Required information

 * Distribution: Ubuntu
 * Distribution version: 22.04
 * The output of
   * `lxc-start --version`
     ```
     5.0.0~git2209-g5a7b9ce67
     ```
   * `lxc-checkconfig`
     ```
     LXC version 5.0.0~git2209-g5a7b9ce67
     Kernel configuration not found at /proc/config.gz; searching...
     Kernel configuration found at /boot/config-6.2.0-26-generic
     --- Namespaces ---
     Namespaces: enabled
     Utsname namespace: enabled
     Ipc namespace: enabled
     Pid namespace: enabled
     User namespace: enabled
     Network namespace: enabled

     --- Control groups ---
     Cgroups: enabled
     Cgroup namespace: enabled

     Cgroup v1 mount points:

     Cgroup v2 mount points:
     /sys/fs/cgroup

     Cgroup v1 systemd controller: missing
     Cgroup v1 freezer controller: missing
     Cgroup ns_cgroup: required
     Cgroup device: enabled
     Cgroup sched: enabled
     Cgroup cpu account: enabled
     Cgroup memory controller: enabled
     Cgroup cpuset: enabled

     --- Misc ---
     Veth pair device: enabled, not loaded
     Macvlan: enabled, not loaded
     Vlan: enabled, not loaded
     Bridges: enabled, loaded
     Advanced netfilter: enabled, loaded
     CONFIG_IP_NF_TARGET_MASQUERADE: enabled, not loaded
     CONFIG_IP6_NF_TARGET_MASQUERADE: enabled, not loaded
     CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled, not loaded
     CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled, not loaded
     FUSE (for use with lxcfs): enabled, not loaded

     --- Checkpoint/Restore ---
     checkpoint restore: enabled
     CONFIG_FHANDLE: enabled
     CONFIG_EVENTFD: enabled
     CONFIG_EPOLL: enabled
     CONFIG_UNIX_DIAG: enabled
     CONFIG_INET_DIAG: enabled
     CONFIG_PACKET_DIAG: enabled
     CONFIG_NETLINK_DIAG: enabled
     File capabilities:

     Note : Before booting a new kernel, you can check its configuration
     usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig
     ```
   * `uname -a`
     ```
     Linux q1 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
     ```
   * `cat /proc/self/cgroup`
     ```
     0::/user.slice/user-1000.slice/session-3.scope
     ```
   * `cat /proc/1/mounts`
     ```
     sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
     proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
     udev /dev devtmpfs rw,nosuid,relatime,size=263874368k,nr_inodes=65968592,mode=755,inode64 0 0
     devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
     tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime,size=52797032k,mode=755,inode64 0 0
     efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
     /dev/nvme0n1p2 / ext4 rw,relatime,stripe=32 0 0
     securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
     tmpfs /dev/shm tmpfs rw,nosuid,nodev,inode64 0 0
     tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,inode64 0 0
     cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
     pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
     bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
     systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=92734 0 0
     hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
     mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
     debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
     tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
     fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
     configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
     ramfs /run/credentials/systemd-sysusers.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0
     /dev/nvme0n1p1 /boot/efi vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
     /dev/loop0 /snap/core20/1852 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
     /dev/loop1 /snap/core20/1974 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
     /dev/loop2 /snap/core22/858 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
     /dev/loop4 /snap/snapd/18596 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
     /dev/loop3 /snap/lxd/25112 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
     /dev/loop5 /snap/snapd/19457 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
     binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
     tmpfs /run/snapd/ns tmpfs rw,nosuid,nodev,noexec,relatime,size=52797032k,mode=755,inode64 0 0
     nsfs /run/snapd/ns/lxd.mnt nsfs rw 0 0
     tmpfs /var/snap/lxd/common/ns tmpfs rw,relatime,size=1024k,mode=700,inode64 0 0
     nsfs /var/snap/lxd/common/ns/shmounts nsfs rw 0 0
     nsfs /var/snap/lxd/common/ns/mntns nsfs rw 0 0
     tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=52797028k,nr_inodes=13199257,mode=700,uid=1000,gid=1000,inode64 0 0
     lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
     ```

Issue description

I have an Ubuntu 22.04 system with multiple GPUs and the NVIDIA 535 server drivers installed; persistence mode is off. I pass individual GPUs to a VM via PCI passthrough. When I pass a single GPU to the VM, start it and then stop it, the GPU is not returned to the host system (i.e. nvidia-smi does not show it anymore). When I pass multiple GPUs to the VM, start it and then stop it, the GPU with the lowest PCI address on the host is not returned to the host system (i.e. nvidia-smi does not show it anymore), but the other GPUs get returned just fine.

When I start the VMs again, the GPUs are visible inside the VM, but if I start a container with nvidia-driver passthrough, only the GPUs that are currently visible on the host (i.e. all installed minus those that were not returned from the VMs earlier) are visible in the container. The only information I can find is that syslog says "Failed to stop device".
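A quick way to see whether a GPU went back to the host driver is to look at which kernel driver its PCI function is bound to. The following is a minimal sketch, not part of the original report: `pci_driver` is a hypothetical helper name, and `SYSFS_PCI` is a hypothetical override that exists only so the function can be pointed at a fake tree. On a real host it reads the standard `/sys/bus/pci/devices/<address>/driver` symlink.

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI function, or "none"
# if no driver is bound. SYSFS_PCI is a hypothetical override for testing;
# it defaults to the real sysfs location.
pci_driver() {
    addr="$1"
    root="${SYSFS_PCI:-/sys/bus/pci}"
    link="$root/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name (e.g. nvidia, vfio-pci).
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example (address taken from the single-GPU log in this report):
#   pci_driver 0000:ca:00.0
```

After a clean VM stop one would expect the host driver name here; a device left on `vfio-pci` or reported as `none` would be consistent with the "Failed to stop device" message.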

Steps to reproduce

  1. run nvidia-smi -L on host
  2. create VM with single GPU via passthrough
  3. start VM
  4. stop VM
  5. run nvidia-smi -L on host (the GPU that was passed through to the VM will not be listed)
  6. create VM with multiple GPUs via passthrough
  7. start VM
  8. stop VM
  9. run nvidia-smi -L on host (the GPU with the lowest PCI address on the host that was passed through to the VM will also not be listed)
  10. run container with nvidia-driver passthrough (same status as on the host)
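The before/after comparison in the steps above can be scripted. This is a sketch under assumptions: `gpu_uuids` and `missing_gpus` are hypothetical helper names, and the parsing assumes the usual `nvidia-smi -L` line shape of `GPU <n>: <name> (UUID: <uuid>)`. Comparing by UUID rather than by whole line avoids false positives when GPU indices shift after a device disappears.

```shell
#!/bin/sh
# Extract the GPU UUIDs from `nvidia-smi -L` style output read on stdin.
gpu_uuids() {
    sed -n 's/.*(UUID: \([^)]*\)).*/\1/p' | sort
}

# Print UUIDs present in the "before" listing but absent from "after".
# Both arguments are files holding saved `nvidia-smi -L` output.
missing_gpus() {
    before_ids=$(mktemp)
    after_ids=$(mktemp)
    gpu_uuids < "$1" > "$before_ids"
    gpu_uuids < "$2" > "$after_ids"
    # comm -23 keeps lines that appear only in the first (sorted) file.
    comm -23 "$before_ids" "$after_ids"
    rm -f "$before_ids" "$after_ids"
}

# Example:
#   nvidia-smi -L > before.txt
#   ... start and stop the VM ...
#   nvidia-smi -L > after.txt
#   missing_gpus before.txt after.txt
```

An empty result means every GPU that was listed before the VM run is listed again afterwards.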

Information to attach

Click to see full - [x] VM log (`lxc info --show-log vm2`) ``` Name: vm2 Status: STOPPED Type: virtual-machine Architecture: x86_64 Created: 2023/08/07 22:14 UTC Last Used: 2023/08/07 23:19 UTC Log: qemu-system-x86_64: Issue while setting TUNSETSTEERINGEBPF: Invalid argument with fd: 83, prog_fd: -1 ``` - [x] any relevant kernel output (`syslog`), the single GPU case ``` Aug 7 23:18:53 q1 kernel: [ 846.086912] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none Aug 7 23:18:53 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data Aug 7 23:18:54 q1 kernel: [ 846.548326] xhci_hcd 0000:ca:00.2: remove, state 4 Aug 7 23:18:54 q1 kernel: [ 846.548343] usb usb10: USB disconnect, device number 1 Aug 7 23:18:54 q1 kernel: [ 846.549060] xhci_hcd 0000:ca:00.2: USB bus 10 deregistered Aug 7 23:18:54 q1 kernel: [ 846.549083] xhci_hcd 0000:ca:00.2: remove, state 4 Aug 7 23:18:54 q1 kernel: [ 846.549091] usb usb9: USB disconnect, device number 1 Aug 7 23:18:54 q1 kernel: [ 846.550896] xhci_hcd 0000:ca:00.2: USB bus 9 deregistered Aug 7 23:18:54 q1 kernel: [ 846.653021] kauditd_printk_skb: 9 callbacks suppressed Aug 7 23:18:54 q1 kernel: [ 846.653026] audit: type=1400 audit(1691450334.129:54): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-vm2_" pid=5316 comm="apparmor_parser" Aug 7 23:18:53 q1 snapd[2334]: message repeated 3 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data] Aug 7 23:18:55 q1 systemd[3823]: Started snap.lxd.lxc.b9b13195-c7c3-46d4-842a-856565db2c99.scope. Aug 7 23:19:13 q1 kernel: [ 865.800363] vfio-pci 0000:ca:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258 Aug 7 23:19:13 q1 kernel: [ 865.800386] vfio-pci 0000:ca:00.0: vfio_ecap_init: hiding ecap 0x19@0x900 Aug 7 23:19:46 q1 systemd[3823]: Started snap.lxd.lxc.0a424bc8-95d2-4cb9-bdd0-468d3dbce737.scope. 
Aug 7 23:19:51 q1 systemd[3823]: Started snap.lxd.lxc.63564057-7dd7-462c-9548-3a5153ddd1e7.scope. Aug 7 23:19:51 q1 systemd[1]: Starting Cleanup of Temporary Directories... Aug 7 23:19:51 q1 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully. Aug 7 23:19:51 q1 systemd[1]: Finished Cleanup of Temporary Directories. Aug 7 23:19:54 q1 kernel: [ 907.246377] vfio-pci 0000:ca:00.0: Relaying device request to user (#0) Aug 7 23:20:01 q1 kernel: [ 913.710624] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:20:01 q1 kernel: [ 913.711376] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:20:01 q1 lxd.daemon[3076]: time="2023-08-07T23:20:01Z" level=error msg="Failed to stop device" device=gpu3 err="Failed probing device \"0000:ca:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=vm2 instanceType=virtual-machine project=default Aug 7 23:20:01 q1 systemd-networkd[2222]: mac6293c2ac: Link DOWN Aug 7 23:20:01 q1 systemd-networkd[2222]: mac6293c2ac: Lost carrier Aug 7 23:20:01 q1 kernel: [ 913.898141] audit: type=1400 audit(1691450401.373:55): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-vm2_" pid=10366 comm="apparmor_parser" Aug 7 23:32:42 q1 systemd[3823]: Started snap.lxd.lxc.44a0582a-97eb-4f56-9149-a7b6f2afec5b.scope. 
``` - [x] any relevant kernel output (`syslog`), two GPU case ``` Aug 7 23:45:38 q1 kernel: [ 2450.861745] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none Aug 7 23:45:38 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data Aug 7 23:45:38 q1 kernel: [ 2451.339448] xhci_hcd 0000:17:00.2: remove, state 4 Aug 7 23:45:38 q1 kernel: [ 2451.339464] usb usb4: USB disconnect, device number 1 Aug 7 23:45:38 q1 kernel: [ 2451.340164] xhci_hcd 0000:17:00.2: USB bus 4 deregistered Aug 7 23:45:38 q1 kernel: [ 2451.340188] xhci_hcd 0000:17:00.2: remove, state 4 Aug 7 23:45:38 q1 kernel: [ 2451.340197] usb usb3: USB disconnect, device number 1 Aug 7 23:45:38 q1 kernel: [ 2451.341944] xhci_hcd 0000:17:00.2: USB bus 3 deregistered Aug 7 23:45:40 q1 kernel: [ 2453.384621] vfio-pci 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none Aug 7 23:45:41 q1 kernel: [ 2453.867449] xhci_hcd 0000:31:00.2: remove, state 4 Aug 7 23:45:41 q1 kernel: [ 2453.867464] usb usb6: USB disconnect, device number 1 Aug 7 23:45:41 q1 kernel: [ 2453.868123] xhci_hcd 0000:31:00.2: USB bus 6 deregistered Aug 7 23:45:41 q1 kernel: [ 2453.868144] xhci_hcd 0000:31:00.2: remove, state 4 Aug 7 23:45:41 q1 kernel: [ 2453.868151] usb usb5: USB disconnect, device number 1 Aug 7 23:45:41 q1 kernel: [ 2453.869683] xhci_hcd 0000:31:00.2: USB bus 5 deregistered Aug 7 23:45:41 q1 kernel: [ 2453.966981] audit: type=1400 audit(1691451941.446:56): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-vm2_" pid=11010 comm="apparmor_parser" Aug 7 23:46:00 q1 kernel: [ 2472.883434] vfio-pci 0000:17:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258 Aug 7 23:46:00 q1 kernel: [ 2472.883457] vfio-pci 0000:17:00.0: vfio_ecap_init: hiding ecap 0x19@0x900 Aug 7 23:46:00 q1 kernel: [ 2473.055433] vfio-pci 0000:31:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258 Aug 7 23:46:00 q1 
kernel: [ 2473.055455] vfio-pci 0000:31:00.0: vfio_ecap_init: hiding ecap 0x19@0x900 Aug 7 23:45:40 q1 snapd[2334]: message repeated 7 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data] Aug 7 23:47:16 q1 systemd[3823]: Started snap.lxd.lxc.c918eda7-03e8-4d84-9cb2-c9e1b4d6bfa2.scope. Aug 7 23:49:01 q1 kernel: [ 2653.889634] vfio-pci 0000:31:00.0: Relaying device request to user (#0) Aug 7 23:49:08 q1 kernel: [ 2660.602855] vfio-pci 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:49:08 q1 kernel: [ 2660.603292] nvidia 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none Aug 7 23:49:08 q1 kernel: [ 2660.690297] snd_hda_intel 0000:31:00.1: Disabling MSI Aug 7 23:49:08 q1 kernel: [ 2660.690325] snd_hda_intel 0000:31:00.1: Handle vga_switcheroo audio client Aug 7 23:49:08 q1 kernel: [ 2660.714786] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input19 Aug 7 23:49:08 q1 kernel: [ 2660.714916] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input20 Aug 7 23:49:08 q1 kernel: [ 2660.715088] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input21 Aug 7 23:49:08 q1 kernel: [ 2660.715283] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input22 Aug 7 23:49:08 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data Aug 7 23:49:08 q1 kernel: [ 2660.726602] xhci_hcd 0000:31:00.2: xHCI Host Controller Aug 7 23:49:08 q1 kernel: [ 2660.726615] xhci_hcd 0000:31:00.2: new USB bus registered, assigned bus number 3 Aug 7 23:49:08 q1 kernel: [ 2660.727221] xhci_hcd 0000:31:00.2: hcc params 0x0180ff05 hci version 0x110 quirks 0x0000000000000010 Aug 7 23:49:08 q1 kernel: [ 2660.727606] xhci_hcd 
0000:31:00.2: xHCI Host Controller Aug 7 23:49:08 q1 kernel: [ 2660.727610] xhci_hcd 0000:31:00.2: new USB bus registered, assigned bus number 4 Aug 7 23:49:08 q1 kernel: [ 2660.727613] xhci_hcd 0000:31:00.2: Host supports USB 3.1 Enhanced SuperSpeed Aug 7 23:49:08 q1 kernel: [ 2660.727661] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.02 Aug 7 23:49:08 q1 kernel: [ 2660.727664] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 7 23:49:08 q1 kernel: [ 2660.727666] usb usb3: Product: xHCI Host Controller Aug 7 23:49:08 q1 kernel: [ 2660.727668] usb usb3: Manufacturer: Linux 6.2.0-26-generic xhci-hcd Aug 7 23:49:08 q1 kernel: [ 2660.727669] usb usb3: SerialNumber: 0000:31:00.2 Aug 7 23:49:08 q1 kernel: [ 2660.727830] hub 3-0:1.0: USB hub found Aug 7 23:49:08 q1 kernel: [ 2660.727837] hub 3-0:1.0: 2 ports detected Aug 7 23:49:08 q1 kernel: [ 2660.727975] usb usb4: We don't know the algorithms for LPM for this host, disabling LPM. Aug 7 23:49:08 q1 kernel: [ 2660.727993] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.02 Aug 7 23:49:08 q1 kernel: [ 2660.727995] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 7 23:49:08 q1 kernel: [ 2660.727997] usb usb4: Product: xHCI Host Controller Aug 7 23:49:08 q1 kernel: [ 2660.727999] usb usb4: Manufacturer: Linux 6.2.0-26-generic xhci-hcd Aug 7 23:49:08 q1 kernel: [ 2660.728000] usb usb4: SerialNumber: 0000:31:00.2 Aug 7 23:49:08 q1 kernel: [ 2660.728175] hub 4-0:1.0: USB hub found Aug 7 23:49:08 q1 kernel: [ 2660.728184] hub 4-0:1.0: 4 ports detected Aug 7 23:49:08 q1 snapd[2334]: message repeated 3 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data] Aug 7 23:49:08 q1 systemd[3823]: Reached target Sound Card. 
Aug 7 23:49:08 q1 kernel: [ 2660.807453] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:49:08 q1 kernel: [ 2660.807674] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none Aug 7 23:49:08 q1 lxd.daemon[3076]: time="2023-08-07T23:49:08Z" level=error msg="Failed to stop device" device=gpu0 err="Failed probing device \"0000:17:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=vm2 instanceType=virtual-machine project=default Aug 7 23:49:08 q1 systemd-networkd[2222]: mac43379c64: Link DOWN Aug 7 23:49:08 q1 systemd-networkd[2222]: mac43379c64: Lost carrier Aug 7 23:49:08 q1 kernel: [ 2661.011901] audit: type=1400 audit(1691452148.495:57): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-vm2_" pid=13584 comm="apparmor_parser" ``` - [x] the VM configuration file ``` architecture: x86_64 config: agent.nic_config: "true" cloud-init.network-config: | version: 1 config: - type: physical name: eth0 subnets: - type: static ipv4: true address: 10.10.10.10/25 gateway: 10.10.10.1 control: auto - type: nameserver address: - 1.1.1.1 - 1.0.0.1 cloud-init.user-data: | #cloud-config ssh_import_id: [gh:itzsimpl] image.architecture: amd64 image.description: ubuntu 22.04 LTS amd64 (release) (20230729) image.label: release image.os: ubuntu image.release: jammy image.serial: "20230729" image.type: disk-kvm.img image.version: "22.04" limits.cpu: "20" limits.memory: 64GiB security.secureboot: "false" volatile.base_image: c3a32ce371819c4fb845867e8e602ad6a636e211cfaeca448e767de4b415c038 volatile.cloud-init.instance-id: f6fa9720-3024-4574-bbd7-e29a10e14ca0 volatile.eth0.hwaddr: 00:16:3e:73:46:f3 volatile.last_state.power: STOPPED volatile.last_state.ready: "false" volatile.uuid: 114bc8ad-0afb-4732-9911-f2583a3330c4 volatile.uuid.generation: 114bc8ad-0afb-4732-9911-f2583a3330c4 volatile.vsock_id: "1262936222" devices: 
eth0: name: eth0 nictype: macvlan parent: ens97f0np0 type: nic gpu0: gputype: physical pci: "0000:17:00.0" type: gpu gpu1: gputype: physical pci: "0000:31:00.0" type: gpu root: path: / pool: default size: 128GB type: disk ephemeral: false profiles: - default - pub-macvlan - gpu0 - gpu1 stateful: false description: vm2 ```
adamcstephens commented 1 year ago

I think you want to post this at https://github.com/canonical/lxd instead.

stgraber commented 1 year ago

I'm happy to still keep this one open as Incus is very likely to have this exact same issue given where we're at with the fork. But indeed, if you're looking for a reasonably quick resolution and for that fix to be available in LXD, you're better off reporting the issue against LXD.

itzsimpl commented 1 year ago

Just to let you know, I've also opened the issue on canonical/lxd, and there is a little more info there (additional tests that I made on different GPUs and with vGPU drivers), see https://github.com/canonical/lxd/issues/12128.

stgraber commented 7 months ago

Going to poke at that one tomorrow. Sadly the only system I have with multiple NVIDIA GPUs is a box where I have no intention to ever install the binary NVIDIA driver :)

But I do have our other test system which has a single NVIDIA GPU and where I don't mind installing the NVIDIA drivers on the host, so I'm hoping I can reproduce what you're seeing on that one.

stgraber commented 7 months ago

I'm unable to reproduce the described issue with current Incus:

root@argos:~# nvidia-smi
Thu Feb 22 15:41:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:07:00.0 Off |                  Off |
| N/A   93C    P0              67W / 250W |      0MiB / 40960MiB |     41%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@argos:~# incus config show v1
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu jammy amd64 (20240221_07:42)
  image.os: Ubuntu
  image.release: jammy
  image.serial: "20240221_07:42"
  image.type: disk-kvm.img
  image.variant: default
  limits.cpu: "8"
  limits.memory: 8GiB
  volatile.base_image: 22ab00c001e2a464dabf7c813bb448797900ca922bd96a8104a8089584c07e95
  volatile.cloud-init.instance-id: 77946807-0039-4423-b30f-2cba99b265a9
  volatile.eth0.hwaddr: 00:16:3e:30:f2:10
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: f27532d7-eadb-4487-9d07-15dcd1dde1ce
  volatile.uuid.generation: f27532d7-eadb-4487-9d07-15dcd1dde1ce
  volatile.vsock_id: "1338125073"
devices:
  gpu:
    gputype: physical
    pci: "07:00.0"
    type: gpu
ephemeral: false
profiles:
- default
stateful: false
description: ""
root@argos:~# incus start v1
root@argos:~# readlink -f /sys/bus/pci/devices/0000\:07\:00.0/driver
/sys/bus/pci/drivers/vfio-pci
root@argos:~# incus exec v1 bash
Error: VM agent isn't currently running
root@argos:~# incus exec v1 bash
root@v1:~# apt install pciutils
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libpci3 pci.ids
Suggested packages:
  bzip2 wget | curl | lynx-cur
The following NEW packages will be installed:
  libpci3 pci.ids pciutils
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 343 kB of archives.
After this operation, 1581 kB of additional disk space will be used.
Do you want to continue? [Y/n] 
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 pci.ids all 0.0~2022.01.22-1 [251 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libpci3 amd64 1:3.7.0-6 [28.9 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 pciutils amd64 1:3.7.0-6 [63.6 kB]
Fetched 343 kB in 0s (841 kB/s)    
Selecting previously unselected package pci.ids.
(Reading database ... 18356 files and directories currently installed.)
Preparing to unpack .../pci.ids_0.0~2022.01.22-1_all.deb ...
Unpacking pci.ids (0.0~2022.01.22-1) ...
Selecting previously unselected package libpci3:amd64.
Preparing to unpack .../libpci3_1%3a3.7.0-6_amd64.deb ...
Unpacking libpci3:amd64 (1:3.7.0-6) ...
Selecting previously unselected package pciutils.
Preparing to unpack .../pciutils_1%3a3.7.0-6_amd64.deb ...
Unpacking pciutils (1:3.7.0-6) ...
Setting up pci.ids (0.0~2022.01.22-1) ...
Setting up libpci3:amd64 (1:3.7.0-6) ...
Setting up pciutils (1:3.7.0-6) ...
Processing triggers for libc-bin (2.35-0ubuntu3.6) ...
root@v1:~# lspci -nnn | grep -i nvidia
06:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 40GB] [10de:20f1] (rev a1)
root@v1:~# 
exit
root@argos:~# incus stop v1
root@argos:~# readlink -f /sys/bus/pci/devices/0000\:07\:00.0/driver
/sys/bus/pci/drivers/nvidia
root@argos:~# nvidia-smi
Thu Feb 22 15:42:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:07:00.0 Off |                  Off |
| N/A   94C    P0              68W / 250W |      0MiB / 40960MiB |     48%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@argos:~# 

@itzsimpl can you confirm that this is still an issue on current Incus? If so, I may need to find a system with a similar set of GPUs and drivers as you're running since our test system has no such problems.

itzsimpl commented 7 months ago

@stgraber thank you for starting to look into this. Unfortunately, I do not have a system with the same setup available at the moment. Based on the experiments from when we first saw the issue, it may be limited to "older" and "non-datacenter" GPUs, as these load/unload more devices (e.g. Quadro RTX 6000 in our case, see https://github.com/canonical/lxd/issues/12128#issuecomment-1672818544).

FWIW, we also noticed issues with unloading of vGPU drivers. The only workaround that we managed to set up was to remove the devices and rescan the PCI bus once the VM shuts down, but that does not work with vGPU drivers (see https://github.com/canonical/lxd/issues/12128#issuecomment-1705351116).
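For readers unfamiliar with the remove-and-rescan approach, here is a rough sketch of what it usually looks like through sysfs. It is an illustration only and not necessarily the exact procedure from the linked comment: `pci_remove_and_rescan` is a hypothetical helper name and `SYSFS_PCI` is a hypothetical override used for testing. The `remove` and `rescan` files themselves are standard kernel interfaces, and writing to them on a real host requires root.

```shell
#!/bin/sh
# Detach a PCI function from the kernel, then ask the kernel to rescan the
# bus so the device is rediscovered and its default driver can bind again.
# SYSFS_PCI is a hypothetical override; it defaults to the real sysfs path.
pci_remove_and_rescan() {
    addr="$1"
    root="${SYSFS_PCI:-/sys/bus/pci}"
    dev="$root/devices/$addr"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi
    echo 1 > "$dev/remove"
    echo 1 > "$root/rescan"
}

# Example (address taken from the two-GPU log in this report):
#   pci_remove_and_rescan 0000:17:00.0
```

This is disruptive on a live host, so it is best run only after the VM has fully stopped.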

stgraber commented 7 months ago

Okay, so we're going to need to get access to a system with such a GPU to be able to reproduce the issue and look for a fix.

Having multiple devices in the group definitely sounds like it may be the problem but we don't have anything in our lab that behaves that way.
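Assuming "group" here means the IOMMU group, the members of a GPU's group and the driver bound to each can be listed from sysfs. The sketch below is illustrative: `iommu_group_members` is a hypothetical helper name and `SYSFS_PCI` is a hypothetical override for testing. The `iommu_group/devices` directory under each PCI device is a standard kernel interface.

```shell
#!/bin/sh
# List every PCI function that shares an IOMMU group with the given
# address, followed by the driver bound to it ("none" if unbound).
# SYSFS_PCI is a hypothetical override; it defaults to the real sysfs path.
iommu_group_members() {
    addr="$1"
    root="${SYSFS_PCI:-/sys/bus/pci}"
    for member in "$root/devices/$addr/iommu_group/devices"/*; do
        # Skip the literal pattern when the directory is empty or missing.
        [ -e "$member" ] || continue
        name=$(basename "$member")
        if [ -L "$member/driver" ]; then
            driver=$(basename "$(readlink "$member/driver")")
        else
            driver="none"
        fi
        echo "$name $driver"
    done
}

# Example:
#   iommu_group_members 0000:17:00.0
```

On a GPU that exposes audio, USB, or serial functions alongside the display function, this shows at a glance which of them were rebound after the VM stopped.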

Similarly for vGPU, we only have the A100 for that and it uses mdev which doesn't have any such issues.

itzsimpl commented 7 months ago

FWIW, regarding vGPU: we had mdev as well, and the Quadro RTX 6000 is on the list of supported GPUs (https://docs.nvidia.com/grid/gpus-supported-by-vgpu.html). However, on VM shutdown some vGPUs did not get released properly, so repeated VM shutdowns and startups eventually drained the GPU memory, and only a reboot helped (https://github.com/canonical/lxd/issues/12128#issuecomment-1684163228). The drivers were 535.54.03; this is all I can remember or have on file from then, sorry.