lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Nvidia GPU not showing up in /dev after passing through #1322

Closed svenhendrikx closed 3 weeks ago

svenhendrikx commented 3 weeks ago

When trying to pass through my GPU to my development instance (dev), which I have done successfully before, it does not show up when listing /dev. When setting nvidia.runtime=true, the container does not start.

Steps to reproduce

  1. Create container: incus launch images:archlinux/cloud --profile default gputest
  2. Add a GPU device to the container: incus config device add gputest mygpu gpu
  3. Inside container: ls /dev
    console  core  dri  fd full  fuse  incus  log  mqueue  net  null  ptmx  pts  random  shm  stderr  stdin  stdout  tty  urandom  zero

    output of incus info --show-log gputest

    
    Name: gputest
    Status: RUNNING
    Type: container
    Architecture: x86_64
    PID: 23791
    Created: 2024/10/19 20:23 CEST
    Last Used: 2024/10/19 20:30 CEST
    Started: 2024/10/19 20:30 CEST

    Resources:
      Processes: 19
      CPU usage:
        CPU usage (in seconds): 1
      Memory usage:
        Memory (current): 54.97MiB
      Network usage:
        eth0:
          Type: broadcast
          State: UP
          Host interface: veth68fb752c
          MAC address: 00:16:3e:88:9a:6d
          MTU: 1500
          Bytes received: 354B
          Bytes sent: 775B
          Packets received: 1
          Packets sent: 7
          IP addresses:
            inet: 10.198.92.110/24 (global)
            inet6: fe80::216:3eff:fe88:9a6d/64 (link)
        lo:
          Type: loopback
          State: UP
          MTU: 65536
          Bytes received: 0B
          Bytes sent: 0B
          Packets received: 0
          Packets sent: 0
          IP addresses:
            inet: 127.0.0.1/8 (local)
            inet6: ::1/128 (local)

    Log:

  4. Try setting `nvidia.runtime=true`: `incus config set gputest nvidia.runtime=true`. Starting the container then fails with:

    Error: Failed to run: /usr/bin/incusd forkstart gputest /var/lib/incus/containers /run/incus/gputest/lxc.conf: exit status 1
    Try incus info --show-log gputest for more info

output of `incus info --show-log gputest`:

    Name: gputest
    Status: STOPPED
    Type: container
    Architecture: x86_64
    Created: 2024/10/19 20:23 CEST
    Last Used: 2024/10/19 20:29 CEST

    Log:

    lxc gputest 20241019182947.171 ERROR utils - ../src/lxc/utils.c:run_buffer:571 - Script exited with status 1
    lxc gputest 20241019182947.171 ERROR conf - ../src/lxc/conf.c:lxc_setup:3940 - Failed to run mount hooks
    lxc gputest 20241019182947.171 ERROR start - ../src/lxc/start.c:do_start:1273 - Failed to setup container "gputest"
    lxc gputest 20241019182947.171 ERROR sync - ../src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
    lxc gputest 20241019182947.179 WARN network - ../src/lxc/network.c:lxc_delete_network_priv:3674 - Failed to rename interface with index 0 from "eth0" to its initial name "vethb99950f1"
    lxc gputest 20241019182947.179 ERROR start - ../src/lxc/start.c:__lxc_start:2114 - Failed to spawn container "gputest"
    lxc gputest 20241019182947.179 ERROR lxccontainer - ../src/lxc/lxccontainer.c:wait_on_daemonized_start:837 - Received container state "ABORTING" instead of "RUNNING"
    lxc gputest 20241019182947.180 WARN start - ../src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 17 for process 23303
    lxc 20241019182947.325 ERROR af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
    lxc 20241019182947.326 ERROR commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"

 5. Step three

# Information to attach

 - [x] Any relevant kernel output (`dmesg`) (Nothing relevant I believe)
 - [x] Container log (`incus info NAME --show-log`)
 - [x] Container configuration (`incus config show NAME --expanded`)

    architecture: x86_64
    config:
      image.architecture: amd64
      image.description: Archlinux current amd64 (20241018_04:18)
      image.os: Archlinux
      image.release: current
      image.requirements.secureboot: "false"
      image.serial: "20241018_04:18"
      image.type: squashfs
      image.variant: cloud
      nvidia.runtime: "false"
      security.secureboot: "false"
      volatile.base_image: 81330bba5b9337ec0ec8852efb4e437487312d088657bb2a0845f74299bb49df
      volatile.cloud-init.instance-id: 7861d41e-76fe-4be5-b143-b265f395dd84
      volatile.eth0.host_name: veth68fb752c
      volatile.eth0.hwaddr: 00:16:3e:88:9a:6d
      volatile.idmap.base: "0"
      volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
      volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
      volatile.last_state.idmap: '[]'
      volatile.last_state.power: RUNNING
      volatile.last_state.ready: "false"
      volatile.uuid: b127eb57-3645-435a-81d7-a9340a349b84
      volatile.uuid.generation: b127eb57-3645-435a-81d7-a9340a349b84
    devices:
      eth0:
        name: eth0
        network: lxdbr0
        type: nic
      mygpu:
        type: gpu
      root:
        path: /
        pool: default
        type: disk
    ephemeral: false
    profiles:

stgraber commented 3 weeks ago

It looks like something got passed at least, can you show:

svenhendrikx commented 3 weeks ago

Hi Stephane,

Thanks for the quick response, here are the outputs:

incus exec gputest -- ls -lh /dev/dri/

total 0
crw-rw---- 1 root root 226,   0 Oct 19 18:52 card0
crw-rw-rw- 1 root root 226, 128 Oct 19 18:52 renderD128

incus info --resources

System:
  UUID: 9f061de9-d9d3-e9c3-0562-fc3497c07ee9
  Vendor: ASUS
  Product: System Product Name
  Family: To be filled by O.E.M.
  Version: System Version
  SKU: SKU
  Serial: System Serial Number
  Type: physical
  Chassis:
      Vendor: Default string
      Type: Desktop
      Version: Default string
      Serial: Default string
  Motherboard:
      Vendor: ASUSTeK COMPUTER INC.
      Product: PRIME H510M-A
      Serial: 210484246000126
      Version: Rev 1.xx
  Firmware:
      Vendor: American Megatrends Inc.
      Version: 0406
      Date: 03/17/2021

Load:
  Processes: 512
  Average: 0.03 0.16 0.16

CPU:
  Architecture: x86_64
  Vendor: GenuineIntel
  Name: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
  Caches:
    - Level 1 (type: Data): 32KiB
    - Level 1 (type: Instruction): 32KiB
    - Level 2 (type: Unified): 256KiB
    - Level 3 (type: Unified): 12MiB
  Cores:
    - Core 0
      Frequency: 3715Mhz
      Threads:
        - 0 (id: 0, online: true, NUMA node: 0)
        - 1 (id: 6, online: true, NUMA node: 0)
    - Core 1
      Frequency: 3811Mhz
      Threads:
        - 0 (id: 1, online: true, NUMA node: 0)
        - 1 (id: 7, online: true, NUMA node: 0)
    - Core 2
      Frequency: 1979Mhz
      Threads:
        - 0 (id: 2, online: true, NUMA node: 0)
        - 1 (id: 8, online: true, NUMA node: 0)
    - Core 3
      Frequency: 3602Mhz
      Threads:
        - 0 (id: 3, online: true, NUMA node: 0)
        - 1 (id: 9, online: true, NUMA node: 0)
    - Core 4
      Frequency: 3613Mhz
      Threads:
        - 0 (id: 4, online: true, NUMA node: 0)
        - 1 (id: 10, online: true, NUMA node: 0)
    - Core 5
      Frequency: 3704Mhz
      Threads:
        - 0 (id: 5, online: true, NUMA node: 0)
        - 1 (id: 11, online: true, NUMA node: 0)
  Frequency: 3404Mhz (min: 800Mhz, max: 4300Mhz)

Memory:
  Free: 12.13GiB
  Used: 3.87GiB
  Total: 16.00GiB

GPU:
  NUMA node: 0
  Vendor: NVIDIA Corporation (10de)
  Product: GA104 [GeForce RTX 3060 Ti Lite Hash Rate] (2489)
  PCI address: 0000:01:00.0
  Driver: nvidia (560.35.03)
  DRM:
    ID: 0
    Card: card0 (226:0)
    Control: controlD64 (226:0)
    Render: renderD128 (226:128)
  NVIDIA information:
    Architecture:
    Brand: NVIDIA
    Model: NVIDIA GeForce RTX 3060 Ti
    CUDA Version:
    NVRM Version:
    UUID:

NIC:
  NUMA node: 0
  Vendor: Intel Corporation (8086)
  Product: Ethernet Connection (14) I219-V (15fa)
  PCI address: 0000:00:1f.6
  Driver: e1000e (6.11.4-arch1-1)
  Ports:
    - Port 0 (ethernet)
      ID: eno1
      Address: fc:34:97:c0:7e:e9
      Supported modes: 10baseT/Half, 10baseT/Full, 100baseT/Half, 100baseT/Full, 1000baseT/Full
      Supported ports: twisted pair
      Port type: twisted pair
      Transceiver type: internal
      Auto negotiation: true
      Link detected: true
      Link speed: 1000Mbit/s (full duplex)

Disks:
  Disk 0:
    NUMA node: 0
    ID: nvme0n1
    Device: 259:0
    Model: Samsung SSD 980 1TB
    Type: nvme
    Size: 931.51GiB
    WWN: eui.002538d821a1cb0c
    Read-Only: false
    Removable: false
    Partitions:
      - Partition 1
        ID: nvme0n1p1
        Device: 259:1
        Read-Only: false
        Size: 1022.00MiB
      - Partition 2
        ID: nvme0n1p2
        Device: 259:2
        Read-Only: false
        Size: 4.00GiB
      - Partition 3
        ID: nvme0n1p3
        Device: 259:3
        Read-Only: false
        Size: 922.51GiB
      - Partition 4
        ID: nvme0n1p4
        Device: 259:4
        Read-Only: false
        Size: 4.00GiB
  Disk 1:
    NUMA node: 0
    ID: sda
    Device: 8:0
    Model: WDC WD40EFPX-68C6CN0
    Type: sata
    Size: 3.64TiB
    Read-Only: false
    Removable: false
  Disk 2:
    NUMA node: 0
    ID: sdb
    Device: 8:16
    Model: WDC WD40EFAX-68JH4N1
    Type: sata
    Size: 3.64TiB
    Read-Only: false
    Removable: false

USB devices:
  Device 0:
    Vendor:
    Vendor ID: 413c
    Product: DELL USB Keyboard
    Product ID: 2005
    Bus Address: 1
    Device Address: 2
  Device 1:
    Vendor:
    Vendor ID: 0b05
    Product: AURA LED Controller
    Product ID: 19af
    Bus Address: 1
    Device Address: 3

PCI devices:
  Device 0:
    Address: 0000:00:00.0
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Comet Lake-S 6c Host Bridge/DRAM Controller
    Product ID: 9b53
    NUMA node: 0
    IOMMU group: 0
    Driver: skl_uncore
  Device 1:
    Address: 0000:00:01.0
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: 6th-10th Gen Core Processor PCIe Controller (x16)
    Product ID: 1901
    NUMA node: 0
    IOMMU group: 1
    Driver: pcieport
  Device 2:
    Address: 0000:00:14.0
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Tiger Lake-H USB 3.2 Gen 2x1 xHCI Host Controller
    Product ID: 43ed
    NUMA node: 0
    IOMMU group: 2
    Driver: xhci_hcd
  Device 3:
    Address: 0000:00:14.2
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Tiger Lake-H Shared SRAM
    Product ID: 43ef
    NUMA node: 0
    IOMMU group: 2
    Driver:
  Device 4:
    Address: 0000:00:15.0
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Tiger Lake-H Serial IO I2C Controller #0
    Product ID: 43e8
    NUMA node: 0
    IOMMU group: 3
    Driver: intel-lpss
  Device 5:
    Address: 0000:00:16.0
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Tiger Lake-H Management Engine Interface
    Product ID: 43e0
    NUMA node: 0
    IOMMU group: 4
    Driver: mei_me
  Device 6:
    Address: 0000:00:17.0
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product:
    Product ID: 43d2
    NUMA node: 0
    IOMMU group: 5
    Driver: ahci
  Device 7:
    Address: 0000:00:1c.0
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Tiger Lake-H PCI Express Root Port #5
    Product ID: 43bc
    NUMA node: 0
    IOMMU group: 6
    Driver: pcieport
  Device 8:
    Address: 0000:00:1f.0
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: H510 LPC/eSPI Controller
    Product ID: 4388
    NUMA node: 0
    IOMMU group: 7
    Driver:
  Device 9:
    Address: 0000:00:1f.3
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product:
    Product ID: f0c8
    NUMA node: 0
    IOMMU group: 7
    Driver: snd_hda_intel
  Device 10:
    Address: 0000:00:1f.4
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Tiger Lake-H SMBus Controller
    Product ID: 43a3
    NUMA node: 0
    IOMMU group: 7
    Driver: i801_smbus
  Device 11:
    Address: 0000:00:1f.5
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Tiger Lake-H SPI Controller
    Product ID: 43a4
    NUMA node: 0
    IOMMU group: 7
    Driver: intel-spi
  Device 12:
    Address: 0000:00:1f.6
    Vendor: Intel Corporation
    Vendor ID: 8086
    Product: Ethernet Connection (14) I219-V
    Product ID: 15fa
    NUMA node: 0
    IOMMU group: 7
    Driver: e1000e
  Device 13:
    Address: 0000:01:00.0
    Vendor: NVIDIA Corporation
    Vendor ID: 10de
    Product: GA104 [GeForce RTX 3060 Ti Lite Hash Rate]
    Product ID: 2489
    NUMA node: 0
    IOMMU group: 1
    Driver: nvidia
  Device 14:
    Address: 0000:01:00.1
    Vendor: NVIDIA Corporation
    Vendor ID: 10de
    Product: GA104 High Definition Audio Controller
    Product ID: 228b
    NUMA node: 0
    IOMMU group: 1
    Driver: snd_hda_intel
  Device 15:
    Address: 0000:02:00.0
    Vendor: Samsung Electronics Co Ltd
    Vendor ID: 144d
    Product: NVMe SSD Controller 980 (DRAM-less)
    Product ID: a809
    NUMA node: 0
    IOMMU group: 8
    Driver: nvme

stgraber commented 3 weeks ago

Okay, so the card was passed through, or at least its DRI/DRM devices were.

The remaining NVIDIA-specific devices likely aren't passed through because there's something wrong with your system which is preventing it from getting the needed information:

GPU:
  NUMA node: 0
  Vendor: NVIDIA Corporation (10de)
  Product: GA104 [GeForce RTX 3060 Ti Lite Hash Rate] (2489)
  PCI address: 0000:01:00.0
  Driver: nvidia (560.35.03)
  DRM:
    ID: 0
    Card: card0 (226:0)
    Control: controlD64 (226:0)
    Render: renderD128 (226:128)
  NVIDIA information:
    Architecture:
    Brand: NVIDIA
    Model: NVIDIA GeForce RTX 3060 Ti
    CUDA Version:
    NVRM Version:
    UUID:

Looks like you may need to install https://archlinux.org/packages/extra/x86_64/nvidia-container-toolkit/
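
The symptom pointed out above is that several fields under "NVIDIA information:" are blank in the `incus info --resources` output. As a rough way to spot this on another host, the sketch below lists the blank fields in that block. The function name `empty_nvidia_fields` is hypothetical, and the parsing assumes the indented layout shown in the quoted output; it is not part of Incus.

```shell
# Sketch only: print the names of fields under "NVIDIA information:"
# that have no value. Reads `incus info --resources` text on stdin.
# The function name is hypothetical; the layout is assumed from the
# output quoted in this thread.
empty_nvidia_fields() {
    awk '
        # Number of leading spaces on a line
        function indent(s) { match(s, /^ */); return RLENGTH }

        # Start of the block: remember its indentation
        /NVIDIA information:[[:space:]]*$/ { base = indent($0); in_block = 1; next }

        # A blank line, or a line indented no deeper than the header, ends the block
        in_block && (NF == 0 || indent($0) <= base) { in_block = 0 }

        # Inside the block, a key with nothing after the colon is blank
        in_block && /:[[:space:]]*$/ {
            line = $0
            sub(/^ +/, "", line)
            sub(/:[[:space:]]*$/, "", line)
            print line
        }
    '
}
```

Typical use would be `incus info --resources | empty_nvidia_fields`. No output suggests the fields are populated; any names printed are the fields Incus could not fill in.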

svenhendrikx commented 3 weeks ago

That fixed it! It also resolved the issue I was having with containers not booting when setting nvidia.runtime=true, so no more (failing at) aligning nvidia runtime driver versions. Very happy with that.

Thank you @stgraber, very quick and helpful as usual.
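
For anyone landing here with the same symptoms, the resolution in this thread can be sketched as the sequence below. It assumes an Arch Linux host (as in this report) and a running instance; `step` and `apply_gpu_fix` are hypothetical helper names, and by default the script only prints the commands rather than running them.

```shell
# Hypothetical helper: print a command, and run it only when RUN=1.
step() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Sketch of the fix from this thread for one instance (name passed as $1).
# Assumes an Arch Linux host and a running instance; use `incus start`
# instead of `incus restart` if the instance is stopped.
apply_gpu_fix() {
    instance="$1"
    # Package suggested in the thread
    step sudo pacman -S --needed nvidia-container-toolkit
    # Setting that previously made the container fail to start
    step incus config set "$instance" nvidia.runtime=true
    step incus restart "$instance"
    # Check whether NVIDIA device nodes now appear next to dri
    step incus exec "$instance" -- ls /dev
}
```

Calling `apply_gpu_fix gputest` prints the four commands for review; `RUN=1 apply_gpu_fix gputest` would execute them.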