harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] vgpu passthrough failure with NVIDIA RTX5000 ADA GPUs #6294

Open ibrokethecloud opened 2 months ago

ibrokethecloud commented 2 months ago

Describe the bug

With certain GPUs, such as the NVIDIA RTX5000 Ada, the vGPU profile names obtained by parsing the vGPU profiles in the /sys tree contain lowercase characters:

NAME                      ADDRESS        NODE NAME       ENABLED   UUID                                   VGPUTYPE                PARENTGPUDEVICE
harvesterdev7-00000b004   0000:0b:00.4   harvesterdev7   true      6f520761-e091-4a9d-b012-dab7ac476f4d   NVIDIA RTX5000-Ada-8Q   0000:0b:00.0
harvesterdev7-00000b005   0000:0b:00.5   harvesterdev7   true      475c556b-4efb-4714-974c-e080715f8a6d   NVIDIA RTX5000-Ada-8Q   0000:0b:00.0
harvesterdev7-00000b006   0000:0b:00.6   harvesterdev7   true      5bbbece9-d755-4e38-a34f-cae18bfeda0b   NVIDIA RTX5000-Ada-8Q   0000:0b:00.0
harvesterdev7-00000b007   0000:0b:00.7   harvesterdev7   true      77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37   NVIDIA RTX5000-Ada-8Q   0000:0b:00.0

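The device plugin set up by the pcidevices controller converts device types to an uppercase string, and the same conversion is applied when setting up the permitted devices in the KubeVirt CR. The node's allocatable resources and the resulting permitted-device entry look like this: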

cpu: "64"
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 153707984Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 115285220Ki
nvidia.com/NVIDIA_RTX5000-ADA-8Q: "8"
pods: "200"

      mediatedDevices:
      - externalResourceProvider: true
        mdevNameSelector: NVIDIA RTX5000-Ada-8Q
        resourceName: nvidia.com/NVIDIA_RTX5000-ADA-8Q
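
The conversion can be sketched as follows. This is a minimal illustration inferred only from the output above (`NVIDIA RTX5000-Ada-8Q` becoming `nvidia.com/NVIDIA_RTX5000-ADA-8Q`), not the actual pcidevices controller code; the helper name `resource_name` and the default prefix are assumptions made for the example.

```python
def resource_name(vgpu_type: str, prefix: str = "nvidia.com") -> str:
    """Hypothetical helper: derive a resource name from a vGPU type name.

    Assumed rule, inferred from the issue output: spaces become
    underscores and the whole type name is upper-cased.
    """
    normalized = vgpu_type.strip().replace(" ", "_").upper()
    return f"{prefix}/{normalized}"


if __name__ == "__main__":
    # vGPU type as reported in the VGPUTYPE column above
    print(resource_name("NVIDIA RTX5000-Ada-8Q"))
```

Any component that builds a `deviceName` for a VM would need to apply the same rule to produce a name that matches the permitted device.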

The Harvester UI also generates the resource name, but it does not convert the deviceName to uppercase. The VM therefore cannot be scheduled, because the device requested in the VM spec is not a permitted device in the KubeVirt CR.

      domain:
        cpu:
          cores: 8
        devices:
          disks:
          - disk:
              bus: virtio
            name: cloudinitdisk
          - bootOrder: 1
            disk:
              bus: virtio
            name: disk-0
          gpus:
          - deviceName: nvidia.com/NVIDIA_RTX5000-Ada-8Q
            name: harvesterdev7-00000b004

As a result, the VM never schedules, and the user has to manually edit the VM spec and convert the deviceName to an uppercase string.
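
The manual workaround amounts to upper-casing the part of each `deviceName` after the `nvidia.com/` prefix. A rough sketch, assuming the VM's `domain` section has been loaded into a plain dictionary; the function names are hypothetical and this is not how the Harvester UI or controller handles it:

```python
def normalize_device_name(device_name: str) -> str:
    """Upper-case the name after the first '/', leaving the prefix as-is."""
    prefix, sep, name = device_name.partition("/")
    if not sep:
        # No "<prefix>/" part found; leave the value untouched.
        return device_name
    return f"{prefix}/{name.upper()}"


def fix_gpu_device_names(domain: dict) -> dict:
    """Normalize every gpus[].deviceName in a VM domain dict (in place)."""
    for gpu in domain.get("devices", {}).get("gpus", []):
        gpu["deviceName"] = normalize_device_name(gpu["deviceName"])
    return domain


if __name__ == "__main__":
    domain = {
        "devices": {
            "gpus": [
                {
                    "deviceName": "nvidia.com/NVIDIA_RTX5000-Ada-8Q",
                    "name": "harvesterdev7-00000b004",
                }
            ]
        }
    }
    print(fix_gpu_device_names(domain)["devices"]["gpus"][0]["deviceName"])
```

Only the portion after the slash is changed, so the `nvidia.com` prefix stays lowercase as in the permitted-device entry.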

To Reproduce

Steps to reproduce the behavior:

  1. Set up an NVIDIA RTX5000 Ada GPU
  2. Enable PCIDevices and the nvidia driver toolkit
  3. Enable the sriovgpu device
  4. Configure a vGPU profile
  5. Pass the vGPU through to a VM

The VM will not schedule.

Expected behavior

The VM should schedule, and the vGPU should be passed through successfully.

Support bundle

Environment

Additional context

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.