mbilker / vgpu_unlock-rs

Unlock vGPU functionality for consumer grade GPUs
MIT License

NVIDIA A5000 Individual Override #37

Open gentoorax opened 1 week ago

gentoorax commented 1 week ago

I have an NVIDIA A5000 on which the profile overrides in profile_override.toml are working. I've adjusted my VM config to use the correct UUID, and I can see it in the mdevctl list output. For some reason, though, the individual VM configuration override isn't working on my A5000. I have the exact same VM config and vgpu_unlock config working on another host with a Tesla P4, so I'm not sure whether the A5000 behaves slightly differently.

Am I doing something wrong or is this some kind of bug maybe specific to the NVIDIA A series cards? Any advice would be much appreciated.

The only difference between the two that I can think of is that the A5000 is using the much newer 17.x driver (550.54.10). For my Tesla P4 I'm using 16.x (535.161.05), because NVIDIA removed the P4 in the 17.x versions. Could this be it?

lspci -nn | grep -i nvidia

3b:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:00.4 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:00.5 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:00.6 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:00.7 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:01.0 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:01.1 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:01.2 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:01.3 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:01.4 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:01.5 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:01.6 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:01.7 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:02.0 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:02.1 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:02.2 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:02.3 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:02.4 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:02.5 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:02.6 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:02.7 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:03.0 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:03.1 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:03.2 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
3b:03.3 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)

mdevctl list

00000000-0000-0000-0000-000000000903 0000:3b:00.5 nvidia-660 manual

profile_override.toml

[profile.nvidia-660]
num_displays = 1
display_width = 5120
display_height = 1440
max_pixels = 7372800
cuda_enabled = 1
frl_enabled = 1

[mdev.00000000-0000-0000-0000-000000000903]
framebuffer = 0x38000000
framebuffer_reservation = 0x8000000

Also tried the following...

[profile.nvidia-660]
num_displays = 1
display_width = 5120
display_height = 1440
max_pixels = 7372800
cuda_enabled = 1
frl_enabled = 1

[vm.903]
framebuffer = 0x38000000
framebuffer_reservation = 0x8000000

kvm command line for the running VM

/usr/bin/kvm -id 903 -name test5-vmwinwrk,debug-threads=on -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/903.qmp,server=on,wait=off -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5 -mon chardev=qmp-event,mode=control -pidfile /var/run/qemu-server/903.pid -daemonize -smbios type=1,uuid=00000000-0000-0000-0000-000000000903 -drive if=pflash,unit=0,format=raw,readonly=on,file=/usr/share/pve-edk2-firmware//OVMF_CODE_4M.secboot.fd -drive if=pflash,unit=1,id=drive-efidisk0,format=qcow2,file=/mnt/pve/nfs-standard-hdd-mirror-disks-alpha/images/903/vm-903-disk-0.qcow2 -smp 2,sockets=1,cores=2,maxcpus=2 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vnc unix:/var/run/qemu-server/903.vnc,password=on -cpu Westmere,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,vendor=GenuineIntel -m 4096 -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device vmgenid,guid=00000000-0000-0000-0000-000000000903 -device usb-tablet,id=tablet,bus=ehci.0,port=1 -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000903,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0 -device VGA,id=vga,bus=pcie.0,addr=0x1 -chardev socket,path=/var/run/qemu-server/903.qga,server=on,wait=off,id=qga0 -device virtio-serial,id=qga0,bus=pci.0,addr=0x8 -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on -iscsi initiator-name=iqn.1993-08.org.debian:01:e0837067769b -drive if=none,id=drive-ide0,media=cdrom,aio=io_uring -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0 -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 -drive file=/mnt/pve/nfs-standard-hdd-mirror-disks-alpha/images/903/vm-903-disk-1.raw,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100 -netdev type=tap,id=net0,ifname=tap903i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on -device virtio-net-pci,mac=56:6f:fd:7f:00:b5,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256 -rtc driftfix=slew,base=localtime -machine hpet=off,type=pc-q35-8.1+pve0 -global kvm-pit.lost_tick_policy=discard -uuid 00000000-0000-0000-0000-000000000903 -uuid 00000000-0000-0000-0000-000000000903

In any case, this same VM config, with just a different GPU and the same profile_override.toml, is working on my other host. I know the A5000 exposes multiple PCI devices whereas the P4 doesn't (it only has mdev profiles), so I don't know if that has something to do with it.
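For readers following along: in this thread VM 903 appears as mdev UUID 00000000-0000-0000-0000-000000000903, that is, the VMID zero-padded into the last UUID group. The sketch below builds the two table headers tried in profile_override.toml from a VMID. The padding pattern is an assumption taken only from the values shown here, and both helper names are made up for illustration.

```python
def mdev_uuid_for_vmid(vmid: int) -> str:
    """Assumed pattern: VMID zero-padded to 12 digits in the last UUID group."""
    return f"00000000-0000-0000-0000-{vmid:012d}"


def override_sections(vmid: int) -> list:
    """TOML table headers that target one VM, in the two forms used in this issue."""
    return [f"[mdev.{mdev_uuid_for_vmid(vmid)}]", f"[vm.{vmid}]"]


for header in override_sections(903):
    print(header)
```

If the UUID printed here differs from what mdevctl list reports for the VM, the [mdev.…] section would not match that device.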

nvidia-smi output is below; it should be showing 1GB of VRAM in use, not 2GB. I have also tried other framebuffer values such as 0x3B000000, and leaving out framebuffer_reservation. If I move these settings to the profile.X section then they are applied, but obviously globally for that profile.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.10              Driver Version: 550.54.10      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A5000               On  |   00000000:3B:00.0 Off |                    0 |
| 30%   56C    P8             32W /  230W |    1856MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     37652    C+G   vgpu                                         1856MiB |
+-----------------------------------------------------------------------------------------+
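The framebuffer values in the overrides are byte counts written in hex, which makes them hard to compare with the MiB figures nvidia-smi prints. A minimal sketch of the conversion, using only values from this issue (the helper name mib is illustrative):

```python
MIB = 1024 * 1024


def mib(nbytes: int) -> float:
    """Convert a byte count to MiB."""
    return nbytes / MIB


# Values copied from the overrides in this issue.
framebuffer = 0x38000000
framebuffer_reservation = 0x8000000
alt_framebuffer = 0x3B000000

print(mib(framebuffer))                            # 896.0
print(mib(framebuffer_reservation))                # 128.0
print(mib(framebuffer + framebuffer_reservation))  # 1024.0
print(mib(alt_framebuffer))                        # 944.0
```

The first pair adds up to exactly 1 GiB, so a working override would be expected to show well under the 1856MiB reported above.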

nvidia-smi vgpu

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.10              Driver Version: 550.54.10                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA RTX A5000           | 00000000:3B:00.0             |   0%       |
|      3251635139  NVIDIA RTXA... | 0000...  test5-vmwinwrk,d... |      0%    |
+---------------------------------+------------------------------+------------+
mbilker commented 1 week ago

Which version of vgpu_unlock-rs are you on? A recent PR reportedly fixed this issue.

gentoorax commented 6 days ago

@mbilker I believe it should be fairly recent, as I just pulled on the 21st of Sept...

Do you know which commit/PR fixed this? I can take a look. I'm a dev myself (not a Rust developer though).

from Cargo.toml...

# cat Cargo.toml
[package]
name = "vgpu_unlock-rs"
version = "2.5.0"
edition = "2018"

[lib]
crate-type = ["cdylib"]

[dependencies]
ctor = "0.2.7"
libc = "0.2.102"
parking_lot = "0.12.1"
serde = { version = "1.0.130", features = ["derive"] }
toml = "0.8.11"

[features]
# Feature flag to enable syntactic sugar for proxmox users
default = ["proxmox"]
proxmox = []

mbilker commented 5 days ago

It was #35 that should have fixed this. Can you check the log of nvidia-vgpu-mgr.service to see if it is printing the Nv0000CtrlVgpuCreateDeviceParams upon starting a vGPU-enabled VM?

gentoorax commented 5 days ago

profile_override.toml is set like so for this test.

[profile.nvidia-660]
num_displays = 1
display_width = 5120
display_height = 1440
max_pixels = 7372800
cuda_enabled = 1
frl_enabled = 1

[profile.nvidia-662]
num_displays = 1
display_width = 5120
display_height = 1440
max_pixels = 7372800
cuda_enabled = 1
frl_enabled = 1

[mdev.00000000-0000-0000-0000-000000000903]
framebuffer = 0x3B000000
# Test5 VM

[vm.903]
framebuffer = 0x3B000000
# Test5 VM

# 1GB: 0x3B000000

I tried a few combinations of the above, including just mdev. and just vm.. I also tried removing the profiles entirely and having only one or the other of the mdev. and vm. sections.

I can see a "cmd: 0x20801322 failed." line hidden in here. To me it looks like it correctly identifies and tries to apply both overrides (the nvidia-662 profile and vm.903), but they don't seem to actually be applied.
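One way to picture what the journal output in this comment reports: overrides appear to be applied in layers (profile first, then mdev UUID, then Proxmox VMID), each patching the same vGPU type info. The sketch below models that order with plain dictionaries. The order and the mapping from config keys to struct fields are inferred from the "Patching ..." lines in the log, and apply_overrides is an illustrative name, not vgpu_unlock-rs code.

```python
# Config key -> struct field, inferred from the "Patching ..." lines in the log.
FIELD_FOR_KEY = {
    "num_displays": "num_heads",
    "display_width": "max_resolution_x",
    "display_height": "max_resolution_y",
    "max_pixels": "max_pixels",
    "cuda_enabled": "cuda_enabled",
    "frl_enabled": "frl_enable",
    "framebuffer": "fb_length",
}


def apply_overrides(info: dict, layers: list) -> list:
    """Patch `info` in place, layer by layer, and return log-style messages."""
    messages = []
    for label, overrides in layers:
        messages.append(f"Applying {label} overrides")
        for key, new in overrides.items():
            field = FIELD_FOR_KEY[key]
            messages.append(f"Patching {field}: {info[field]} -> {new}")
            info[field] = new
    return messages


# Defaults for vGPU type 662 as printed in the journal.
info = {
    "num_heads": 4, "max_resolution_x": 7680, "max_resolution_y": 4320,
    "max_pixels": 58982400, "cuda_enabled": 1, "frl_enable": 1,
    "fb_length": 0xEC000000,
}
profile = {"num_displays": 1, "display_width": 5120, "display_height": 1440,
           "max_pixels": 7372800, "cuda_enabled": 1, "frl_enabled": 1}
layers = [
    ("profile nvidia-662", profile),
    ("mdev UUID", {"framebuffer": 0x3B000000}),
    ("proxmox VMID 903", {"framebuffer": 0x3B000000}),
]
for line in apply_overrides(info, layers):
    print(line)
```

In this model the later layers win, so listing the same VM under both mdev. and vm. patches fb_length twice with the same value, which matches the "989855744 -> 989855744" line in the log.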

systemctl status nvidia-vgpu-mgr

● nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/nvidia-vgpu-mgr.service.d
             └─vgpu_unlock.conf
     Active: active (running) since Sat 2024-09-21 22:22:51 BST; 2 days ago
    Process: 1631 ExecStart=/usr/bin/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
   Main PID: 1656 (nvidia-vgpu-mgr)
      Tasks: 11 (limit: 269078)
     Memory: 6.9M
        CPU: 36.781s
     CGroup: /system.slice/nvidia-vgpu-mgr.service
             ├─   1656 /usr/bin/nvidia-vgpu-mgr
             ├─1848619 vgpu
             └─1849962 vgpu

Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: Driver Version: 550.54.10
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU BAR1 size 256 MB
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x140001)
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU migration enabled
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: cmd: 0x20801322 failed.
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU manager is running in SRIOV mode.
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: display_init inst: 0 successful

journalctl -u nvidia-vgpu-mgr.service

Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1656]: Nv0000CtrlVgpuGetStartDataParams {
                                                 mdev_uuid: {00000000-0000-0000-0000-000000000903},
                                                 config_params: "vgpu_type_id=662",
                                                 qemu_pid: 1849696,
                                                 gpu_pci_id: 0x3b00,
                                                 vgpu_id: 2,
                                                 gpu_pci_bdf: 15111,
                                             }
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000903 GPU PCI id 00:3b:00.7 config params vgpu_type_id=662
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=662
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_env_log: Successfully updated env symbols!
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): detected a VF at 0:3b:0.7
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                    vgpu_type: 662,
                                                    vgpu_type_info: NvA081CtrlVgpuInfo {
                                                        vgpu_type: 662,
                                                        vgpu_name: "NVIDIA RTXA5000-4Q",
                                                        vgpu_class: "Quadro",
                                                        vgpu_signature: 0x3b000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>
                                                        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                        max_instance: 6,
                                                        num_heads: 4,
                                                        max_resolution_x: 7680,
                                                        max_resolution_y: 4320,
                                                        max_pixels: 58982400,
                                                        frl_config: 60,
                                                        cuda_enabled: 1,
                                                        ecc_supported: 1,
                                                        gpu_instance_size: 0,
                                                        multi_vgpu_supported: 1,
                                                        vdev_id: 0x22311567,
                                                        pdev_id: 0x2231,
                                                        profile_size: 0x100000000,
                                                        fb_length: 0xec000000,
                                                        gsp_heap_size: 0x0,
                                                        fb_reservation: 0x14000000,
                                                        mappable_video_size: 0x400000,
                                                        encoder_capacity: 0x64,
                                                        bar1_length: 0x100,
                                                        frl_enable: 1,
                                                        adapter_name: "NVIDIA RTXA5000-4Q",
                                                        adapter_name_unicode: "NVIDIA RTXA5000-4Q",
                                                        short_gpu_name_string: "GA102GL-A",
                                                        licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                        vgpu_extra_params: 0x007f00007885e7aafd7f00000100000000000000ca8f63b5267c0000b048e7aafd7f0000b049e7aafd7f000090b17bb5267c0000d048e7aafd7f000000000000000000000000000000000000b048e7aafd7f0000ac47e7aafd7f000030ffffffffffffff42d85f2266f60cf67885e7aafd7f00000100000>
                                                        ftrace_enable: 0,
                                                        gpu_direct_supported: 0,
                                                        nvlink_p2p_supported: 0,
                                                        multi_vgpu_exclusive: 0,
                                                        exclusive_type: 0,
                                                        exclusive_size: 0,
                                                        gpu_instance_profile_id: 4294967295,
                                                    },
                                                }
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Applying profile nvidia-662 overrides
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/num_heads: 4 -> 1
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/max_resolution_x: 7680 -> 5120
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/max_resolution_y: 4320 -> 1440
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/max_pixels: 58982400 -> 7372800
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/cuda_enabled: 1 -> 1
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/frl_enable: 1 -> 1
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Applying mdev UUID 00000000-0000-0000-0000-000000000903 profile overrides
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/fb_length: 3959422976 -> 989855744
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Applying proxmox VMID 903 profile overrides
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/fb_length: 989855744 -> 989855744
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: cmd: 0x2080014b failed.
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                    vgpu_type: 662,
                                                    vgpu_type_info: NvA081CtrlVgpuInfo {
                                                        vgpu_type: 662,
                                                        vgpu_name: "NVIDIA RTXA5000-4Q",
                                                        vgpu_class: "Quadro",
                                                        vgpu_signature: [],
                                                        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                        max_instance: 6,
                                                        num_heads: 4,
                                                        max_resolution_x: 7680,
                                                        max_resolution_y: 4320,
                                                        max_pixels: 58982400,
                                                        frl_config: 60,
                                                        cuda_enabled: 1,
                                                        ecc_supported: 1,
                                                        gpu_instance_size: 0,
                                                        multi_vgpu_supported: 1,
                                                        vdev_id: 0x22311567,
                                                        pdev_id: 0x2231,
                                                        profile_size: 0x100000000,
                                                        fb_length: 0xec000000,
                                                        gsp_heap_size: 0x0,
                                                        fb_reservation: 0x14000000,
                                                        mappable_video_size: 0x400000,
                                                        encoder_capacity: 0x64,
                                                        bar1_length: 0x100,
                                                        frl_enable: 1,
                                                        adapter_name: "NVIDIA RTXA5000-4Q",
                                                        adapter_name_unicode: "NVIDIA RTXA5000-4Q",
                                                        short_gpu_name_string: "GA102GL-A",
                                                        licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                        vgpu_extra_params: [],
                                                        ftrace_enable: 0,
                                                        gpu_direct_supported: 0,
                                                        nvlink_p2p_supported: 0,
                                                        multi_vgpu_exclusive: 0,
                                                        exclusive_type: 0,
                                                        exclusive_size: 0,
                                                        gpu_instance_profile_id: 4294967295,
                                                    },
                                                }
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Applying profile nvidia-662 overrides
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/num_heads: 4 -> 1
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/max_resolution_x: 7680 -> 5120
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/max_resolution_y: 4320 -> 1440
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/max_pixels: 58982400 -> 7372800
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/cuda_enabled: 1 -> 1
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: Patching nvidia-662/frl_enable: 1 -> 1
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): gpu-pci-id : 0x3b00
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Framebuffer: 0xec000000
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Virtual Device Id: 0x2231:0x1567
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: ######## vGPU Manager Information: ########
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: Driver Version: 550.54.10
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU BAR1 size 256 MB
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
Sep 24 18:00:22 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x140001)
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU migration enabled
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: cmd: 0x20801322 failed.
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU manager is running in SRIOV mode.
Sep 24 18:00:23 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: display_init inst: 0 successful
Sep 24 18:03:11 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Detected ECC enabled by guest.
Sep 24 18:03:11 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Sep 24 18:03:11 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: Driver Version: 552.74
Sep 24 18:03:11 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: vGPU version: 0x140001
Sep 24 18:03:11 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Sep 24 18:03:46 gamma nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): vGPU license state: Licensed
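The gpu_pci_bdf: 15111 field in the start parameters can be read as a packed PCI address. A minimal sketch, assuming the conventional layout of bus in the high byte, device in the next five bits and function in the low three bits (the function name decode_bdf is illustrative):

```python
def decode_bdf(bdf: int) -> str:
    """Decode a 16-bit packed PCI bus/device/function value."""
    bus = (bdf >> 8) & 0xFF
    device = (bdf >> 3) & 0x1F
    function = bdf & 0x07
    return f"{bus:02x}:{device:02x}.{function}"


print(decode_bdf(15111))  # 3b:00.7
```

Under that assumption the value decodes to 3b:00.7, which agrees with the "detected a VF at 0:3b:0.7" line in the log.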

With just mdev. being used (nothing else in the TOML) it might be easier to see: it looks like it does try to apply the override, but for some reason it doesn't seem to take effect.

Sep 24 18:16:12 gamma nvidia-vgpu-mgr[1656]: Nv0000CtrlVgpuGetStartDataParams {
                                                 mdev_uuid: {00000000-0000-0000-0000-000000000903},
                                                 config_params: "vgpu_type_id=662",
                                                 qemu_pid: 1857616,
                                                 gpu_pci_id: 0x3b00,
                                                 vgpu_id: 1,
                                                 gpu_pci_bdf: 15111,
                                             }
Sep 24 18:17:00 gamma nvidia-vgpu-mgr[1858470]: Applying mdev UUID 00000000-0000-0000-0000-000000000903 profile overrides
Sep 24 18:17:00 gamma nvidia-vgpu-mgr[1858470]: Patching nvidia-662/fb_length: 3959422976 -> 989855744
Sep 24 18:17:00 gamma nvidia-vgpu-mgr[1858470]: cmd: 0x2080014b failed.
Sep 24 18:17:00 gamma nvidia-vgpu-mgr[1858470]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                    vgpu_type: 662,
                                                    vgpu_type_info: NvA081CtrlVgpuInfo {
                                                        vgpu_type: 662,
                                                        vgpu_name: "NVIDIA RTXA5000-4Q",
                                                        vgpu_class: "Quadro",
                                                        vgpu_signature: [],
                                                        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                        max_instance: 6,
                                                        num_heads: 4,
                                                        max_resolution_x: 7680,
                                                        max_resolution_y: 4320,
                                                        max_pixels: 58982400,
                                                        frl_config: 60,
                                                        cuda_enabled: 1,
                                                        ecc_supported: 1,
                                                        gpu_instance_size: 0,
                                                        multi_vgpu_supported: 1,
                                                        vdev_id: 0x22311567,
                                                        pdev_id: 0x2231,
                                                        profile_size: 0x100000000,
                                                        fb_length: 0xec000000,
                                                        gsp_heap_size: 0x0,
                                                        fb_reservation: 0x14000000,
                                                        mappable_video_size: 0x400000,
                                                        encoder_capacity: 0x64,
                                                        bar1_length: 0x100,
                                                        frl_enable: 1,
                                                        adapter_name: "NVIDIA RTXA5000-4Q",
                                                        adapter_name_unicode: "NVIDIA RTXA5000-4Q",
                                                        short_gpu_name_string: "GA102GL-A",
                                                        licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                        vgpu_extra_params: [],
                                                        ftrace_enable: 0,
                                                        gpu_direct_supported: 0,
                                                        nvlink_p2p_supported: 0,
                                                        multi_vgpu_exclusive: 0,
                                                        exclusive_type: 0,
                                                        exclusive_size: 0,
                                                        gpu_instance_profile_id: 4294967295,
                                                    },
                                                }

mbilker commented 5 days ago

It looks like the information is being fetched multiple times. This means an additional change is necessary to support this case.

gentoorax commented 5 days ago

I think that's because in the first run I included the same VM twice, identified in two different formats, i.e. [mdev.00000000-0000-0000-0000-000000000903] and [vm.903]. I've also tried this with just [mdev.00000000-0000-0000-0000-000000000903] or just [vm.903], with no success.

So if my override toml file contains only this...

[mdev.00000000-0000-0000-0000-000000000903]
framebuffer = 0x3B000000
# Test5 VM

Then it looks like it is only fetched once from what I can tell, but it still isn't being applied.

mbilker commented 5 days ago

> Then it looks like it is only fetched once from what I can tell, but it still isn't being applied.

I mean nvidia-vgpu-mgr itself is fetching the vGPU data twice. Notice NvA081CtrlVgpuConfigGetVgpuTypeInfoParams being printed twice with slightly different information.
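The double fetch can be checked mechanically. Below is a rough sketch that scans journal text, counts how often the type info struct is printed, and compares the last patched fb_length with the Framebuffer value nvidia-vgpu-mgr reports afterwards. The sample lines are abbreviated from the log earlier in this thread; the function name and the regexes are illustrative.

```python
import re


def summarize(journal: str) -> dict:
    """Extract fetch count, last patched fb_length and reported framebuffer."""
    fetches = len(re.findall(r"NvA081CtrlVgpuConfigGetVgpuTypeInfoParams \{", journal))
    patched = re.findall(r"Patching \S+/fb_length: \d+ -> (\d+)", journal)
    reported = re.findall(r"Framebuffer: (0x[0-9a-fA-F]+)", journal)
    return {
        "fetches": fetches,
        "patched_fb": int(patched[-1]) if patched else None,
        "reported_fb": int(reported[-1], 16) if reported else None,
    }


sample = """\
nvidia-vgpu-mgr[1849962]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
nvidia-vgpu-mgr[1849962]: Patching nvidia-662/fb_length: 3959422976 -> 989855744
nvidia-vgpu-mgr[1849962]: cmd: 0x2080014b failed.
nvidia-vgpu-mgr[1849962]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
nvidia-vgpu-mgr[1849962]: notice: vmiop_log: (0x0): Framebuffer: 0xec000000
"""

result = summarize(sample)
print(result)
print("override applied:", result["patched_fb"] == result["reported_fb"])
```

On these lines the struct is seen twice and the reported framebuffer is still the unpatched 0xec000000, so the check prints False for the override.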

gentoorax commented 5 days ago

Oh I see yeah. vgpu_signature and vgpu_extra_params are different. Let me check if this happened in the second run I did.

gentoorax commented 5 days ago

OK, I cleared the logs and did a fresh run without any noise. It looks like you are correct: it is being fetched twice.

cat ./profile_override.toml

[mdev.00000000-0000-0000-0000-000000000903]
framebuffer = 0x3B000000
# Test5

journalctl -u nvidia-vgpu-mgr.service

Sep 24 18:40:20 gamma systemd[1]: Stopping nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon...
Sep 24 18:40:20 gamma systemd[1]: nvidia-vgpu-mgr.service: Deactivated successfully.
Sep 24 18:40:20 gamma systemd[1]: Stopped nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon.
Sep 24 18:40:20 gamma systemd[1]: nvidia-vgpu-mgr.service: Consumed 44.525s CPU time.
Sep 24 18:40:31 gamma systemd[1]: Starting nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon...
Sep 24 18:40:31 gamma systemd[1]: Started nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon.
Sep 24 18:40:31 gamma nvidia-vgpu-mgr[1869161]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869161]: Nv0000CtrlVgpuGetStartDataParams {
                                                    mdev_uuid: {00000000-0000-0000-0000-000000000903},
                                                    config_params: "vgpu_type_id=662",
                                                    qemu_pid: 1869635,
                                                    gpu_pci_id: 0x3b00,
                                                    vgpu_id: 1,
                                                    gpu_pci_bdf: 15111,
                                                }
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000903 GPU PCI id 00:3b:00.7 config params vgpu_type_id=662
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=662
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_env_log: Successfully updated env symbols!
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): detected a VF at 0:3b:0.7
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                    vgpu_type: 662,
                                                    vgpu_type_info: NvA081CtrlVgpuInfo {
                                                        vgpu_type: 662,
                                                        vgpu_name: "NVIDIA RTXA5000-4Q",
                                                        vgpu_class: "Quadro",
                                                        vgpu_signature: 0x3b000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>
                                                        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                        max_instance: 6,
                                                        num_heads: 4,
                                                        max_resolution_x: 7680,
                                                        max_resolution_y: 4320,
                                                        max_pixels: 58982400,
                                                        frl_config: 60,
                                                        cuda_enabled: 1,
                                                        ecc_supported: 1,
                                                        gpu_instance_size: 0,
                                                        multi_vgpu_supported: 1,
                                                        vdev_id: 0x22311567,
                                                        pdev_id: 0x2231,
                                                        profile_size: 0x100000000,
                                                        fb_length: 0xec000000,
                                                        gsp_heap_size: 0x0,
                                                        fb_reservation: 0x14000000,
                                                        mappable_video_size: 0x400000,
                                                        encoder_capacity: 0x64,
                                                        bar1_length: 0x100,
                                                        frl_enable: 1,
                                                        adapter_name: "NVIDIA RTXA5000-4Q",
                                                        adapter_name_unicode: "NVIDIA RTXA5000-4Q",
                                                        short_gpu_name_string: "GA102GL-A",
                                                        licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                        vgpu_extra_params: 0x007f0000e8ce4cadfc7f00000100000000000000cadfae087a78000020924cadfc7f000020934cadfc7f00009001c7087a78000040924cadfc7f00002525252525252525252525252525252520924cadfc7f00001c914cadfc7f000030ffffffffffffff6d2f8ddbaded3edee8ce4cadfc7f00000100000>
                                                        ftrace_enable: 808464432,
                                                        gpu_direct_supported: 0,
                                                        nvlink_p2p_supported: 0,
                                                        multi_vgpu_exclusive: 0,
                                                        exclusive_type: 0,
                                                        exclusive_size: 0,
                                                        gpu_instance_profile_id: 4294967295,
                                                    },
                                                }
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: Applying mdev UUID 00000000-0000-0000-0000-000000000903 profile overrides
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: Patching nvidia-662/fb_length: 3959422976 -> 989855744
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: cmd: 0x2080014b failed.
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                    vgpu_type: 662,
                                                    vgpu_type_info: NvA081CtrlVgpuInfo {
                                                        vgpu_type: 662,
                                                        vgpu_name: "NVIDIA RTXA5000-4Q",
                                                        vgpu_class: "Quadro",
                                                        vgpu_signature: [],
                                                        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                        max_instance: 6,
                                                        num_heads: 4,
                                                        max_resolution_x: 7680,
                                                        max_resolution_y: 4320,
                                                        max_pixels: 58982400,
                                                        frl_config: 60,
                                                        cuda_enabled: 1,
                                                        ecc_supported: 1,
                                                        gpu_instance_size: 0,
                                                        multi_vgpu_supported: 1,
                                                        vdev_id: 0x22311567,
                                                        pdev_id: 0x2231,
                                                        profile_size: 0x100000000,
                                                        fb_length: 0xec000000,
                                                        gsp_heap_size: 0x0,
                                                        fb_reservation: 0x14000000,
                                                        mappable_video_size: 0x400000,
                                                        encoder_capacity: 0x64,
                                                        bar1_length: 0x100,
                                                        frl_enable: 1,
                                                        adapter_name: "NVIDIA RTXA5000-4Q",
                                                        adapter_name_unicode: "NVIDIA RTXA5000-4Q",
                                                        short_gpu_name_string: "GA102GL-A",
                                                        licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                        vgpu_extra_params: [],
                                                        ftrace_enable: 0,
                                                        gpu_direct_supported: 0,
                                                        nvlink_p2p_supported: 0,
                                                        multi_vgpu_exclusive: 0,
                                                        exclusive_type: 0,
                                                        exclusive_size: 0,
                                                        gpu_instance_profile_id: 4294967295,
                                                    },
                                                }
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): gpu-pci-id : 0x3b00
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): Framebuffer: 0xec000000
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): Virtual Device Id: 0x2231:0x1567
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: ######## vGPU Manager Information: ########
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: Driver Version: 550.54.10
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): vGPU BAR1 size 256 MB
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x140001)
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): vGPU migration enabled
Sep 24 18:41:34 gamma nvidia-vgpu-mgr[1869902]: cmd: 0x20801322 failed.
Sep 24 18:41:35 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): vGPU manager is running in SRIOV mode.
Sep 24 18:41:35 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: display_init inst: 0 successful
mbilker commented 5 days ago

Ok, I will commit a change to make this work when I am next available in a few hours.

gentoorax commented 5 days ago

No rush or pressure but much appreciated, let me know when you have something and I can test.

Interestingly, if you look at the profile-level override, it is applied twice (once per fetch), but the mdev (VM-level) override isn't.
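The two runs in this thread are consistent with the manager ending up with whatever the last fetch returned: with the mdev override patched only on the first fetch, the log reports Framebuffer: 0xec000000, while with the profile override patched on both fetches it reports Framebuffer: 0x3b000000. The toy model below encodes that reading. It is an assumption drawn from the logs, not the actual vgpu_unlock-rs or nvidia-vgpu-mgr logic, and `effective_fb` is a hypothetical name.

```rust
/// Stock fb_length reported by the driver in the logs.
const STOCK_FB: u64 = 0xEC00_0000;

/// Toy model: each fetch returns the stock value, optionally patched.
/// Assumption: the value from the last fetch is the one that takes effect.
fn effective_fb(patched_on_fetch: &[bool], override_fb: u64) -> u64 {
    patched_on_fetch
        .iter()
        .map(|&patched| if patched { override_fb } else { STOCK_FB })
        .last()
        .unwrap_or(STOCK_FB)
}

fn main() {
    let override_fb: u64 = 0x3B00_0000;
    // mdev override as observed: patched on the first fetch only.
    println!("mdev:    {:#x}", effective_fb(&[true, false], override_fb));
    // Profile override as observed: patched on both fetches.
    println!("profile: {:#x}", effective_fb(&[true, true], override_fb));
}
```

Under this model, an override has to be applied on every fetch to be visible in the final Framebuffer value.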

profile_override.toml

[profile.nvidia-662]
num_displays = 1
display_width = 5120
display_height = 1440
max_pixels = 7372800
cuda_enabled = 1
frl_enabled = 1
framebuffer = 0x3B000000

Produces the following service logs

Sep 24 18:43:42 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): Detected ECC enabled by guest.
Sep 24 18:43:42 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Sep 24 18:43:42 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: Driver Version: 552.74
Sep 24 18:43:42 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: vGPU version: 0x140001
Sep 24 18:43:42 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Sep 24 18:44:17 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: (0x0): vGPU license state: Licensed
Sep 24 18:47:44 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_env_log: (0x0): Plugin migration stage change none -> stop_and_copy. QEMU migration state: STOPNCOPY_ACTIVE
Sep 24 18:47:45 gamma nvidia-vgpu-mgr[1869902]: notice: vmiop_log: Stopping all vGPU migration threads
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1869161]: Nv0000CtrlVgpuGetStartDataParams {
                                                    mdev_uuid: {00000000-0000-0000-0000-000000000903},
                                                    config_params: "vgpu_type_id=662",
                                                    qemu_pid: 1872822,
                                                    gpu_pci_id: 0x3b00,
                                                    vgpu_id: 1,
                                                    gpu_pci_bdf: 15111,
                                                }
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000903 GPU PCI id 00:3b:00.7 config params vgpu_type_id=662
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=662
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_env_log: Successfully updated env symbols!
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): detected a VF at 0:3b:0.7
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                    vgpu_type: 662,
                                                    vgpu_type_info: NvA081CtrlVgpuInfo {
                                                        vgpu_type: 662,
                                                        vgpu_name: "NVIDIA RTXA5000-4Q",
                                                        vgpu_class: "Quadro",
                                                        vgpu_signature: 0x3b000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>
                                                        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                        max_instance: 6,
                                                        num_heads: 4,
                                                        max_resolution_x: 7680,
                                                        max_resolution_y: 4320,
                                                        max_pixels: 58982400,
                                                        frl_config: 60,
                                                        cuda_enabled: 1,
                                                        ecc_supported: 1,
                                                        gpu_instance_size: 0,
                                                        multi_vgpu_supported: 1,
                                                        vdev_id: 0x22311567,
                                                        pdev_id: 0x2231,
                                                        profile_size: 0x100000000,
                                                        fb_length: 0xec000000,
                                                        gsp_heap_size: 0x0,
                                                        fb_reservation: 0x14000000,
                                                        mappable_video_size: 0x400000,
                                                        encoder_capacity: 0x64,
                                                        bar1_length: 0x100,
                                                        frl_enable: 1,
                                                        adapter_name: "NVIDIA RTXA5000-4Q",
                                                        adapter_name_unicode: "NVIDIA RTXA5000-4Q",
                                                        short_gpu_name_string: "GA102GL-A",
                                                        licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                        vgpu_extra_params: 0x007f0000e8ce4cadfc7f00000100000000000000cadfae087a78000020924cadfc7f000020934cadfc7f00009001c7087a78000040924cadfc7f00002525252525252525252525252525252520924cadfc7f00001c914cadfc7f000030ffffffffffffff6d2f8ddbaded3edee8ce4cadfc7f00000100000>
                                                        ftrace_enable: 808464432,
                                                        gpu_direct_supported: 0,
                                                        nvlink_p2p_supported: 0,
                                                        multi_vgpu_exclusive: 0,
                                                        exclusive_type: 0,
                                                        exclusive_size: 0,
                                                        gpu_instance_profile_id: 4294967295,
                                                    },
                                                }
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Applying profile nvidia-662 overrides
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/num_heads: 4 -> 1
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/max_resolution_x: 7680 -> 5120
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/max_resolution_y: 4320 -> 1440
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/max_pixels: 58982400 -> 7372800
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/cuda_enabled: 1 -> 1
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/fb_length: 3959422976 -> 989855744
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/frl_enable: 1 -> 1
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: cmd: 0x2080014b failed.
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                    vgpu_type: 662,
                                                    vgpu_type_info: NvA081CtrlVgpuInfo {
                                                        vgpu_type: 662,
                                                        vgpu_name: "NVIDIA RTXA5000-4Q",
                                                        vgpu_class: "Quadro",
                                                        vgpu_signature: [],
                                                        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                        max_instance: 6,
                                                        num_heads: 4,
                                                        max_resolution_x: 7680,
                                                        max_resolution_y: 4320,
                                                        max_pixels: 58982400,
                                                        frl_config: 60,
                                                        cuda_enabled: 1,
                                                        ecc_supported: 1,
                                                        gpu_instance_size: 0,
                                                        multi_vgpu_supported: 1,
                                                        vdev_id: 0x22311567,
                                                        pdev_id: 0x2231,
                                                        profile_size: 0x100000000,
                                                        fb_length: 0xec000000,
                                                        gsp_heap_size: 0x0,
                                                        fb_reservation: 0x14000000,
                                                        mappable_video_size: 0x400000,
                                                        encoder_capacity: 0x64,
                                                        bar1_length: 0x100,
                                                        frl_enable: 1,
                                                        adapter_name: "NVIDIA RTXA5000-4Q",
                                                        adapter_name_unicode: "NVIDIA RTXA5000-4Q",
                                                        short_gpu_name_string: "GA102GL-A",
                                                        licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                        vgpu_extra_params: [],
                                                        ftrace_enable: 0,
                                                        gpu_direct_supported: 0,
                                                        nvlink_p2p_supported: 0,
                                                        multi_vgpu_exclusive: 0,
                                                        exclusive_type: 0,
                                                        exclusive_size: 0,
                                                        gpu_instance_profile_id: 4294967295,
                                                    },
                                                }
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Applying profile nvidia-662 overrides
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/num_heads: 4 -> 1
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/max_resolution_x: 7680 -> 5120
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/max_resolution_y: 4320 -> 1440
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/max_pixels: 58982400 -> 7372800
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/cuda_enabled: 1 -> 1
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/fb_length: 3959422976 -> 989855744
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: Patching nvidia-662/frl_enable: 1 -> 1
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): gpu-pci-id : 0x3b00
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): Framebuffer: 0x3b000000
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): Virtual Device Id: 0x2231:0x1567
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: ######## vGPU Manager Information: ########
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: Driver Version: 550.54.10
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): vGPU BAR1 size 256 MB
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x140001)
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): vGPU migration enabled
Sep 24 18:48:00 gamma nvidia-vgpu-mgr[1873063]: cmd: 0x20801322 failed.
Sep 24 18:48:01 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): vGPU manager is running in SRIOV mode.
Sep 24 18:48:01 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: display_init inst: 0 successful
Sep 24 18:48:14 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): Detected ECC enabled by guest.
Sep 24 18:48:14 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Sep 24 18:48:14 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: Driver Version: 552.74
Sep 24 18:48:14 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: vGPU version: 0x140001
Sep 24 18:48:14 gamma nvidia-vgpu-mgr[1873063]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
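For reference, the max_pixels value in the profile override above is simply display_width times display_height (5120 x 1440 = 7372800). A small sketch of that arithmetic follows; the `pixels` helper is hypothetical and only illustrates how the number was derived.

```rust
/// Pixel count for a single display of the given size.
/// Hypothetical helper, for illustration only.
fn pixels(width: u32, height: u32) -> u64 {
    u64::from(width) * u64::from(height)
}

fn main() {
    // Values from the [profile.nvidia-662] override in this comment.
    println!("override max_pixels = {}", pixels(5120, 1440));
    // Stock resolution limits from the type info dump, for comparison.
    println!("7680 x 4320         = {}", pixels(7680, 4320));
}
```

Note that the stock max_pixels of 58982400 in the dump is larger than 7680 x 4320, so the stock value is not just the product of the two resolution limits.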
mbilker commented 5 days ago

@gentoorax I pushed a commit that should fix your issue.

mbilker commented 5 days ago

On my testing box, which does not use an SR-IOV capable card, I only see it fetch the vGPU type info once at the start of nvidia-vgpu-mgr and once when it starts the vGPU. I am pretty sure @gentoorax is encountering different behavior because an officially supported SR-IOV capable card starts up its vGPU instances differently.

mbilker commented 5 days ago

If @gentoorax can confirm that my commit fixes the issue, then I'll mark this as version 2.5.1.

gentoorax commented 4 days ago

@mbilker awesome, at work for the next several hours, but will give it a test when finished.

gentoorax commented 4 days ago

@mbilker good news! Managed to test this on my lunch break and it is working! I did have to reboot the host (even though I had restarted the services), but I'm not sure if that's just due to my messing around previously. After reboot all seems well though.
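The log excerpt in this comment shows the order in which the layers are applied: the profile overrides first (num_displays patches num_heads, and so on), then the mdev UUID overrides (framebuffer patches fb_length). A minimal sketch of that layering is below. The struct and function names are hypothetical and only two fields are modeled; this is not the vgpu_unlock-rs implementation.

```rust
/// Subset of override keys seen in this thread; `None` means "not set".
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Overrides {
    num_displays: Option<u32>,
    framebuffer: Option<u64>,
}

/// Values in effect for one vGPU (field names as printed in the logs).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Effective {
    num_heads: u32,
    fb_length: u64,
}

/// Apply one override layer on top of the current values.
fn apply(mut cur: Effective, layer: &Overrides) -> Effective {
    if let Some(n) = layer.num_displays {
        cur.num_heads = n;
    }
    if let Some(fb) = layer.framebuffer {
        cur.fb_length = fb;
    }
    cur
}

fn main() {
    let stock = Effective { num_heads: 4, fb_length: 0xEC00_0000 };
    let profile = Overrides { num_displays: Some(1), framebuffer: None };
    let mdev = Overrides { framebuffer: Some(0x3B00_0000), ..Default::default() };
    // Profile layer first, then the per-mdev layer, matching the log order.
    let result = apply(apply(stock, &profile), &mdev);
    println!("{:?}", result);
}
```

Because the mdev layer is applied last, a per-VM framebuffer would take precedence over any profile-level one in this model.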

root@gamma:~# nvidia-smi
Wed Sep 25 12:40:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.10              Driver Version: 550.54.10      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A5000               On  |   00000000:3B:00.0 Off |                    0 |
| 30%   55C    P8             32W /  230W |    6144MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4237    C+G   vgpu                                         1216MiB |
|    0   N/A  N/A      4925    C+G   vgpu                                         1216MiB |
|    0   N/A  N/A      6408    C+G   vgpu                                         3712MiB |
+-----------------------------------------------------------------------------------------+

...
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Applying profile nvidia-662 overrides
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/num_heads: 4 -> 1
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/max_resolution_x: 7680 -> 5120
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/max_resolution_y: 4320 -> 1440
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/max_pixels: 58982400 -> 7372800
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/cuda_enabled: 1 -> 1
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/frl_enable: 1 -> 1
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Applying mdev UUID 00000000-0000-0000-0000-000000000903 profile overrides
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/fb_length: 3959422976 -> 989855744
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: cmd: 0x2080014b failed.
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
                                                 vgpu_type: 662,
                                                 vgpu_type_info: NvA081CtrlVgpuInfo {
                                                     vgpu_type: 662,
                                                     vgpu_name: "NVIDIA RTXA5000-4Q",
                                                     vgpu_class: "Quadro",
                                                     vgpu_signature: [],
                                                     license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                     max_instance: 6,
                                                     num_heads: 4,
                                                     max_resolution_x: 7680,
                                                     max_resolution_y: 4320,
                                                     max_pixels: 58982400,
                                                     frl_config: 60,
                                                     cuda_enabled: 1,
                                                     ecc_supported: 1,
                                                     gpu_instance_size: 0,
                                                     multi_vgpu_supported: 1,
                                                     vdev_id: 0x22311567,
                                                     pdev_id: 0x2231,
                                                     profile_size: 0x100000000,
                                                     fb_length: 0xec000000,
                                                     gsp_heap_size: 0x0,
                                                     fb_reservation: 0x14000000,
                                                     mappable_video_size: 0x400000,
                                                     encoder_capacity: 0x64,
                                                     bar1_length: 0x100,
                                                     frl_enable: 1,
                                                     adapter_name: "NVIDIA RTXA5000-4Q",
                                                     adapter_name_unicode: "NVIDIA RTXA5000-4Q",
                                                     short_gpu_name_string: "GA102GL-A",
                                                     licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                     vgpu_extra_params: [],
                                                     ftrace_enable: 0,
                                                     gpu_direct_supported: 0,
                                                     nvlink_p2p_supported: 0,
                                                     multi_vgpu_exclusive: 0,
                                                     exclusive_type: 0,
                                                     exclusive_size: 0,
                                                     gpu_instance_profile_id: 4294967295,
                                                 },
                                             }
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Applying profile nvidia-662 overrides
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/num_heads: 4 -> 1
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/max_resolution_x: 7680 -> 5120
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/max_resolution_y: 4320 -> 1440
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/max_pixels: 58982400 -> 7372800
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/cuda_enabled: 1 -> 1
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/frl_enable: 1 -> 1
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Applying mdev UUID 00000000-0000-0000-0000-000000000903 profile overrides
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: Patching nvidia-662/fb_length: 3959422976 -> 989855744
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: notice: vmiop_log: (0x0): gpu-pci-id : 0x3b00
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: notice: vmiop_log: (0x0): Framebuffer: 0x3b000000
Sep 25 12:36:11 gamma nvidia-vgpu-mgr[4237]: notice: vmiop_log: (0x0): Virtual Device Id: 0x2231:0x1567