intel / linux-npu-driver

Intel® NPU (Neural Processing Unit) Driver
MIT License

ESXi passthrough of NPU to VM - failed with error -110 #46

Open lamw opened 2 months ago

lamw commented 2 months ago

Are there additional debug/verbose logs available from the NPU Linux driver? I've been able to successfully do PCIe passthrough of the NPU from an Intel 14th Gen system, but the driver fails to boot the firmware (-110) and gives no further details. I'm trying to understand whether the cause is the ESXi hypervisor and passthrough, or something else.

Here's a snippet from dmesg (this is after installing the required drivers on Ubuntu 24.04):

[    2.036821] intel_vpu 0000:02:05.0: enabling device (0000 -> 0002)
[    2.052504] intel_vpu 0000:02:05.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[    3.078146] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_boot(): Failed to boot the firmware: -110
[    3.078427] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803000, fetch addr: 0x0
[    3.078889] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803010, fetch addr: 0x0
[    3.093452] intel_vpu 0000:02:05.0: [drm] ivpu_hw_37xx_power_down(): VPU not idle during power down
[    3.095388] intel_vpu: probe of 0000:02:05.0 failed with error -110

kwachows commented 2 months ago

Could you please try loading the NPU kernel driver with the force_snoop=1 module parameter set? (That is: rmmod intel_vpu; modprobe intel_vpu force_snoop=1)

lamw commented 2 months ago

Using the 1.6.0 instructions, it looks like force_snoop=1 isn't working?

root@ubuntu:~# rmmod intel_vpu; modprobe intel_vpu force_snoop=1
root@ubuntu:~# dmesg|grep vpu
[    1.911597] intel_vpu 0000:02:05.0: enabling device (0000 -> 0002)
[    1.921169] intel_vpu 0000:02:05.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[    2.980059] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_boot(): Failed to boot the firmware: -110
[    2.980328] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803000, fetch addr: 0x0
[    2.980786] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803010, fetch addr: 0x0
[    2.987821] intel_vpu 0000:02:05.0: [drm] ivpu_hw_37xx_power_down(): VPU not idle during power down
[    2.987995] intel_vpu: probe of 0000:02:05.0 failed with error -110
[  160.695061] intel_vpu: unknown parameter 'force_snoop' ignored
[  160.697978] intel_vpu 0000:02:05.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[  161.722348] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_boot(): Failed to boot the firmware: -110
[  161.722367] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803000, fetch addr: 0x0
[  161.722387] intel_vpu 0000:02:05.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803010, fetch addr: 0x0
[  161.728714] intel_vpu 0000:02:05.0: [drm] ivpu_hw_37xx_power_down(): VPU not idle during power down
[  161.728995] intel_vpu: probe of 0000:02:05.0 failed with error -110

kwachows commented 2 months ago

The issue you are observing might be related to the hypervisor cache configuration. There is a patch that enables the force_snoop module parameter for the intel_vpu driver. You could try applying this patch, or updating the kernel to 6.11, which already contains the patch, and then retry with this parameter set.
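Before retrying, it may help to confirm whether the running kernel's intel_vpu module accepts force_snoop at all. The following is a minimal sketch, not part of the driver or its packaging: the helper names are hypothetical, the first check assumes the usual `modinfo -p` output format of `name:description (type)`, and the second simply looks for the "unknown parameter" message that appears in the earlier dmesg output.

```shell
#!/bin/sh
# Hypothetical helpers; neither is shipped with the driver.

# Reads `modinfo -p <module>` output on stdin and succeeds if the named
# parameter is listed (assumes lines of the form "name:description (type)").
module_has_param() {
    grep -q "^$1:"
}

# Reads dmesg output on stdin and succeeds if the kernel reported that the
# named parameter was ignored when loading intel_vpu.
param_was_ignored() {
    grep -q "intel_vpu: unknown parameter '$1' ignored"
}

# Example use on the guest:
#   modinfo -p intel_vpu | module_has_param force_snoop && echo "force_snoop is supported"
#   dmesg | param_was_ignored force_snoop && echo "this kernel ignored force_snoop"
```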

lamw commented 2 months ago

Hm ... so I just installed the 6.11 kernel:

# uname -r
6.11.0-061100rc6-generic

When I run rmmod intel_vpu; modprobe intel_vpu force_snoop=1, I see the following in dmesg:

[    4.187932] intel_vpu 0000:13:00.0: [drm] *ERROR* ivpu_boot(): Failed to boot the firmware: -110
[    4.188097] intel_vpu 0000:13:00.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803000, fetch addr: 0x0
[    4.188361] intel_vpu 0000:13:00.0: [drm] *ERROR* ivpu_mmu_dump_event(): MMU EVTQ: 0x10 (Translation fault) SSID: 0 SID: 3, e[2] 00000000, e[3] 00000208, in addr: 0x84803010, fetch addr: 0x0
[    4.198809] intel_vpu 0000:13:00.0: [drm] ivpu_hw_power_down(): NPU not idle during power down
[    4.199002] intel_vpu 0000:13:00.0: probe with driver intel_vpu failed with error -110
[  386.779470] intel_vpu 0000:13:00.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[  386.904593] [drm] Initialized intel_vpu 1.0.0 for 0000:13:00.0 on minor 0

Interestingly, even though there are still some errors earlier in the log, I now see the VPU initialized. Does this mean it's good?

I'm able to see the accel0 device :D

# ls /dev/accel/accel0
/dev/accel/accel0

If I reboot the system without force_snoop=1, it fails as before.

kwachows commented 2 months ago

The force_snoop=1 parameter is only activated when explicitly specified with the modprobe command line. To have this parameter enabled by default when the module is loaded, you can create a configuration file in the /etc/modprobe.d/ directory. Please follow these steps:

  1. Create a file /etc/modprobe.d/intel_vpu.conf
  2. Add the following line to the file: options intel_vpu force_snoop=1
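The two steps above can be sketched as a small shell helper. This is only an illustration, not an official install script: the function name write_vpu_conf and its optional directory argument are hypothetical, added so the helper can be tried against a scratch directory before touching /etc/modprobe.d (which requires root).

```shell
#!/bin/sh
# Hypothetical helper: write the modprobe.d drop-in described in the steps.
# $1 is the target directory; it defaults to /etc/modprobe.d (needs root).
write_vpu_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'options intel_vpu force_snoop=1\n' > "$conf_dir/intel_vpu.conf"
}

# Example use as root:
#   write_vpu_conf
#   cat /etc/modprobe.d/intel_vpu.conf
```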

After you reboot your system, the force_snoop=1 parameter should be applied automatically when the intel_vpu module is loaded.

As for the log message "[ 386.904593] [drm] Initialized intel_vpu 1.0.0 for 0000:13:00.0 on minor 0" and the presence of /dev/accel/accel0: these are indeed indications that the driver has been successfully initialized and the device is correctly set up.
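Both success indicators can be checked together after a reboot. A rough sketch, assuming the "Initialized intel_vpu" wording seen in the log from this thread and the /dev/accel/accel0 path reported above; the function names are hypothetical:

```shell
#!/bin/sh
# Reads dmesg output on stdin; succeeds if the DRM core reported that
# intel_vpu was initialized (message wording taken from this thread's log).
vpu_initialized() {
    grep -q '\[drm\] Initialized intel_vpu '
}

# Succeeds if the accel device node exists. $1 overrides the default path,
# which is the one reported earlier in this thread.
accel_node_present() {
    [ -e "${1:-/dev/accel/accel0}" ]
}

# Example use on the guest:
#   dmesg | vpu_initialized && accel_node_present && echo "NPU driver is up"
```

A failed probe (error -110) leaves neither indicator in place, so if either check fails it is worth looking at dmesg again for the ivpu_boot() error.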

lamw commented 2 months ago

Thanks for the commands to persist the parameter. I'm trying to understand why this is needed. Is something missing in how the device is presented to the guest that prevents initialization by default?

I'm also seeing an issue with device passthrough to a Windows system, which throws the classic Error Code 43. Is there a similar parameter for the Windows driver?