intel-gpu / intel-gpu-i915-backports

Flex140 DataCenter GPU not detected by driver with core i9-12900K and ubuntu 22.04.4 #167

Closed jcourtat-ektacom closed 4 months ago

jcourtat-ektacom commented 7 months ago

Hello, I want to benchmark HEVC encoding on our Intel Flex 140 Data Center GPU card. I installed the card in a Dell Precision 3660 (Core i9-12900K) and installed the latest Ubuntu 22.04.4 with kernel 5.15.

I followed the Ubuntu chapter of the guide at https://dgpu-docs.intel.com/driver/installation.html#install-steps.

After installing the backported i915 driver (automatic DKMS driver installation), the boot process hangs while loading the i915 driver, right after it detects the integrated VGA device (00:02.0 VGA compatible controller: Intel Corporation AlderLake-S GT1 (rev 0c)). The only way I found to let the boot continue was to add i915.modeset=0 to the kernel command line. After booting, I do not see any line in dmesg showing that the driver has detected my Flex card, and xpu-smi discovery only shows the integrated GPU:

ekta@telli:~/log_issues$ xpu-smi discovery
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) UHD Graphics 770                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-000c46808086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+

I tried disabling Secure Boot, but it had no effect. I also tried installing an unsigned kernel before installing the i915-backport-dkms driver, again with no effect.

Then I built another branch of the driver, backport/RELEASE_2335_23.6. I created a Debian package from it using the commands listed in README.rst and copied the firmware files. This time the boot process does not hang, but my card is still not detected.

What am I missing?

Regards

Attachments: dmidecode.txt dri.node.tar.gz hw_info.txt inxi.txt kern_i915_dkms_1.23.6.42.230425.56+i1-1_all.log lspci.txt packages.txt

smuqthya commented 7 months ago

@jcourtat-ektacom, it seems the card is visible in the PCI data, but the MMIO BAR re-allocation failed:

[ 0.874206] pci 0000:07:00.0: BAR 2: no space for [mem size 0x200000000 64bit pref]
[ 0.874207] pci 0000:07:00.0: BAR 2: failed to assign [mem size 0x200000000 64bit pref]

I see that you do not have the kernel command-line parameter that is recommended for Flex dGPUs (see the link below). Can you please try it and let us know whether it works?

https://dgpu-docs.intel.com/driver/installation.html#multi-card-deployments
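To make the failing allocation concrete, the two log lines quoted above can be decoded with a short script. This is only an illustrative sketch: the `failed_bars` helper and its regular expression are hypothetical and not part of any Intel tool, and the pattern is written against the two message shapes shown in this thread only. It shows that BAR 2 of device 0000:07:00.0 asks for 0x200000000 bytes, which is 8 GiB of 64-bit prefetchable address space.

```python
import re

# Matches kernel messages of the two shapes quoted in this thread, e.g.:
#   pci 0000:07:00.0: BAR 2: failed to assign [mem size 0x200000000 64bit pref]
BAR_FAIL = re.compile(
    r"pci (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): (?:no space for|failed to assign) "
    r"\[mem size (?P<size>0x[0-9a-f]+)"
)

def failed_bars(dmesg_text):
    """Return a sorted list of unique (bdf, bar_index, size_in_GiB) tuples."""
    found = set()
    for m in BAR_FAIL.finditer(dmesg_text):
        size_gib = int(m.group("size"), 16) / 2**30
        found.add((m.group("bdf"), int(m.group("bar")), size_gib))
    return sorted(found)

# The two lines from the reporter's boot log.
sample = """\
[ 0.874206] pci 0000:07:00.0: BAR 2: no space for [mem size 0x200000000 64bit pref]
[ 0.874207] pci 0000:07:00.0: BAR 2: failed to assign [mem size 0x200000000 64bit pref]
"""

for bdf, bar, gib in failed_bars(sample):
    print(f"{bdf} BAR {bar}: {gib:g} GiB could not be placed")
```

On a live system the same function could be fed the output of `dmesg` instead of the hard-coded sample.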

jcourtat-ektacom commented 7 months ago

Hi smuqthya, I added pci=realloc=off to the kernel boot line, but without success. So I tried something else: on Ubuntu Desktop, with branch backport/RELEASE_2405_23.10, I ran make i915dkmsdeb-pkg OS_DISTRIBUTION=UBUNTU_22.04_DESKTOP and installed the resulting package along with the firmware. The card is detected and loaded successfully by the driver under some circumstances:

If I reboot the workstation, the i915 driver does not see the device at all.

Once the driver has loaded successfully and detected my Flex card, I can see it with the xpu-smi tool:

xpu-smi discovery
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) UHD Graphics 770                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-000c46808086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Data Center GPU Flex 140                                       |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0000-efaf-20fc6ecf15b1                                       |
|           | PCI BDF Address: 0000:07:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 2         | Device Name: Intel(R) Data Center GPU Flex 140                                       |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0000-9bea-9abe6a652303                                       |
|           | PCI BDF Address: 0000:0a:00.0                                                        |
|           | DRM Device: /dev/dri/card2                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
sudo xpu-smi stats -d 1
+-----------------------------+--------------------------------------------------------------------+
| Device ID                   | 1                                                                  |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%)         | 0                                                                  |
| EU Array Active (%)         | N/A                                                                |
| EU Array Stall (%)          | N/A                                                                |
| EU Array Idle (%)           | N/A                                                                |
|                             |                                                                    |
| Compute Engine Util (%)     | 0; Engine 0: 0                                                     |
| Render Engine Util (%)      | 0; Engine 0: 0                                                     |
| Media Engine Util (%)       | 0                                                                  |
| Decoder Engine Util (%)     | Engine 0: 0, Engine 1: 0                                           |
| Encoder Engine Util (%)     | Engine 0: 0, Engine 1: 0                                           |
| Copy Engine Util (%)        | 0; Engine 0: 0                                                     |
| Media EM Engine Util (%)    | Engine 0: 0, Engine 1: 0                                           |
| 3D Engine Util (%)          | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| Reset                       | N/A                                                                |
| Programming Errors          | N/A                                                                |
| Driver Errors               | N/A                                                                |
| Cache Errors Correctable    | N/A                                                                |
| Cache Errors Uncorrectable  | N/A                                                                |
| Mem Errors Correctable      | N/A                                                                |
| Mem Errors Uncorrectable    | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W)               | 15                                                                 |
| GPU Frequency (MHz)         | 1950                                                               |
| Media Engine Freq (MHz)     | 975                                                                |
| GPU Core Temperature (C)    | 74                                                                 |
| GPU Memory Temperature (C)  | N/A                                                                |
| GPU Memory Read (kB/s)      | 1440                                                               |
| GPU Memory Write (kB/s)     | 301                                                                |
| GPU Memory Bandwidth (%)    | 0                                                                  |
| GPU Memory Used (MiB)       | 31                                                                 |
| GPU Memory Util (%)         | 1                                                                  |
| Xe Link Throughput (kB/s)   | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+

I then started encoding an RTP stream using renderD129:

 ffmpeg -protocol_whitelist file,udp,rtp,sdp -hide_banner -nostats -nostdin -hwaccel vaapi -hwaccel_device /dev/dri/renderD129 -hwaccel_output_format vaapi -i /home/oem/testSDP.sdp -vf "scale_vaapi=w=288:h=162:mode=fast,format=nv12|vaapi" -c:v h264_vaapi -low_power true -coder ac -g 12 -f mpegts tv.ts

Sometimes it works, sometimes not, but after a few runs I always get the following kernel messages:

[  808.909971] pcieport 0000:00:01.0: AER: Corrected error received: 0000:04:08.0
[  808.909992] pcieport 0000:04:08.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  808.909997] pcieport 0000:04:08.0:   device [1000:c010] error status/mask=00000041/0000e000
[  808.910003] pcieport 0000:04:08.0:    [ 0] RxErr                  (First)
[  808.910008] pcieport 0000:04:08.0:    [ 6] BadTLP                
[  808.912095] pcieport 0000:00:01.0: AER: Corrected error received: 0000:04:18.0
[  808.912115] pcieport 0000:04:18.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  808.912120] pcieport 0000:04:18.0:   device [1000:c010] error status/mask=00000001/0000e000
[  808.912126] pcieport 0000:04:18.0:    [ 0] RxErr                  (First)
[  820.943330] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  821.544052] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  822.144710] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  822.746485] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  823.347072] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  824.698872] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  825.299513] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  825.900082] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] render: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  826.500647] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  827.101214] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] render: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  827.701787] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  828.302366] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  828.902940] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  829.503517] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  830.104091] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] vdbox0: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  830.704648] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  831.305226] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] vdbox2: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  831.905783] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  832.506360] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  833.106933] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] render: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  833.707492] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  834.308068] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] render: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  834.908641] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  835.509211] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  836.109786] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] render: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  836.710373] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] render: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  837.310943] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  837.911665] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] vdbox0: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  838.512239] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] vdbox2: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  838.612293] i915 0000:07:00.0: [drm] debugger: gt0 l3 invalidation fail: incompatible bios(-13). Surfaces need to be declared uncached to avoid coherency issues!

It looks like the Flex card is not very stable.
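A log like the one above is easier to read once it is condensed. The following is a minimal sketch, assuming only the two message shapes that appear in this thread; the `summarize` helper and both regular expressions are hypothetical, not part of the driver or of xpu-smi. It counts corrected AER errors per reporting port and "MMIO unreliable" errors per device and forcewake domain. A register read of 0xFFFFFFFF is commonly what a PCIe read returns when the device has stopped responding, so a burst of these right after link-level errors is consistent with the card dropping off the bus, though the log alone does not prove it.

```python
import re
from collections import Counter

BDF = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]"

# "pcieport 0000:00:01.0: AER: Corrected error received: 0000:04:08.0"
AER = re.compile(rf"AER: Corrected error received: (?P<bdf>{BDF})")

# "i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (...)"
MMIO = re.compile(
    rf"i915 (?P<bdf>{BDF}): \[drm\] \*ERROR\* GT\d+ \[GT OTHER\] "
    r"(?P<domain>\w+): MMIO unreliable"
)

def summarize(dmesg_text):
    """Return (AER counts per port, MMIO error counts per (device, domain))."""
    aer = Counter(m.group("bdf") for m in AER.finditer(dmesg_text))
    mmio = Counter(
        (m.group("bdf"), m.group("domain")) for m in MMIO.finditer(dmesg_text)
    )
    return aer, mmio

# A few lines copied from the log in this comment.
sample = """\
[  808.909971] pcieport 0000:00:01.0: AER: Corrected error received: 0000:04:08.0
[  808.912095] pcieport 0000:00:01.0: AER: Corrected error received: 0000:04:18.0
[  820.943330] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  821.544052] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] gt: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  825.900082] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] render: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
[  830.104091] i915 0000:07:00.0: [drm] *ERROR* GT0 [GT OTHER] vdbox0: MMIO unreliable (forcewake register returns 0xFFFFFFFF)!
"""

aer_counts, mmio_counts = summarize(sample)
for port, n in sorted(aer_counts.items()):
    print(f"AER corrected errors at {port}: {n}")
for (dev, domain), n in sorted(mmio_counts.items()):
    print(f"MMIO unreliable on {dev} ({domain}): {n}")
```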

smuqthya commented 7 months ago

@jcourtat-ektacom, can you please check the value of PCI_REALLOC_ENABLE_AUTO in the kernel you are booting?

CONFIG_PCI_REALLOC_ENABLE_AUTO

jcourtat-ektacom commented 6 months ago

oem@paris:~$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-6.5.0-28-generic root=UUID=18e629b6-9b89-40ad-8852-84000750e3ad ro quiet splash pci=realloc=off vt.handoff=7

oem@paris:~$ cat /lib/modules/$(uname -r)/build/.config | grep CONFIG_PCI_REALLOC
CONFIG_PCI_REALLOC_ENABLE_AUTO=y
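
The two checks above can also be scripted, which helps when comparing several hosts. This is a minimal sketch with hypothetical helper names (`pci_options`, `config_value`); the sample strings are copied from the output in this comment. On a live system the inputs would come from /proc/cmdline and the kernel .config file used above.

```python
def pci_options(cmdline):
    """Return the comma-separated options passed via pci=... on the command line."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(token[len("pci="):].split(","))
    return options

def config_value(config_text, name):
    """Return the value of a CONFIG_* symbol, or None if it is not set."""
    for line in config_text.splitlines():
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    return None

# Values reported in this comment.
cmdline = ("BOOT_IMAGE=/boot/vmlinuz-6.5.0-28-generic "
           "root=UUID=18e629b6-9b89-40ad-8852-84000750e3ad "
           "ro quiet splash pci=realloc=off vt.handoff=7")
config = "CONFIG_PCI_REALLOC_ENABLE_AUTO=y\n"

print("pci options:", pci_options(cmdline))
print("CONFIG_PCI_REALLOC_ENABLE_AUTO:",
      config_value(config, "CONFIG_PCI_REALLOC_ENABLE_AUTO"))
```

For this host the script reports that realloc=off is on the command line while the kernel was built with CONFIG_PCI_REALLOC_ENABLE_AUTO=y.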

smuqthya commented 5 months ago

@jcourtat-ektacom, the 23.6 release seems too old. Can you please check with the latest release and see whether you still see this issue?

smuqthya commented 4 months ago

@jcourtat-ektacom, any update on the tests with the latest backport release? If not, we can close this issue.

jcourtat-ektacom commented 4 months ago

I managed to install the Flex board in a 3rd-gen Xeon Gold server, and everything worked out of the box after installing the Intel backport driver from the Intel package repository. I think a desktop host (Core architecture) is simply not what this board is designed for. Thanks for your help, smuqthya.