Closed pengxin99 closed 1 year ago
@pengxin99 After installing xpumanager, the service takes one to two minutes to start. If the issue persists, please provide more detailed information. You can use the command "systemctl status xpum".
It still fails after two days. Detailed information:
[pengxin@scsa-spr-smc ~]$ sudo systemctl status xpum
● xpum.service - XPUM daemon
Loaded: loaded (/usr/lib/systemd/system/xpum.service; enabled; vendor preset: disabled)
Active: failed (Result: signal) since Fri 2023-03-17 19:09:19 CST; 2min 4s ago
Process: 3604632 ExecStart=/usr/bin/xpumd -p /var/xpum_daemon.pid -d /usr/lib/xpum/dump (code=killed, signal=SEGV)
Process: 3604629 ExecStartPre=/bin/sh -c ulimit -c unlimited (code=exited, status=0/SUCCESS)
Main PID: 3604632 (code=killed, signal=SEGV)
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: me: error: Cannot connect to client [-16]:Device or resource busy
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/igsc_lib.c:gsc_driver_init():213) Error in HECI connect (9)
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/ifr.c:gsc_gfsp_memory_errors():543) Cannot initialize driver, status 1
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/igsc_lib.c:gsc_tee_command():628) Error in HECI write (10)
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/ifr.c:igsc_gfsp_get_health_indicator():1723) Invalid HECI message response 1
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: me: error: Cannot connect to client [-16]:Device or resource busy
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/igsc_lib.c:gsc_driver_init():213) Error in HECI connect (9)
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/ifr.c:gsc_gfsp_memory_errors():543) Cannot initialize driver, status 1
Mar 17 19:09:19 scsa-spr-smc systemd[1]: xpum.service: Main process exited, code=killed, status=11/SEGV
Mar 17 19:09:19 scsa-spr-smc systemd[1]: xpum.service: Failed with result 'signal'.
Could you please provide the logs from journalctl?
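For anyone gathering the same information, here is a minimal sketch of one way to narrow the journal down to the relevant lines. The unit name `xpum` comes from the status output above. The helper name `filter_xpum_errors` and the choice of patterns are illustrative assumptions based on the messages seen in this thread, not part of xpumanager.

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and keep only lines that
# look like the failures reported in this thread (IGSC/HECI errors, the
# "me: error" messages, and the systemd crash notices).
filter_xpum_errors() {
    grep -E 'IGSC:|me: error|SEGV|Failed with result'
}

# Typical use on the affected machine (not executed here):
#   sudo journalctl -u xpum -b --no-pager | filter_xpum_errors > xpum-errors.log
```

The `-u`, `-b`, and `--no-pager` options restrict journalctl to one unit, to the current boot, and to plain non-paged output, which makes the result easy to attach to an issue.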
@pengxin99 It looks like the GPU driver isn't working properly on your system. You can check it with the command "dmesg | grep i915". Where did you get the GPU driver for CentOS 8 Stream? Please note that the GPU driver for RHEL 8 doesn't work on CentOS 8 Stream.
Hi, thanks @taotod, @huiqiwa. I installed the driver for RHEL 8, so is this issue expected? And where can I find the driver for CentOS 8? BTW, the i915 log is below:
[pengxin@scsa-spr-smc build]$ dmesg|grep i915
[ 9.192608] i915 0000:2e:00.0: [drm] GT count: 1, enabled: 1
[ 9.193633] i915 0000:2e:00.0: [drm] Using Transparent Hugepages
[ 9.193926] i915 0000:2e:00.0: [drm] Local memory IO size: 0x000000013cc00000
[ 9.193928] i915 0000:2e:00.0: [drm] Local memory available: 0x000000013cc00000
[ 9.548209] i915 0000:2e:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[ 9.548212] i915 0000:2e:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[ 9.565662] i915 0000:2e:00.0: [drm] GuC submission enabled
[ 9.565663] i915 0000:2e:00.0: [drm] GuC SLPC enabled
[ 9.565986] i915 0000:2e:00.0: [drm] GuC RC: enabled
[ 9.586595] i915 0000:2e:00.0: GT0: local0 bcs'0.0 clear bandwidth:56177 MB/s
[ 9.605672] [drm] Initialized i915 1.6.0 20201103 for 0000:2e:00.0 on minor 0
[ 9.782949] i915 0000:32:00.0: [drm] GT count: 1, enabled: 1
[ 9.783996] i915 0000:32:00.0: [drm] Using Transparent Hugepages
[ 9.784048] i915 0000:32:00.0: [drm] Local memory IO size: 0x000000013cc00000
[ 9.784049] i915 0000:32:00.0: [drm] Local memory available: 0x000000013cc00000
[ 9.788646] i915 0000:32:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[ 9.788649] i915 0000:32:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[ 9.802822] i915 0000:32:00.0: [drm] GuC submission enabled
[ 9.802824] i915 0000:32:00.0: [drm] GuC SLPC enabled
[ 9.803004] i915 0000:32:00.0: [drm] GuC RC: enabled
[ 9.816203] i915 0000:32:00.0: GT0: local0 bcs'0.0 clear bandwidth:77117 MB/s
[ 9.816400] [drm] Initialized i915 1.6.0 20201103 for 0000:32:00.0 on minor 2
[ 9.820408] i915 0000:ae:00.0: [drm] GT count: 1, enabled: 1
[ 9.821438] i915 0000:ae:00.0: [drm] Using Transparent Hugepages
[ 9.821496] i915 0000:ae:00.0: [drm] Local memory IO size: 0x000000013cc00000
[ 9.821497] i915 0000:ae:00.0: [drm] Local memory available: 0x000000013cc00000
[ 9.826433] i915 0000:ae:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[ 9.826437] i915 0000:ae:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[ 9.840831] i915 0000:ae:00.0: [drm] GuC submission enabled
[ 9.840833] i915 0000:ae:00.0: [drm] GuC SLPC enabled
[ 9.841016] i915 0000:ae:00.0: [drm] GuC RC: enabled
[ 9.853410] i915 0000:ae:00.0: GT0: local0 bcs'0.0 clear bandwidth:80109 MB/s
[ 9.853666] [drm] Initialized i915 1.6.0 20201103 for 0000:ae:00.0 on minor 3
[ 9.857624] i915 0000:b2:00.0: [drm] GT count: 1, enabled: 1
[ 9.858587] i915 0000:b2:00.0: [drm] Using Transparent Hugepages
[ 9.858656] i915 0000:b2:00.0: [drm] Local memory IO size: 0x000000013cc00000
[ 9.858657] i915 0000:b2:00.0: [drm] Local memory available: 0x000000013cc00000
[ 9.863360] i915 0000:b2:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[ 9.863363] i915 0000:b2:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[ 9.877842] i915 0000:b2:00.0: [drm] GuC submission enabled
[ 9.877843] i915 0000:b2:00.0: [drm] GuC SLPC enabled
[ 9.878024] i915 0000:b2:00.0: [drm] GuC RC: enabled
[ 9.890516] i915 0000:b2:00.0: GT0: local0 bcs'0.0 clear bandwidth:92061 MB/s
[ 9.890755] [drm] Initialized i915 1.6.0 20201103 for 0000:b2:00.0 on minor 4
[ 14.546967] Creating 4 MTD partitions on "i915.spi.11776":
[ 14.546970] 0x000000000000-0x000000001000 : "i915.spi.11776.DESCRIPTOR"
[ 14.547797] 0x000000001000-0x0000005f0000 : "i915.spi.11776.GSC"
[ 14.548568] 0x0000005f0000-0x0000007f0000 : "i915.spi.11776.OptionROM"
[ 14.551425] 0x0000007f0000-0x000000800000 : "i915.spi.11776.DAM"
[ 14.556456] Creating 4 MTD partitions on "i915.spi.12800":
[ 14.556457] 0x000000000000-0x000000001000 : "i915.spi.12800.DESCRIPTOR"
[ 14.561958] 0x000000001000-0x0000005f0000 : "i915.spi.12800.GSC"
[ 14.563112] 0x0000005f0000-0x0000007f0000 : "i915.spi.12800.OptionROM"
[ 14.563911] 0x0000007f0000-0x000000800000 : "i915.spi.12800.DAM"
[ 14.566461] Creating 4 MTD partitions on "i915.spi.44544":
[ 14.566462] 0x000000000000-0x000000001000 : "i915.spi.44544.DESCRIPTOR"
[ 14.567343] 0x000000001000-0x0000005f0000 : "i915.spi.44544.GSC"
[ 14.568113] 0x0000005f0000-0x0000007f0000 : "i915.spi.44544.OptionROM"
[ 14.569027] 0x0000007f0000-0x000000800000 : "i915.spi.44544.DAM"
[ 14.571783] Creating 4 MTD partitions on "i915.spi.45568":
[ 14.571784] 0x000000000000-0x000000001000 : "i915.spi.45568.DESCRIPTOR"
[ 14.572698] 0x000000001000-0x0000005f0000 : "i915.spi.45568.GSC"
[ 14.573443] 0x0000005f0000-0x0000007f0000 : "i915.spi.45568.OptionROM"
[ 14.574197] 0x0000007f0000-0x000000800000 : "i915.spi.45568.DAM"
[ 14.633640] mei_gsc i915.mei-gsc.11776: pm event: 0
[ 14.636196] mei_gsc i915.mei-gscfi.11776: pm event: 0
[ 14.636825] mei_gsc i915.mei-gsc.12800: pm event: 0
[ 14.639536] mei_gsc i915.mei-gscfi.12800: pm event: 0
[ 14.640154] mei_gsc i915.mei-gsc.44544: pm event: 0
[ 14.642659] mei_gsc i915.mei-gscfi.44544: pm event: 0
[ 14.643266] mei_gsc i915.mei-gsc.45568: pm event: 0
[ 14.645769] mei_gsc i915.mei-gscfi.45568: pm event: 0
[ 14.648041] mei_gsc i915.mei-gscfi.11776: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[ 14.648105] mei_gsc i915.mei-gscfi.11776: pm event: 67108864
[ 14.648824] mei_gsc i915.mei-gsc.11776: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[ 14.648864] mei_gsc i915.mei-gsc.11776: pm event: 67108864
[ 14.651285] mei_gsc i915.mei-gscfi.12800: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[ 14.651329] mei_gsc i915.mei-gscfi.12800: pm event: 67108864
[ 14.652070] mei_gsc i915.mei-gsc.12800: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[ 14.652112] mei_gsc i915.mei-gsc.12800: pm event: 67108864
[ 14.654424] mei_gsc i915.mei-gscfi.44544: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[ 14.654494] mei_gsc i915.mei-gscfi.44544: pm event: 67108864
[ 14.655214] mei_gsc i915.mei-gsc.44544: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[ 14.655260] mei_gsc i915.mei-gsc.44544: pm event: 67108864
[ 14.657763] mei_gsc i915.mei-gscfi.45568: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[ 14.657805] mei_gsc i915.mei-gscfi.45568: pm event: 67108864
[ 14.658548] mei_gsc i915.mei-gsc.45568: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[ 14.658587] mei_gsc i915.mei-gsc.45568: pm event: 67108864
[ 15.661956] i915 0000:2e:00.0: [drm] HuC authenticated
[ 15.661962] mei_pxp i915.mei-gsc.11776-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:2e:00.0 (ops i915_pxp_tee_component_ops [i915])
[ 15.690853] i915 0000:32:00.0: [drm] HuC authenticated
[ 15.690858] mei_pxp i915.mei-gsc.12800-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:32:00.0 (ops i915_pxp_tee_component_ops [i915])
[ 15.719822] i915 0000:ae:00.0: [drm] HuC authenticated
[ 15.719826] mei_pxp i915.mei-gsc.44544-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:ae:00.0 (ops i915_pxp_tee_component_ops [i915])
[ 15.747035] i915 0000:b2:00.0: [drm] HuC authenticated
[ 15.747039] mei_pxp i915.mei-gsc.45568-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:b2:00.0 (ops i915_pxp_tee_component_ops [i915])
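A long dmesg dump like the one above is easier to review when summarized. The sketch below lists the PCI addresses for which i915 reported successful initialization and counts the mei_gsc "unexpected reset" messages. Both function names are hypothetical, and the patterns simply match the message formats visible in this log, so they may need adjusting for other driver versions.

```shell
#!/bin/sh
# Hypothetical helper: print each PCI address that appears in an
# "Initialized i915 ... for <address> on minor N" kernel message.
i915_initialized_devices() {
    sed -n 's/.*Initialized i915 .* for \([0-9a-f:.]*\) on minor.*/\1/p' | sort -u
}

# Hypothetical helper: count mei_gsc "unexpected reset" kernel messages.
count_mei_resets() {
    grep -c 'mei_gsc .*unexpected reset'
}

# Typical use on the affected machine (not executed here):
#   dmesg | i915_initialized_devices
#   dmesg | count_mei_resets
```

A device that is missing from the first list never finished driver initialization. The reset count by itself is only a hint and says nothing definite about driver health.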
Hi @pengxin99, it looks like the GPU driver works well on your system. Please try our latest version: https://github.com/intel/xpumanager/releases/download/V1.2.6/xpu-smi-1.2.6-20230322.060757.98a08a04.el8.x86_64.rpm
It works with 1.2.6. Thanks a lot for your help.