intel / xpumanager

meet XPUM Service Status Error on centos8(ATS-m3) #34

Closed pengxin99 closed 1 year ago

pengxin99 commented 1 year ago
xpumcli -v
Error: XPUM Service Status Error.

cat /proc/version
Linux version 5.15.47 (ubit@fm6pudocker160) (gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), GNU ld version 2.30-113.el8) #4485.el8 SMP Fri Feb 10 23:56:46 UTC 2023

xpumcli -h
Intel XPU Manager Command Line Interface -- v1.2
Intel XPU Manager Command Line Interface provides the Intel data center GPU model and monitoring capabilities. It can also be used to change the Intel data center GPU settings and update the firmware.
Intel XPU Manager is based on Intel oneAPI Level Zero. Before using Intel XPU Manager, the GPU driver and Intel oneAPI Level Zero should be installed rightly.

Supported devices:
  - Intel Data Center GPU

Usage: xpumcli [Options]
  xpumcli -v
  xpumcli -h
  xpumcli discovery

Options:
  -h,--help                   Print this help message and exit
  -v,--version                Display version information and exit.

Subcommands:
  discovery                   Discover the GPU devices installed on this machine and provide the device info.
  topology                    Get the system topology.
  group                       Group the managed GPU devices.
  diag                        Run some test suites to diagnose GPU.
  health                      Get the GPU device component health status.
  policy                      Get and set the GPU policies.
  updatefw                    Update GPU firmware
  config                      Get and change the GPU settings.
  topdown                     Expected feature.
  ps                          List status of processes.
  stats                       List the GPU aggregated statistics since last execution of this command or XPU Manager daemon is started.
  dump                        Dump device statistics data.
  log                         Collect GPU debug logs.
  agentset                    Get or change some XPU Manager settings.
  amcsensor                   List the AMC real-time sensor readings.

huiqiwa commented 1 year ago

@pengxin99 After installing xpumanager, the service takes one to two minutes to start. If the issue persists, please provide more detailed information. You can get it with the command "systemctl status xpum".
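
Because the service can take a minute or two to come up, one way to avoid checking too early is to poll its state for a bounded time. Below is a minimal bash sketch, assuming the unit is named `xpum` as in the command above. `wait_until` is a hypothetical helper written for this example and is not part of XPU Manager.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_S INTERVAL_S CMD...
# Hypothetical helper: reruns CMD until it succeeds or TIMEOUT_S seconds pass.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( SECONDS + timeout ))
    until "$@"; do
        if (( SECONDS >= deadline )); then
            return 1    # gave up: CMD never succeeded within the time limit
        fi
        sleep "$interval"
    done
    return 0
}

# Example (needs systemd, so it is left commented out here):
# wait_until 120 5 systemctl is-active --quiet xpum || systemctl status xpum --no-pager
```

If the unit is still not active after the timeout, the fallback prints the same status report requested above.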

pengxin99 commented 1 year ago

It still fails after two days. Detailed information:

[pengxin@scsa-spr-smc ~]$ sudo systemctl status xpum
● xpum.service - XPUM daemon
   Loaded: loaded (/usr/lib/systemd/system/xpum.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Fri 2023-03-17 19:09:19 CST; 2min 4s ago
  Process: 3604632 ExecStart=/usr/bin/xpumd -p /var/xpum_daemon.pid -d /usr/lib/xpum/dump (code=killed, signal=SEGV)
  Process: 3604629 ExecStartPre=/bin/sh -c ulimit -c unlimited (code=exited, status=0/SUCCESS)
 Main PID: 3604632 (code=killed, signal=SEGV)

Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: me: error: Cannot connect to client [-16]:Device or resource busy
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/igsc_lib.c:gsc_driver_init():213) Error in HECI connect (9)
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/ifr.c:gsc_gfsp_memory_errors():543) Cannot initialize driver, status 1
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/igsc_lib.c:gsc_tee_command():628) Error in HECI write (10)
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/ifr.c:igsc_gfsp_get_health_indicator():1723) Invalid HECI message response 1
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: me: error: Cannot connect to client [-16]:Device or resource busy
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/igsc_lib.c:gsc_driver_init():213) Error in HECI connect (9)
Mar 17 19:09:19 scsa-spr-smc xpumd[3604632]: IGSC: (/home/ubit/rpmbuild/BUILD/igsc-0.8.8+embargo/lib/ifr.c:gsc_gfsp_memory_errors():543) Cannot initialize driver, status 1
Mar 17 19:09:19 scsa-spr-smc systemd[1]: xpum.service: Main process exited, code=killed, status=11/SEGV
Mar 17 19:09:19 scsa-spr-smc systemd[1]: xpum.service: Failed with result 'signal'.

huiqiwa commented 1 year ago

Could you please provide the logs from journalctl?
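
A sketch of how such logs might be gathered and narrowed down, assuming the `xpum` unit name shown by `systemctl status`. `journalctl -u xpum -b --no-pager` prints that unit's messages for the current boot. `crash_lines` is a hypothetical filter written for this example; it keeps only the systemd lines that report how the main process ended.

```shell
#!/usr/bin/env bash
# crash_lines: hypothetical filter. Reads journal text on stdin and keeps the
# systemd lines reporting that the main process exited or the unit failed.
crash_lines() {
    grep -E 'Main process exited|Failed with result' || true
}

# Example (needs systemd, so it is left commented out here):
# journalctl -u xpum -b --no-pager | crash_lines
```

The full, unfiltered journal is still what a maintainer would most likely want attached. The filter only helps to spot the exit reason quickly.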

taotod commented 1 year ago

@pengxin99 It looks like the GPU driver is not working properly on your system. You can check it with the command "dmesg | grep i915". Where did you get the GPU driver for CentOS 8 Stream? Please note that the GPU driver for RHEL 8 does not work on CentOS 8 Stream.
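
The `dmesg | grep i915` output can be long, so a second filter may help to surface suspicious lines first. This is a minimal sketch. `driver_warnings` is a hypothetical helper, and its keyword list is an assumption rather than an official list of i915 failure markers, so a match is only a starting point and not proof of a broken driver.

```shell
#!/usr/bin/env bash
# driver_warnings: hypothetical filter. Reads kernel log text on stdin and
# keeps i915/mei lines containing words that often signal trouble.
# The keyword list is an assumption, not an official list of failure markers.
driver_warnings() {
    grep -E 'i915|mei_' | grep -Ei 'error|fail|timeout|unexpected reset' || true
}

# Example (reading the kernel log may require root):
# dmesg | driver_warnings
```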

pengxin99 commented 1 year ago

Hi, thanks @taotod, @huiqiwa. I installed the driver for RHEL 8, so is this issue expected? And where can I find the driver for CentOS 8? BTW, the i915 log is below:

[pengxin@scsa-spr-smc build]$ dmesg|grep i915
[    9.192608] i915 0000:2e:00.0: [drm] GT count: 1, enabled: 1
[    9.193633] i915 0000:2e:00.0: [drm] Using Transparent Hugepages
[    9.193926] i915 0000:2e:00.0: [drm] Local memory IO size: 0x000000013cc00000
[    9.193928] i915 0000:2e:00.0: [drm] Local memory available: 0x000000013cc00000
[    9.548209] i915 0000:2e:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[    9.548212] i915 0000:2e:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[    9.565662] i915 0000:2e:00.0: [drm] GuC submission enabled
[    9.565663] i915 0000:2e:00.0: [drm] GuC SLPC enabled
[    9.565986] i915 0000:2e:00.0: [drm] GuC RC: enabled
[    9.586595] i915 0000:2e:00.0: GT0: local0 bcs'0.0 clear bandwidth:56177 MB/s
[    9.605672] [drm] Initialized i915 1.6.0 20201103 for 0000:2e:00.0 on minor 0
[    9.782949] i915 0000:32:00.0: [drm] GT count: 1, enabled: 1
[    9.783996] i915 0000:32:00.0: [drm] Using Transparent Hugepages
[    9.784048] i915 0000:32:00.0: [drm] Local memory IO size: 0x000000013cc00000
[    9.784049] i915 0000:32:00.0: [drm] Local memory available: 0x000000013cc00000
[    9.788646] i915 0000:32:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[    9.788649] i915 0000:32:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[    9.802822] i915 0000:32:00.0: [drm] GuC submission enabled
[    9.802824] i915 0000:32:00.0: [drm] GuC SLPC enabled
[    9.803004] i915 0000:32:00.0: [drm] GuC RC: enabled
[    9.816203] i915 0000:32:00.0: GT0: local0 bcs'0.0 clear bandwidth:77117 MB/s
[    9.816400] [drm] Initialized i915 1.6.0 20201103 for 0000:32:00.0 on minor 2
[    9.820408] i915 0000:ae:00.0: [drm] GT count: 1, enabled: 1
[    9.821438] i915 0000:ae:00.0: [drm] Using Transparent Hugepages
[    9.821496] i915 0000:ae:00.0: [drm] Local memory IO size: 0x000000013cc00000
[    9.821497] i915 0000:ae:00.0: [drm] Local memory available: 0x000000013cc00000
[    9.826433] i915 0000:ae:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[    9.826437] i915 0000:ae:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[    9.840831] i915 0000:ae:00.0: [drm] GuC submission enabled
[    9.840833] i915 0000:ae:00.0: [drm] GuC SLPC enabled
[    9.841016] i915 0000:ae:00.0: [drm] GuC RC: enabled
[    9.853410] i915 0000:ae:00.0: GT0: local0 bcs'0.0 clear bandwidth:80109 MB/s
[    9.853666] [drm] Initialized i915 1.6.0 20201103 for 0000:ae:00.0 on minor 3
[    9.857624] i915 0000:b2:00.0: [drm] GT count: 1, enabled: 1
[    9.858587] i915 0000:b2:00.0: [drm] Using Transparent Hugepages
[    9.858656] i915 0000:b2:00.0: [drm] Local memory IO size: 0x000000013cc00000
[    9.858657] i915 0000:b2:00.0: [drm] Local memory available: 0x000000013cc00000
[    9.863360] i915 0000:b2:00.0: [drm] GuC firmware i915/dg2_guc_70.6.2.bin version 70.6.2
[    9.863363] i915 0000:b2:00.0: [drm] HuC firmware i915/dg2_huc_7.10.3_gsc.bin version 7.10.3
[    9.877842] i915 0000:b2:00.0: [drm] GuC submission enabled
[    9.877843] i915 0000:b2:00.0: [drm] GuC SLPC enabled
[    9.878024] i915 0000:b2:00.0: [drm] GuC RC: enabled
[    9.890516] i915 0000:b2:00.0: GT0: local0 bcs'0.0 clear bandwidth:92061 MB/s
[    9.890755] [drm] Initialized i915 1.6.0 20201103 for 0000:b2:00.0 on minor 4
[   14.546967] Creating 4 MTD partitions on "i915.spi.11776":
[   14.546970] 0x000000000000-0x000000001000 : "i915.spi.11776.DESCRIPTOR"
[   14.547797] 0x000000001000-0x0000005f0000 : "i915.spi.11776.GSC"
[   14.548568] 0x0000005f0000-0x0000007f0000 : "i915.spi.11776.OptionROM"
[   14.551425] 0x0000007f0000-0x000000800000 : "i915.spi.11776.DAM"
[   14.556456] Creating 4 MTD partitions on "i915.spi.12800":
[   14.556457] 0x000000000000-0x000000001000 : "i915.spi.12800.DESCRIPTOR"
[   14.561958] 0x000000001000-0x0000005f0000 : "i915.spi.12800.GSC"
[   14.563112] 0x0000005f0000-0x0000007f0000 : "i915.spi.12800.OptionROM"
[   14.563911] 0x0000007f0000-0x000000800000 : "i915.spi.12800.DAM"
[   14.566461] Creating 4 MTD partitions on "i915.spi.44544":
[   14.566462] 0x000000000000-0x000000001000 : "i915.spi.44544.DESCRIPTOR"
[   14.567343] 0x000000001000-0x0000005f0000 : "i915.spi.44544.GSC"
[   14.568113] 0x0000005f0000-0x0000007f0000 : "i915.spi.44544.OptionROM"
[   14.569027] 0x0000007f0000-0x000000800000 : "i915.spi.44544.DAM"
[   14.571783] Creating 4 MTD partitions on "i915.spi.45568":
[   14.571784] 0x000000000000-0x000000001000 : "i915.spi.45568.DESCRIPTOR"
[   14.572698] 0x000000001000-0x0000005f0000 : "i915.spi.45568.GSC"
[   14.573443] 0x0000005f0000-0x0000007f0000 : "i915.spi.45568.OptionROM"
[   14.574197] 0x0000007f0000-0x000000800000 : "i915.spi.45568.DAM"
[   14.633640] mei_gsc i915.mei-gsc.11776: pm event: 0
[   14.636196] mei_gsc i915.mei-gscfi.11776: pm event: 0
[   14.636825] mei_gsc i915.mei-gsc.12800: pm event: 0
[   14.639536] mei_gsc i915.mei-gscfi.12800: pm event: 0
[   14.640154] mei_gsc i915.mei-gsc.44544: pm event: 0
[   14.642659] mei_gsc i915.mei-gscfi.44544: pm event: 0
[   14.643266] mei_gsc i915.mei-gsc.45568: pm event: 0
[   14.645769] mei_gsc i915.mei-gscfi.45568: pm event: 0
[   14.648041] mei_gsc i915.mei-gscfi.11776: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[   14.648105] mei_gsc i915.mei-gscfi.11776: pm event: 67108864
[   14.648824] mei_gsc i915.mei-gsc.11776: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[   14.648864] mei_gsc i915.mei-gsc.11776: pm event: 67108864
[   14.651285] mei_gsc i915.mei-gscfi.12800: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[   14.651329] mei_gsc i915.mei-gscfi.12800: pm event: 67108864
[   14.652070] mei_gsc i915.mei-gsc.12800: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[   14.652112] mei_gsc i915.mei-gsc.12800: pm event: 67108864
[   14.654424] mei_gsc i915.mei-gscfi.44544: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[   14.654494] mei_gsc i915.mei-gscfi.44544: pm event: 67108864
[   14.655214] mei_gsc i915.mei-gsc.44544: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[   14.655260] mei_gsc i915.mei-gsc.44544: pm event: 67108864
[   14.657763] mei_gsc i915.mei-gscfi.45568: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[   14.657805] mei_gsc i915.mei-gscfi.45568: pm event: 67108864
[   14.658548] mei_gsc i915.mei-gsc.45568: unexpected reset: dev_state = ENABLED fw status = 00000355 84670000 00000000 00000000 E0020002 00000000
[   14.658587] mei_gsc i915.mei-gsc.45568: pm event: 67108864
[   15.661956] i915 0000:2e:00.0: [drm] HuC authenticated
[   15.661962] mei_pxp i915.mei-gsc.11776-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:2e:00.0 (ops i915_pxp_tee_component_ops [i915])
[   15.690853] i915 0000:32:00.0: [drm] HuC authenticated
[   15.690858] mei_pxp i915.mei-gsc.12800-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:32:00.0 (ops i915_pxp_tee_component_ops [i915])
[   15.719822] i915 0000:ae:00.0: [drm] HuC authenticated
[   15.719826] mei_pxp i915.mei-gsc.44544-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:ae:00.0 (ops i915_pxp_tee_component_ops [i915])
[   15.747035] i915 0000:b2:00.0: [drm] HuC authenticated
[   15.747039] mei_pxp i915.mei-gsc.45568-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:b2:00.0 (ops i915_pxp_tee_component_ops [i915])

taotod commented 1 year ago

Hi @pengxin99, it looks like the GPU driver is working properly on your system. Please try our latest version: https://github.com/intel/xpumanager/releases/download/V1.2.6/xpu-smi-1.2.6-20230322.060757.98a08a04.el8.x86_64.rpm
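
One possible way to fetch and install that package on an el8 system, as a sketch only. It assumes `curl` and `dnf` are available. Whether the previously installed package has to be removed first is not stated in this thread. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Release URL quoted in the comment above.
url='https://github.com/intel/xpumanager/releases/download/V1.2.6/xpu-smi-1.2.6-20230322.060757.98a08a04.el8.x86_64.rpm'

# Strip everything up to the last '/' to get the local file name.
rpm_file=${url##*/}

# Commented out because these steps need network access and root:
# curl -fLO "$url"                    # download into the current directory
# sudo dnf install -y "./$rpm_file"   # install the local RPM, resolving dependencies
```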

pengxin99 commented 1 year ago

It works with 1.2.6. Thanks a lot for your help.