intel / xpumanager

fail to change ARC 770 frequency. #72

Closed: zhewang1-intc closed this issue 6 months ago

zhewang1-intc commented 6 months ago

Hi, I am trying to use this tool to limit the GPU's frequency to a specified range. I installed the v1.2.29 deb package (xpumanager_1.2.29_20240201.035533.2b2f658d.u22.04_amd64.deb) on my machine. When I ran xpumcli discovery, I got the error "Error: XPUM Service Status Error". I then checked the state of the xpum service and got:

 systemctl status xpum
× xpum.service - XPUM daemon
     Loaded: loaded (/lib/systemd/system/xpum.service; enabled; vendor preset: enabled)
     Active: failed (Result: signal) since Mon 2024-02-05 06:49:38 UTC; 22min ago
    Process: 9781 ExecStartPre=/bin/sh -c ulimit -c unlimited (code=exited, status=0/SUCCESS)
    Process: 9782 ExecStart=/usr/bin/xpumd -p /var/xpum_daemon.pid -d /usr/lib/xpum/dump (code=killed, signal=FPE)
   Main PID: 9782 (code=killed, signal=FPE)
        CPU: 150ms

Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.593] [I] [9782-9782] Level Zero:        1.15.0
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.593] [I] [9782-9782] xpumd core starts to initialize
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.593] [I] [9782-9782] initialize configuration
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.593] [I] [9782-9782] xpum mode: xpum
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.594] [I] [9782-9782] initialize datalogic
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.594] [I] [9782-9782] initialize device manager
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.675] [W] [9782-9816] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Memory Temperatur>
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.676] [W] [9782-9816] Capability Memory Temperature detection returned:
Feb 05 06:49:38 DUT001DG2SVC systemd[1]: xpum.service: Main process exited, code=killed, status=8/FPE
Feb 05 06:49:38 DUT001DG2SVC systemd[1]: xpum.service: Failed with result 'signal'.

If I run xpumd directly instead, the process is not killed, but it logs a number of warnings and errors:

xpumd
[2024-02-05 07:14:11.566] [I] [15258-15258] XPUM: Init xpum library
[2024-02-05 07:14:11.566] [I] [15258-15258] XPU Manager:        1.2.28.20240118
[2024-02-05 07:14:11.566] [I] [15258-15258] Build:              89af66d7
[2024-02-05 07:14:11.566] [I] [15258-15258] Level Zero: 1.15.0
[2024-02-05 07:14:11.566] [I] [15258-15258] xpumd core starts to initialize
[2024-02-05 07:14:11.566] [I] [15258-15258] initialize configuration
[2024-02-05 07:14:11.566] [I] [15258-15258] xpum mode: xpum
[2024-02-05 07:14:11.566] [I] [15258-15258] initialize datalogic
[2024-02-05 07:14:11.566] [I] [15258-15258] initialize device manager
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no GPU Temperature capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability GPU Temperature detection returned: [toGetTemperature:1827] zesTemperatureGetState:0x70020000
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Memory Temperature capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Memory Temperature detection returned: 
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Memory Throughput and Bandwidth capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Memory Throughput and Bandwidth detection returned: [toGetMemoryThroughputAndBandwidth:1954] zesMemoryGetBandwidth:0x70020000
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no GPU Utilization capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability GPU Utilization detection returned: 
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Engine Utilization capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Engine Utilization detection returned: 
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Ras Error capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Ras Error detection returned: toGetRasErrorOnSubdevice error
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Frequency Throttle capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Frequency Throttle detection returned: [toGetFrequencyThrottle:1680] zesFrequencyGetThrottleTime:0x78000003
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no fabric throughput capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability fabric throughput detection returned: fabric port not found
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Compute Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Media Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Copy Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Render Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no 3D Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [I] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has the following monitoring metric types: power, energy, frequency, request frequency, throttle reason, media engine frequency.
[2024-02-05 07:14:11.730] [I] [15258-15258] initialize health manager
[2024-02-05 07:14:11.730] [I] [15258-15258] initialize group manager
[2024-02-05 07:14:11.735] [I] [15258-15258] initialize diagnostic manager
[2024-02-05 07:14:11.735] [I] [15258-15258] initialize policy manager
[2024-02-05 07:14:11.735] [I] [15258-15258] initialize dump raw data manager
[2024-02-05 07:14:11.736] [I] [15258-15258] initialize firmware manager
[2024-02-05 07:14:11.736] [E] [15258-15289] Fail to get SoC fw version from device: /dev/mei2
[2024-02-05 07:14:11.736] [I] [15258-15258] IpmiAmcManager preInit
[2024-02-05 07:14:11.736] [E] [15258-15258] Unable to open /dev/ipmi0. errno: 2(No such file or directory)

[2024-02-05 07:14:11.736] [I] [15258-15258] IpmiAmcManager can not find AMC device
[2024-02-05 07:14:11.737] [I] [15258-15258] SMCRedfishAmcManager preInit
[2024-02-05 07:14:11.739] [I] [15258-15258] fail to parse redfish host interface
[2024-02-05 07:14:11.739] [I] [15258-15258] initialize monitor manager
[2024-02-05 07:14:11.739] [I] [15258-15258] xpumd core initialization completed
[2024-02-05 07:14:11.739] [I] [15258-15258] xpumd is providing services
[2024-02-05 07:14:11.739] [I] [15258-15258] XPUM: start XPUM RPC Server.
[2024-02-05 07:14:11.739] [I] [15258-15258] XPUM: start RPC server ...
[2024-02-05 07:14:11.741] [I] [15258-15258] XPUM: RPC server is listening at /tmp/xpum_p.sock
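
The hex values in the warnings above (0x70020000 and 0x78000003) are raw Level Zero result codes. A minimal sketch of a lookup for them follows. The helper name describeZeResult is hypothetical, and the numeric values are quoted from memory of the ze_result_t enum in ze_api.h, so please verify them against the header installed on your system before relying on them.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: map a raw Level Zero result code, as printed in the
// xpumd log, to its enumerator name. Only a handful of codes are listed.
// ASSUMPTION: the values below are recalled from ze_api.h (ze_result_t);
// check them against your installed header.
std::string describeZeResult(std::uint32_t code) {
    switch (code) {
        case 0x0:        return "ZE_RESULT_SUCCESS";
        case 0x70010000: return "ZE_RESULT_ERROR_INSUFFICIENT_PERMISSIONS";
        case 0x70010001: return "ZE_RESULT_ERROR_NOT_AVAILABLE";
        case 0x70020000: return "ZE_RESULT_ERROR_DEPENDENCY_UNAVAILABLE";
        case 0x78000001: return "ZE_RESULT_ERROR_UNINITIALIZED";
        case 0x78000003: return "ZE_RESULT_ERROR_UNSUPPORTED_FEATURE";
        default:         return "unknown (look it up in ze_api.h)";
    }
}
```

If those values are right, the zesFrequencyGetThrottleTime warning (0x78000003) would mean the feature is unsupported on this device, while the temperature and bandwidth warnings (0x70020000) would point at an unavailable dependency, which may be related to running without root.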

By the way, if I run xpumd with sudo, it crashes:

sudo xpumd
[2024-02-05 07:14:07.813] [I] [15219-15219] XPUM: Init xpum library
[2024-02-05 07:14:07.813] [I] [15219-15219] XPU Manager:        1.2.28.20240118
[2024-02-05 07:14:07.813] [I] [15219-15219] Build:              89af66d7
[2024-02-05 07:14:07.813] [I] [15219-15219] Level Zero: 1.15.0
[2024-02-05 07:14:07.813] [I] [15219-15219] xpumd core starts to initialize
[2024-02-05 07:14:07.813] [I] [15219-15219] initialize configuration
[2024-02-05 07:14:07.813] [I] [15219-15219] xpum mode: xpum
[2024-02-05 07:14:07.813] [I] [15219-15219] initialize datalogic
[2024-02-05 07:14:07.813] [I] [15219-15219] initialize device manager
[2024-02-05 07:14:07.893] [W] [15219-15234] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Memory Temperature capability.
[2024-02-05 07:14:07.893] [W] [15219-15234] Capability Memory Temperature detection returned: 
Floating point exception (core dumped)

Still, once xpumd is running, xpumcli does return some useful information:

sudo xpumcli config -d 0 -t 0
+-------------+-------------------+----------------------------------------------------------------+
| Device Type | Device ID/Tile ID | Configuration                                                  |
+-------------+-------------------+----------------------------------------------------------------+
| GPU         | 0                 | Power Limit (w): 190                                           |
|             |                   |   Valid Range: 1 to 0                                          |
|             |                   |                                                                |
|             |                   | Memory ECC:                                                    |
|             |                   |   Current: N/A                                                 |
|             |                   |   Pending: N/A                                                 |
+-------------+-------------------+----------------------------------------------------------------+
| GPU         | 0/0               | GPU Min Frequency (MHz): 300                                   |
|             |                   | GPU Max Frequency (MHz): 2400                                  |
|             |                   |   Valid Options: 300, 350, 400, 450, 500, 550, 600, 650, 700,  |
|             |                   |     750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200,     |
|             |                   |     1250, 1300, 1350, 1400, 1450, 1500, 1550, 1600, 1650,      |
|             |                   |     1700, 1750, 1800, 1850, 1900, 1950, 2000, 2050, 2100,      |
|             |                   |     2150, 2200, 2250, 2300, 2350, 2400                         |
|             |                   |                                                                |
|             |                   | Standby Mode: default                                          |
|             |                   |   Valid Options: default, never                                |
|             |                   |                                                                |
|             |                   | Scheduler Mode: timeslice                                      |
|             |                   |   Timeout (us): N/A                                            |
|             |                   |   Interval (us): 5000                                          |
|             |                   |   Yield Timeout (us): 640000                                   |
|             |                   |                                                                |
|             |                   | Engine Type: compute                                           |
|             |                   |   Performance Factor: N/A                                      |
|             |                   | Engine Type: media                                             |
|             |                   |   Performance Factor: 50                                       |
|             |                   |                                                                |
|             |                   | Xe Link ports:                                                 |
|             |                   |   Up: N/A                                                      |
|             |                   |   Down: N/A                                                    |
|             |                   |   Beaconing On: N/A                                            |
|             |                   |   Beaconing Off: N/A                                           |
+-------------+-------------------+----------------------------------------------------------------+

But when I pass a frequency range, xpumcli fails with an error that gives no hint about the cause:

sudo xpumcli config -d 0 -t 0 --frequencyrange 2400,2400
Error: Error
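
Since the error message carries no detail, one thing worth ruling out is the argument itself. The sketch below parses a "min,max" string and checks both values against the "Valid Options" list printed in the config table above (300 to 2400 MHz in 50 MHz steps). All function names here are hypothetical and this is not xpumcli's own validation logic, only an illustration of the check.

```cpp
#include <algorithm>
#include <optional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: rebuild the "Valid Options" list shown in the
// config table (300..2400 MHz, 50 MHz steps).
std::vector<int> optionsFromTable() {
    std::vector<int> options;
    for (int mhz = 300; mhz <= 2400; mhz += 50) options.push_back(mhz);
    return options;
}

// Parse "min,max" (e.g. "2400,2400"); returns nullopt on malformed input.
std::optional<std::pair<int, int>> parseFrequencyRange(const std::string& text) {
    std::istringstream in(text);
    int lo = 0, hi = 0;
    char comma = 0;
    if (!(in >> lo >> comma >> hi) || comma != ',') return std::nullopt;
    // Reject trailing characters after the second number.
    if (in.peek() != std::char_traits<char>::eof()) return std::nullopt;
    return std::make_pair(lo, hi);
}

// Returns an empty string when the range is acceptable, otherwise a reason.
std::string checkFrequencyRange(int lo, int hi, const std::vector<int>& valid) {
    auto isValid = [&](int mhz) {
        return std::find(valid.begin(), valid.end(), mhz) != valid.end();
    };
    if (lo > hi) return "min frequency exceeds max frequency";
    if (!isValid(lo)) return "min frequency is not a valid option";
    if (!isValid(hi)) return "max frequency is not a valid option";
    return "";
}
```

Under these assumptions, "2400,2400" parses cleanly and passes the check, which suggests the failure is not caused by the requested values themselves.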

fmiao2372 commented 6 months ago

Please build the zello_sysman tool to check which Sysman API returns unexpected values or errors that XPUM doesn't handle properly.

wget https://raw.githubusercontent.com/intel/compute-runtime/releases/23.48/level_zero/tools/test/black_box_tests/zello_sysman.cpp
g++ -O2 -Wall -o zello_sysman zello_sysman.cpp -lze_loader -locloc
sudo ./zello_sysman --memory
sudo ./zello_sysman --temperature
sudo ./zello_sysman --frequency

fmiao2372 commented 6 months ago

The zello_sysman output shows that the driver reports a memory maximum bandwidth of 0. That is the root cause of the crash, and we will fix it in the next release. Once XPUM is started with root privileges, the frequency can be changed successfully.

----  Memory tests ---- 
Memory Type = ZES_MEM_TYPE_DDR
On Subdevice = 0
Subdevice Id = 0
Memory Size = 0
Number of channels = 2
Memory Health = ZES_MEM_HEALTH_UNKNOWN
The total allocatable memory in bytes = 17079205888
The free memory in bytes = 17010581504
Memory Read Counter = 17389969126208
Memory Write Counter = 342828039872
Memory Maximum Bandwidth = 0
Memory Timestamp = 18503820255
----  Temperature tests ---- 
For subDevice 0 temperature current state for ZES_TEMP_SENSORS_GLOBAL is: 50
For subDevice 0 temperature current state for ZES_TEMP_SENSORS_GPU is: 50
----  Frequency tests ---- 
freqProperties.type = 0
freqProperties.canControl = 1
freqProperties.isThrottleEventSupported = 0
freqProperties.min = 300
freqProperties.max = 2400
freqState.currentVoltage = -1
freqState.request = 2400
freqState.tdp = 0
freqState.efficient = 2100
freqState.actual = 2400
freqState.throttleReasons = 0
freqRange.min = 300
freqRange.max = 2400
frequency = 300
frequency = 350
frequency = 400
frequency = 450
frequency = 500
frequency = 550
frequency = 600
frequency = 650
frequency = 700
frequency = 750
frequency = 800
frequency = 850
frequency = 900
frequency = 950
frequency = 1000
frequency = 1050
frequency = 1100
frequency = 1150
frequency = 1200
frequency = 1250
frequency = 1300
frequency = 1350
frequency = 1400
frequency = 1450
frequency = 1500
frequency = 1550
frequency = 1600
frequency = 1650
frequency = 1700
frequency = 1750
frequency = 1800
frequency = 1850
frequency = 1900
frequency = 1950
frequency = 2000
frequency = 2050
frequency = 2100
frequency = 2150
frequency = 2200
frequency = 2250
frequency = 2300
frequency = 2350
frequency = 2400
Setting Frequency Range . min 300
Setting Frequency Range . max 300
After Setting Getting Frequency Range . min 300
After Setting Getting Frequency Range . max 300
Setting Frequency Range . min 300
Setting Frequency Range . max 2400
After Setting Getting Frequency Range . min 300
After Setting Getting Frequency Range . max 2400
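
To illustrate how a maximum bandwidth of 0 can end in signal=FPE: on x86 Linux an integer division by zero raises SIGFPE, so any utilization calculation that divides by the reported maximum would kill the process. The sketch below shows one way to guard such a calculation. The function name and parameters are hypothetical, and this is an assumption about the kind of code involved, not XPUM's actual implementation or its fix.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: memory bandwidth utilization in percent.
//   bytesMoved           - delta of read + write counters between two samples
//   elapsedMicroseconds  - delta of the sample timestamps
//   maxBandwidthBytesSec - maximum bandwidth reported by the driver
// Returns nullopt instead of dividing when a denominator is zero, which is
// the situation shown above ("Memory Maximum Bandwidth = 0").
std::optional<double> bandwidthUtilizationPercent(std::uint64_t bytesMoved,
                                                  std::uint64_t elapsedMicroseconds,
                                                  std::uint64_t maxBandwidthBytesSec) {
    if (maxBandwidthBytesSec == 0 || elapsedMicroseconds == 0) {
        return std::nullopt;  // metric unavailable; caller can report N/A
    }
    // Floating point avoids overflow in bytesMoved * 1e6 for large counters.
    const double bytesPerSecond =
        static_cast<double>(bytesMoved) * 1e6 / static_cast<double>(elapsedMicroseconds);
    return 100.0 * bytesPerSecond / static_cast<double>(maxBandwidthBytesSec);
}
```

A caller would treat nullopt as "capability not available" rather than crashing, for example bandwidthUtilizationPercent(500, 1000000, 0) yields no value.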

zhewang1-intc commented 6 months ago

Thanks Intel XPU team!