intel / xpumanager

MIT License

Will ARC be supported? #74

Open nathanodle opened 6 months ago

nathanodle commented 6 months ago

There's currently no way to get most performance statistics on ARC GPUs. intel_gpu_top doesn't have memory usage, and while it appears xpu-smi has some metrics it's missing a lot on ARC.

I'm working on a multi-GPU ARC system and it's hard to troubleshoot certain things without knowing what the GPUs are doing outside of code.

Thanks!

fmiao2372 commented 6 months ago

XPU Manager mainly targets Intel data center GPU. For some of the missing metrics, please refer to issue 26. Which metrics are supported depends on the underlying HW, its FW, and the kernel + user-space drivers. Not every metric supported by XPU Manager is provided by all HW or all driver stacks.

eero-t commented 6 months ago

XPU Manager mainly targets Intel data center GPU.

While XPUM is validated only for those, it uses the Level Zero Sysman API to query the metrics: https://spec.oneapi.io/level-zero/latest/sysman/api.html

And Intel GPU L0 backend releases do list ARC (DG2) as having "production" level support: https://github.com/intel/compute-runtime/


PS. Release testing for the Sysman part of L0 still seems somewhat spotty: over the years I've noticed a couple of regressions, the latest one being https://github.com/intel/compute-runtime/issues/707

The user-space driver trying to support 3 Intel kernel GPU driver uAPIs at the same time may have something to do with it:

Driver releases are currently built with support for the first two uAPIs, but it's possible that the changes to support the last one could regress them. So in addition to the latest driver, one could also try one or two older ones, especially for HW that's been out for a while, like ARC.

eero-t commented 6 months ago

There's currently no way to get most performance statistics on ARC GPUs. intel_gpu_top doesn't have memory usage, and while it appears xpu-smi has some metrics it's missing a lot on ARC.

On a quick test with A770 (0x56a0) on TGL-H host, with GuC 70.8.0 FW, using "6.5.0-18-generic" HWE kernel (=upstream with Ubuntu patches) on Ubuntu 22.04.4 LTS distro, with compute-runtime "23.48.27912.11" (own build), I get following GPU metrics from the driver:

* Engine utilization

* Frequency

* Memory usage

* Power usage

(There may be some kernel DKMS drivers + user-space driver combo which would provide also GPU memory BW, temperature and maybe also error counters, but at least one of those will need out-of-band metrics kernel driver instead of GPU one.)

PS. I'm checking these with the tester in the corresponding compute-runtime version (after installing level-zero frontend devel package):

$ DRIVER_TAG=23.48.27912.11
$ wget --no-verbose https://raw.githubusercontent.com/intel/compute-runtime/$DRIVER_TAG/level_zero/tools/test/black_box_tests/zello_sysman.cpp
$ g++ -O2 -Wall -o zello_sysman zello_sysman.cpp -lze_loader
$ ./zello_sysman --engine --frequency --memory --temperature --ras --power

(--power needs to be last option as it has optional args.)
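
Since --power has to come last, a small wrapper can take care of the ordering. This is only a sketch: order_sysman_opts is a made-up helper name, not something shipped with compute-runtime, and it assumes every option (including --power) is passed without arguments.

```shell
# Print the given zello_sysman options with --power moved to the end.
# Hypothetical helper; assumes no option carries its own arguments.
order_sysman_opts() {
    ordered=""
    power=""
    for opt in "$@"; do
        if [ "$opt" = "--power" ]; then
            power=" --power"
        else
            ordered="$ordered $opt"
        fi
    done
    ordered="$ordered$power"
    # Strip the leading space left over from concatenation.
    printf '%s\n' "${ordered# }"
}
```

It could then be used as `./zello_sysman $(order_sysman_opts --power --engine --memory)`, with the command substitution left unquoted so the options are split back into separate words.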

QiXuanWang commented 4 months ago

This is a very much needed feature. The zello_sysman command provided is not that user-friendly.

eero-t commented 4 months ago

This is a very much needed feature. The zello_sysman command provided is not that user-friendly.

@QiXuanWang Just use XPUM then?

If zello_sysman shows some metric for your HW/FW/KMD/UMD combo, I do not see any reason why XPUM would not show it too with the same HW/SW stack...

qnixsynapse commented 3 months ago

There's currently no way to get most performance statistics on ARC GPUs. intel_gpu_top doesn't have memory usage, and while it appears xpu-smi has some metrics it's missing a lot on ARC.

On a quick test with A770 (0x56a0) on TGL-H host, with GuC 70.8.0 FW, using "6.5.0-18-generic" HWE kernel (=upstream with Ubuntu patches) on Ubuntu 22.04.4 LTS distro, with compute-runtime "23.48.27912.11" (own build), I get following GPU metrics from the driver:

* Engine utilization

* Frequency

* Memory usage

* Power usage

(There may be some kernel DKMS drivers + user-space driver combo which would provide also GPU memory BW, temperature and maybe also error counters, but at least one of those will need out-of-band metrics kernel driver instead of GPU one.)

PS. I'm checking these with the tester in the corresponding compute-runtime version (after installing level-zero frontend devel package):

$ DRIVER_TAG=23.48.27912.11
$ wget --no-verbose https://raw.githubusercontent.com/intel/compute-runtime/$DRIVER_TAG/level_zero/tools/test/black_box_tests/zello_sysman.cpp
$ g++ -O2 -Wall -o zello_sysman zello_sysman.cpp -lze_loader
$ zello_sysman --engine --frequency --memory --temperature --ras --power

(--power needs to be last option as it has optional args.)

I gave it a try... It seems RAS and temperature are currently not supported. Temperature metrics are such a needed feature, imo. I opened a report in the i915 kernel driver repository.

$ sudo ./zello_sysman --engine --frequency --memory --temperature --ras --power
ZES_ENABLE_SYSMAN environment variable Not Set
Setting the environment variable ZES_ENABLE_SYSMAN 
ZES_ENABLE_SYSMAN environment variable Set
Device Name = Intel(R) Arc(TM) A750 Graphics
UUID: 
134 128 161 86 8 0 0 0 3 0 0 0 0 0 0 0 
Sysman Initialization done via zeInit

 ----  Frequency tests ---- 
freqProperties.type = 0
freqProperties.canControl = 1
freqProperties.isThrottleEventSupported = 0
freqProperties.min = 300
freqProperties.max = 2400
freqState.currentVoltage = -1
freqState.request = 2400
freqState.tdp = -1
freqState.efficient = 600
freqState.actual = 2400
freqState.throttleReasons = 0
freqRange.min = 300
freqRange.max = 2400
 frequency = 300
...
...
 frequency = 2400
Setting Frequency Range . min 300
Setting Frequency Range . max 300
After Setting Getting Frequency Range . min 300
After Setting Getting Frequency Range . max 300
Setting Frequency Range . min 300
Setting Frequency Range . max 2400
After Setting Getting Frequency Range . min 300
After Setting Getting Frequency Range . max 2400

 ----  Engine tests ---- 
Device UUID: 
134 128 161 86 8 0 0 0 3 0 0 0 0 0 0 0 
[0]
Engine Type = ZES_ENGINE_GROUP_RENDER_SINGLE || Active Time = 0 || Timestamp = 427
Engine Type = ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE || Active Time = 0 || Timestamp = 417
Engine Type = ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE || Active Time = 0 || Timestamp = 401
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE || Active Time = 0 || Timestamp = 372
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE || Active Time = 0 || Timestamp = 336
Engine Type = ZES_ENGINE_GROUP_COPY_SINGLE || Active Time = 0 || Timestamp = 293
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE || Active Time = 0 || Timestamp = 241
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE || Active Time = 0 || Timestamp = 178
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[1]
....
....

 ----  Temperature tests ---- 
Could not retrieve Temperature domains

 ----  Power tests ---- 
properties.onSubdevice = 0
properties.subdeviceId = 0
properties.canControl = 1
properties.isEnergyThresholdSupported= 0
properties.defaultLimit= -1
properties.maxLimit =-1
properties.minLimit =-1
CurrentPower = 9.45378 W forrootDevice
CurrentPower = 8.96061 W forrootDevice
CurrentPower = 9.37615 W forrootDevice
CurrentPower = 9.49795 W forrootDevice
CurrentPower = 8.42323 W forrootDevice
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesPowerGetLimits(handle, &sustainedGetDefault, nullptr, &peakGetDefault): testSysmanPower: 336
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesPowerSetLimits(handle, &sustainedGetDefault, nullptr, &peakGetDefault): testSysmanPower: 338
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesPowerGetLimitsExt(handle, &limitCount, nullptr): getPowerLimits: 201
powerLimitDesc.count = 0

 ----  Memory tests ---- 
Memory Type = ZES_MEM_TYPE_DDR
On Subdevice = 0
Subdevice Id = 0
Memory Size = 0
Number of channels = -1
Memory Health = ZES_MEM_HEALTH_OK
The total allocatable memory in bytes = 8522825728
The free memory in bytes = 1785864192
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesMemoryGetBandwidth(handle, &memoryBandwidth): testSysmanMemory: 1061
Memory Read Counter = 0
Memory Write Counter = 0
Memory Maximum Bandwidth = 0
Memory Timestamp = 0

 ----  Ras tests ---- 
Could not retrieve Ras Error Sets

Also, power is a surprise here. Thankfully no more idle 30W power usage.
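
For anyone who finds the raw output hard to read, the memory and power lines can be boiled down with a bit of awk. This is a rough sketch that assumes the exact line formats shown in the log above ("The total allocatable memory in bytes = ...", "The free memory in bytes = ...", "CurrentPower = ... W ..."); other compute-runtime versions may print them differently. summarize_sysman is a hypothetical name.

```shell
# Summarize a zello_sysman log read from stdin:
# used memory (total - free) and the mean of the CurrentPower samples.
# Assumes the line formats seen in the output above.
summarize_sysman() {
    awk '
        /total allocatable memory in bytes/ { total = $NF }
        /free memory in bytes/              { free = $NF }
        /^CurrentPower/                     { sum += $3; n++ }
        END {
            # %.0f avoids integer overflow in awks with 32-bit %d
            if (total != "") printf "used_bytes=%.0f\n", total - free
            if (n > 0)       printf "mean_power_w=%.2f\n", sum / n
        }
    '
}
```

Typical use would be `sudo ./zello_sysman --memory --power | summarize_sysman`. Fields that are missing from the log are simply left out of the summary.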

eero-t commented 2 months ago

I gave it a try... It seems RAS and temperature are currently not supported. Temperature metrics are such a needed feature, imo. I opened a report in the i915 kernel driver repository.

Those are OoB (Out of Band) metrics, i.e. they are provided not by the i915 kernel (GPU) driver, but by the intel_pmt (PMT) driver.

I get temperature metrics for A770 both with drm-tip 6.9 kernel [1] with latest (self-built) compute-runtime release, and when using Ubuntu 6.5 HWE kernel with kernel DKMS package(s) from Intel repo [2].

[1] https://cgit.freedesktop.org/drm-tip/
[2] https://dgpu-docs.intel.com/
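
One way to see whether the intel_pmt driver mentioned above is actually exposing anything is to look for telemetry entries in sysfs. The sketch below makes assumptions: the default /sys/class/intel_pmt path and the telem* entry naming are taken from my reading of the upstream sysfs ABI documentation and may differ with DKMS builds, and count_pmt_telem is a hypothetical helper name.

```shell
# Count PMT telemetry entries under a sysfs class directory.
# Default path is an assumption based on the upstream intel_pmt ABI docs.
count_pmt_telem() {
    class_dir=${1:-/sys/class/intel_pmt}
    count=0
    for entry in "$class_dir"/telem*; do
        # An unmatched glob stays literal, so check the entry really exists.
        if [ -e "$entry" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

A result of 0 would suggest that the PMT telemetry driver is not loaded or found no devices, which would fit with temperature being unavailable; a non-zero count alone does not prove that the Sysman temperature query will work.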

qnixsynapse commented 2 months ago

Unfortunately, I still can't get temperature metrics even with the 6.9.5 kernel. I am using Arch Linux and I don't mind compiling a kernel with the patches that enable the metrics.

I get temperature metrics for A770 both with drm-tip 6.9 kernel [1] with latest (self-built) compute-runtime release, and when using Ubuntu 6.5 HWE kernel with kernel DKMS package(s) from Intel repo [2].

To me, this feels like the support is in the (i915) DRM kernel driver rather than the intel_pmt driver (unless the dkms driver from Intel repo adds a new intel_pmt driver). I am trying to find the commit which enables it.

Edit: And this is where it should have been, but it isn't.

eero-t commented 2 months ago

Unfortunately, I still can't get temperature metrics even with the 6.9.5 kernel. I am using Arch Linux and I don't mind compiling a kernel with the patches that enable the metrics.

Are these enabled in your kernel builds?

# grep PMT /boot/config-<kernelversion>
CONFIG_INTEL_PMT_CLASS=m
CONFIG_INTEL_PMT_TELEMETRY=m
CONFIG_INTEL_PMT_CRASHLOG=m
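
The same check can be wrapped in a small function that reports each option separately. This is a sketch only: check_pmt_config is a hypothetical helper name, and the option list is just the three options shown above. It only looks at the kernel config, so it says nothing about whether the modules are actually loaded.

```shell
# Report which PMT-related kernel options a config file enables (=y or =m).
# Returns non-zero if any of the three options is missing.
check_pmt_config() {
    config_file=$1
    status=0
    for opt in CONFIG_INTEL_PMT_CLASS CONFIG_INTEL_PMT_TELEMETRY CONFIG_INTEL_PMT_CRASHLOG; do
        if grep -Eq "^${opt}=[ym]$" "$config_file"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
            status=1
        fi
    done
    return "$status"
}
```

For example: `check_pmt_config /boot/config-"$(uname -r)"`. On distros that expose the config as /proc/config.gz instead (this depends on how the kernel was built), it can first be unpacked with `zcat /proc/config.gz > /tmp/config`.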

I get temperature metrics for A770 both with drm-tip 6.9 kernel [1] with latest (self-built) compute-runtime release, and when using Ubuntu 6.5 HWE kernel with kernel DKMS package(s) from Intel repo [2].

...(Unless the dkms driver from Intel repo adds new intel_pmt driver)....

It does:

# dpkg -L intel-i915-dkms | grep pmt
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/mfd/intel_pmt.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/Kconfig
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/Makefile
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/class.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/class.h
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/crashlog.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/intel_pmt_class.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/intel_pmt_class.h
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/intel_pmt_crashlog.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/intel_pmt_telemetry.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/telemetry.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/telemetry.h

I am trying to find the commit which enables it.

Note that drm-tip (which I'm using) is the upstream DRM infra integration tree. It gets DRM driver (i915 etc.) changes before they go to Linus' upstream tree.

qnixsynapse commented 2 months ago

Are these enabled in your kernel builds?

Yes:

$ grep PMT config
CONFIG_INTEL_PMT_CLASS=m
CONFIG_INTEL_PMT_TELEMETRY=m
CONFIG_INTEL_PMT_CRASHLOG=m

It does:

Hmm... I will try to build an Arch Linux package later. Thank you for your help!

sumseq commented 1 month ago

Just adding here that having ARC support would be greatly appreciated as I use an Arc card to develop on before trying to run on the MAX 1550. Or, maybe at least have plans to support BattleMage GPUs whenever they are released?

eero-t commented 3 weeks ago

Just adding here that having ARC support would be greatly appreciated

As commented above, XPUM should work fine with Arc. What metrics are available depends on what FW / kernel / L0 driver versions are installed.

as I use an Arc card to develop on before trying to run on the MAX 1550.

For Max, you need to use kernel and user-space drivers from Intel's driver repository: https://dgpu-docs.intel.com/driver/installation.html

Or, maybe at least have plans to support BattleMage GPUs whenever they are released?

They should also work with XPUM, as long as you have the correct kernel + user-space drivers installed.