Closed by fredlarochelle 1 year ago.
XPUM reports metrics data provided by the HW -> FW -> KMD -> UMD GPU driver stack.
Is your KMD (i915 kernel driver) from your distribution kernel [1], or is it the DKMS driver from the Intel drivers repository (https://dgpu-docs.intel.com/installation-guides/index.html)?
What about the user-space driver stack: does it come from the distribution repository or the Intel repository, or are you using the XPUM container version with its own user-space drivers?
[1] Even as of v6.3, the upstream kernel is still missing some features that are in the Intel DKMS driver. And if you are using upstream 5.19 with force-probing, please don't.
@fredlarochelle With the public repository that Eero mentioned, the GPU power, temperature, frequency, GPU/GPU-engine utilization, and GPU memory used metrics look good. They should help you track the status of your GPU and your GPU workloads.
On Ubuntu 22.04 (kernel 5.19.0-41-generic) with an Intel Arc A770, XPU-SMI is not working. It mostly reports empty fields when running
xpu-smi stats -d 0
and when it does report something, the values don't make sense. For example, GPU Memory Used doesn't match the values I get from IPEX (they differ by more than an order of magnitude). It's probably not a driver issue on my system: XPU Manager partly works and I have no trouble with IPEX.
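To make the empty-field symptom easier to report, here is a minimal sketch that runs `xpu-smi stats -d 0` and lists the metrics that came back blank. It assumes the output is an ASCII table whose data rows look like `| GPU Power (W) | 25 |`, with border lines starting with `+`; that layout is an assumption, not a documented format. The helper names `run_stats`, `parse_stats_table`, and `empty_fields` are hypothetical.

```python
import subprocess


def run_stats(device_id: int = 0) -> str:
    """Run `xpu-smi stats -d <id>` and return its raw stdout."""
    result = subprocess.run(
        ["xpu-smi", "stats", "-d", str(device_id)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_stats_table(text: str) -> dict[str, str]:
    """Parse '| name | value |' rows into a {name: value} dict.

    Assumption: data rows start with '|'; border rows ('+---+') are ignored,
    and continuation rows with an empty name cell are skipped.
    """
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border or non-table line
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 2 or not cells[0]:
            continue  # malformed row or continuation of the previous metric
        fields[cells[0]] = cells[1]
    return fields


def empty_fields(fields: dict[str, str]) -> list[str]:
    """Return the sorted names of metrics whose value cell is blank."""
    return sorted(name for name, value in fields.items() if value == "")
```

Usage on a machine with XPU-SMI installed: `print(empty_fields(parse_stats_table(run_stats(0))))`.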
If more comprehensive Arc support is not planned for XPU Manager/XPU-SMI, are there any other Intel tools that offer basic monitoring, such as checking temperatures and memory usage? Also, though not necessarily specific to XPU Manager, more documentation in general would be useful. For example, the XPU Manager documentation is the only place I can find that mentions updating the device firmware. Is that something that needs to be done on Arc cards, or only on data center GPUs?