intel / xpumanager

XPU-SMI not working with A770 #50

Closed fredlarochelle closed 1 year ago

fredlarochelle commented 1 year ago

On Ubuntu 22.04 with kernel 5.19.0-41-generic and an Intel Arc A770, XPU-SMI is not working. Running xpu-smi stats -d 0 reports mostly empty fields, and the values it does report don't make sense. For example, GPU Memory Used doesn't match the values I get from IPEX (the two differ by more than an order of magnitude).

It's probably not a driver issue on my system: XPU Manager is somewhat working and I have no trouble with IPEX.

If there are no plans for more comprehensive Arc support in XPU Manager/XPU-SMI, are there any other tools from Intel that offer basic support for things like checking temperatures and memory usage? Also, not necessarily specific to XPU Manager, but more documentation in general would be useful. For example, the XPU Manager documentation is the only place I can find that refers to updating the device firmware. Is that something that needs to be done on Arc cards, or only on data center GPUs?

eero-t commented 1 year ago

XPUM reports metrics data provided by the HW -> FW -> KMD -> UMD GPU driver stack.

Is your KMD (i915 kernel driver) from your distribution kernel [1], or is it the DKMS version from the Intel drivers repository? https://dgpu-docs.intel.com/installation-guides/index.html

What about the user-space driver stack: does it come from the distribution repository or the Intel repository, or are you using the XPUM container version with its own user-space drivers?

[1] Even in v6.3, the upstream kernel is still missing some features that are in the Intel DKMS version. And if you are using the upstream 5.19 version with force-probing, please don't.
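
For anyone unsure how to answer the KMD question, the following is a minimal sketch of one way to check, not an official Intel procedure. It assumes that a DKMS-built i915 module is installed under a directory named "dkms" (for example updates/dkms) while a distribution module sits under kernel/drivers; that layout is common on Ubuntu but is an assumption here. The helper names classify_i915_path and installed_i915_path are hypothetical.

```python
import subprocess


def classify_i915_path(module_path: str) -> str:
    """Guess where an i915 module file came from, based on its install path."""
    # Assumption: DKMS-built modules land in a "dkms" directory, while
    # distribution modules live under "kernel/drivers".
    parts = module_path.split("/")
    if "dkms" in parts:
        return "dkms"
    if "kernel" in parts and "drivers" in parts:
        return "distribution"
    # Anything else (for example a built-in driver) is left undecided.
    return "unknown"


def installed_i915_path() -> str:
    """Ask modinfo for the file name of the i915 module for the running kernel."""
    result = subprocess.run(
        ["modinfo", "-n", "i915"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling classify_i915_path(installed_i915_path()) on the affected machine gives a first hint; an "unknown" result just means the path did not match either assumed layout and needs a manual look.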

taotod commented 1 year ago

@fredlarochelle With the public repository that Eero mentioned, the GPU power, temperature, frequency, GPU/GPU engine utilization, and GPU memory used all look good. I think they are helpful for tracking your GPU and GPU workload status.