Closed pbchekin closed 6 months ago
Please first use the zello_sysman tool to check whether the Sysman temperature API works at all. Most missing-metric issues are caused by the underlying driver or firmware.
wget https://raw.githubusercontent.com/intel/compute-runtime/releases/23.48/level_zero/tools/test/black_box_tests/zello_sysman.cpp
g++ -O2 -Wall -o zello_sysman zello_sysman.cpp -lze_loader
sudo ./zello_sysman --temperature
Please first use the zello_sysman tool to check whether the Sysman temperature API works at all.
$ sudo ./zello_sysman --temperature
ZES_ENABLE_SYSMAN environment variable Not Set
Setting the environment variable ZES_ENABLE_SYSMAN
ZES_ENABLE_SYSMAN environment variable Set
Device Name = Intel(R) Data Center GPU Max 1100
UUID: (raw bytes, not printable)
Sysman Initialization done via zeInit
---- Temperature tests ----
ZE_RESULT_ERROR_UNINITIALIZED returned by zesDeviceEnumTemperatureSensors(device, &count, nullptr): testSysmanTemperature: 448
Could not retrieve Temperature domains
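When re-running the tool after each change, it can help to check the captured output mechanically. The following is only a minimal sketch: `sysman_temperature_failed` is a hypothetical helper name, and it assumes the output was saved to a file and that failures always contain either a `ZE_RESULT_ERROR_` code or the "Could not retrieve Temperature domains" message seen above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the saved zello_sysman output
# in file $1 contains one of the failure markers seen in this thread.
sysman_temperature_failed() {
    grep -q -e 'ZE_RESULT_ERROR_' \
            -e 'Could not retrieve Temperature domains' "$1"
}
```

A possible use, assuming the output is captured with `sudo ./zello_sysman --temperature > out.log 2>&1`: run `sysman_temperature_failed out.log && echo "temperature API still failing"`.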
Try the following steps and re-run zello_sysman
1. add modprobe.blacklist=intel_pmt to the GRUB_CMDLINE_LINUX_DEFAULT line in the /etc/default/grub file
2. run the update-grub command
3. reboot
4. sudo ./zello_sysman --temperature
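Step 1 above can be scripted roughly as follows. This is a sketch under stated assumptions, not a tested procedure for every distribution: `add_grub_param` is a hypothetical helper, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT value sits on a single line wrapped in double quotes. Back up the file before trying it on a real system.

```shell
#!/bin/sh
# Hypothetical helper: append kernel parameter $2 to the
# GRUB_CMDLINE_LINUX_DEFAULT line of file $1 (normally /etc/default/grub).
# Assumes the value is on a single line and wrapped in double quotes.
add_grub_param() {
    file=$1
    param=$2
    # Do nothing if the parameter is already present on that line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote, via a temp file.
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

After calling `add_grub_param /etc/default/grub modprobe.blacklist=intel_pmt` as root, the remaining steps (update-grub, reboot, re-run zello_sysman) stay the same.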
I cannot reboot the server right now, but I removed intel_pmt with rmmod intel_pmt and re-ran zello_sysman. It shows the same result: "Could not retrieve Temperature domains".
I'm assuming this is with the (public) i915-backport (out-of-tree DKMS) kernel driver.
Which kernel and which GPU-related kernel driver (package) versions are you using?
(AFAIK temperature is out-of-band data and needs a kernel module separate from the main i915 kernel GPU driver. I think intel_pmt was the old name for it, but I do not remember what its replacement is. @fmiao2372?)
Which kernel and which GPU-related kernel driver (package) versions are you using?
Kernel: 5.15.0-97-generic #107-Ubuntu SMP
Driver: intel-i915-dkms 1.23.9.11.231003.15+i19-1
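For anyone else asked the same question, the two values can be collected with a few lines of shell. This is a sketch that assumes a Debian/Ubuntu host where the driver was installed as the intel-i915-dkms package; `report_versions` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the running kernel release and the installed
# intel-i915-dkms package version (Debian/Ubuntu hosts with dpkg only).
report_versions() {
    echo "Kernel: $(uname -r)"
    if command -v dpkg-query >/dev/null 2>&1; then
        # dpkg-query fails if the package is unknown; treat that as empty.
        ver=$(dpkg-query -W -f='${Version}' intel-i915-dkms 2>/dev/null) || ver=""
        echo "Driver: intel-i915-dkms ${ver:-not installed}"
    else
        echo "Driver: dpkg-query not available"
    fi
}
```

Calling `report_versions` prints one "Kernel:" line and one "Driver:" line in the same layout as the answer above.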
I tested a somewhat older multi-GPU setup, and temperature works with it.
Kernel / HW related components that setup had:
/proc/modules
"compat" line lists: mei_iaf,mei_gsc,pmt_telemetry,pmt_crashlog,pmt_class,i915,mei_me,intel_vsec,mei
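To compare another host against that module list, something like the sketch below could be used. Note the assumptions: `missing_modules` is a hypothetical helper, and the thread does not confirm which of these modules are actually required for temperature readings; the list is simply what the working setup had loaded.

```shell
#!/bin/sh
# Hypothetical helper: print every module name from the argument list that
# does not appear in the modules file $1 (normally /proc/modules, where the
# first whitespace-separated field of each line is the module name).
missing_modules() {
    modules_file=$1
    shift
    for mod in "$@"; do
        if ! awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' \
                "$modules_file"; then
            echo "$mod"
        fi
    done
}
```

For example, `missing_modules /proc/modules pmt_telemetry pmt_crashlog pmt_class intel_vsec i915` prints nothing when all five are loaded.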
I have no idea which driver version is "Agama 775.20" but my user-space driver stacks are built from public releases in: https://github.com/intel/
Both my latest build from a month ago:
And e.g. an older build from last May:
provide temperature data with zello_sysman --temperature.
=> Therefore I would expect any compute-runtime driver version between them (or newer) to work fine as well, as long as you have the required FW & kernel drivers installed. (compute-runtime provides the Sysman backend that supplies the GPU metrics.)
As you're using Ubuntu intel-i915-dkms, I assume you installed it from the Intel driver repos: https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps
The Sysman backend binary package in that repo is named intel-level-zero-gpu, and the XPU Manager command line tool package is named xpu-smi.
I have no idea which driver version is "Agama 775.20"
This is public rolling release: https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html
It looks like the temperature readings do not work with the driver from the public rolling release, but do work with the public LTS release. Closing this issue since it is more likely a driver issue.
I have no idea which driver version is "Agama 775.20"
This is public rolling release: https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html
It says "stable release" (not "rolling" one)?
The page does list package versions, but only the intel-opencl-icd OpenCL driver package version 23.35.27191.42 corresponds to compute-runtime project release numbers. The L0 / Sysman intel-level-zero-gpu driver package version does use the same 27191 sub-number, so I think it's built from the compute-runtime sources (as expected).
Version 23.35.27191.42 is somewhere between the compute-runtime versions that I had tested, so I would assume the Sysman part to be OK (unless a regression that was later fixed just happens to be in this release). I.e. the issue would be somewhere lower in the stack (kernel side).
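The sub-number comparison described above can be done mechanically. The sketch below assumes the build number is always the third dot-separated field of the package version, which matches 23.35.27191.42 but is otherwise an assumption; `build_number` and `same_build` are hypothetical helper names, and the intel-level-zero-gpu version in the usage note is a made-up example, since the thread does not quote the real one.

```shell
#!/bin/sh
# Hypothetical helper: print the third dot-separated field of a package
# version string, e.g. 27191 for 23.35.27191.42.
build_number() {
    printf '%s\n' "$1" | cut -d. -f3
}

# Hypothetical helper: succeed when two version strings share that field.
same_build() {
    [ "$(build_number "$1")" = "$(build_number "$2")" ]
}
```

For instance, `same_build 23.35.27191.42 1.3.27191.42` succeeds (the second version string is illustrative only). On a Debian/Ubuntu host the real strings can be read with `dpkg-query -W -f='${Version}' intel-opencl-icd` and the same command for intel-level-zero-gpu.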
It looks like the temperature readings do not work with the driver from the public rolling release, but do work with the public LTS release. Closing this issue since it is more likely a driver issue.
I'm sure it's not an XPUM issue, but unless you filed a separate driver bug, I think it could be discussed here.
Did you get temperature metrics working with some driver version?
(I'm interested what was not working, and what fixed it.)
It says "stable release" (not "rolling" one)?
This page https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps calls it "rolling stable release".
Did you get temperature metrics working with some driver version?
Yes, as I wrote previously the temperature metrics work with LTS driver.
(I'm interested what was not working, and what fixed it.)
Let's see if the issue will be fixed in the next rolling stable driver release.
Ah, this has links for them: https://dgpu-docs.intel.com/releases/index.html
Comparing the compute-runtime (Sysman) driver versions in them:
And i915 kernel DKMS versions:
=> LTS has the latest kernel & user-space driver versions...
Did you install the whole stack, including kernel DKMS?
Did you install the whole stack, including kernel DKMS?
I hope so. I just followed the installation instructions: apt install -y linux-headers-$(uname -r) flex bison intel-fw-gpu intel-i915-dkms xpu-smi
plus level-zero, compute, and media runtimes.
Ok, so it's unclear whether kernel or user-space side update fixed it.
(Especially as I had it working with components that were all older than ones in rolling stable, which were older than ones in LTS...)
Ok, so it's unclear whether kernel or user-space side update fixed it.
(Especially as I had it working with components that were all older than ones in rolling stable, which were older than ones in LTS...)
Yes, but since XPUM runs in a container with its own user-space environment (I have installed it as a Kubernetes DaemonSet), the only component on the host that matters is the driver, IMHO.
Ok, if XPUM is in a container, then indeed only the kernel driver matters. And you updated drivers only to the host, and did not touch XPUM container contents?
(In my tests, user-space drivers are built or installed to a container, not host.)
And you updated drivers only to the host, and did not touch XPUM container contents?
Yes
Thanks, then the issue was indeed the kernel driver.
Steps to reproduce:
Result:
Grafana shows metrics such as "GPU Utilization"
Grafana shows "No data" for "GPU Temperature"
GPU: Max 1100
Driver: Agama 775.20
xpumanager: 1.2.29