intel / xpumanager

MIT License

GPU temperature is not reported by Prometheus exporter for Max 1100 #75

Closed pbchekin closed 6 months ago

pbchekin commented 6 months ago

Steps to reproduce:

Result:

GPU: Max 1100 Driver: Agama 775.20 xpumanager: 1.2.29

fmiao2372 commented 6 months ago

Please first use the zello_sysman tool to check whether the Sysman temperature API works. Most missing-metric issues are caused by the underlying driver or firmware.

wget https://raw.githubusercontent.com/intel/compute-runtime/releases/23.48/level_zero/tools/test/black_box_tests/zello_sysman.cpp
g++ -O2 -Wall -o zello_sysman zello_sysman.cpp -lze_loader
sudo ./zello_sysman --temperature

pbchekin commented 6 months ago

Please first use the zello_sysman tool to check whether the Sysman temperature API works.

$ sudo ./zello_sysman --temperature
ZES_ENABLE_SYSMAN environment variable Not Set
Setting the environment variable ZES_ENABLE_SYSMAN 
ZES_ENABLE_SYSMAN environment variable Set
Device Name = Intel(R) Data Center GPU Max 1100
UUID: (unprintable binary data)
Sysman Initialization done via zeInit

 ----  Temperature tests ---- 
ZE_RESULT_ERROR_UNINITIALIZED returned by zesDeviceEnumTemperatureSensors(device, &count, nullptr): testSysmanTemperature: 448
Could not retrieve Temperature domains
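For scripted checks, the failure shown in this output can be detected from the text the tool prints. The following is a minimal sketch, assuming the messages match the run quoted above; `classify_temp_output` is a hypothetical helper name and is not part of zello_sysman.

```shell
#!/bin/sh
# Hypothetical helper: classify `zello_sysman --temperature` output read from stdin.
# The matched strings are copied from the output quoted in this thread; other
# zello_sysman versions may word them differently (assumption).
classify_temp_output() {
    out=$(cat)
    case "$out" in
        *ZE_RESULT_ERROR_UNINITIALIZED*|*"Could not retrieve Temperature domains"*)
            echo "sysman-temperature-unavailable" ;;
        *)
            echo "no-known-error" ;;
    esac
}

# Usage (needs the compiled tool and a GPU):
#   sudo ./zello_sysman --temperature 2>&1 | classify_temp_output
```

A result of "no-known-error" only means that neither of the two known failure strings appeared; it does not prove that temperature values were actually returned.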

fmiao2372 commented 6 months ago

Try the following steps and re-run zello_sysman

1. Add modprobe.blacklist=intel_pmt to the GRUB_CMDLINE_LINUX_DEFAULT line in the /etc/default/grub file
2. Run the update-grub command
3. Reboot
4. Run sudo ./zello_sysman --temperature
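
The GRUB edit in step 1 can be scripted. This is a minimal sketch, assuming the file contains a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT="..." line; `add_kernel_param` is a hypothetical helper name, and it is safer to try it on a copy of the file first.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of a grub defaults file.
# Assumes the value is enclosed in double quotes on a single line.
add_kernel_param() {
    file=$1
    param=$2
    # Do nothing if the parameter is already present on that line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        return 0
    fi
    # Rewrite through a temporary file instead of relying on `sed -i`.
    sed "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"/" \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage (as root), followed by steps 2-4 above:
#   add_kernel_param /etc/default/grub modprobe.blacklist=intel_pmt
```

The helper is idempotent, so running it twice does not duplicate the parameter. Note that the temporary-file rewrite may not preserve the original file's ownership and permissions.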

pbchekin commented 6 months ago

I cannot reboot the server right now, but I did remove intel_pmt with rmmod intel_pmt and re-ran zello_sysman. It shows the same result: "Could not retrieve Temperature domains".
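
To confirm that the module is really gone after rmmod, the loaded-module list can be checked. This is a minimal sketch, assuming the usual /proc/modules layout with the module name in the first column; `module_loaded` is a hypothetical helper name. Whether unloading at runtime has the same effect as blacklisting the module at boot is not established in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a module name appears in the first column
# of a /proc/modules-style file (defaults to /proc/modules itself).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Usage:
#   if module_loaded intel_pmt; then echo "intel_pmt is still loaded"; fi
```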

eero-t commented 6 months ago

I'm assuming this is with the (public) i915-backport (out-of-tree DKMS) kernel driver.

Which kernel and which GPU-related kernel driver (package) versions are you using?

(AFAIK temperature is out-of-band data and needs a kernel module separate from the main i915 kernel GPU driver. I think intel_pmt was the old name for it, but I do not remember what its replacement is. @fmiao2372?)

pbchekin commented 6 months ago

Which kernel and which GPU-related kernel driver (package) versions are you using?

Kernel: 5.15.0-97-generic #107-Ubuntu SMP Driver: intel-i915-dkms 1.23.9.11.231003.15+i19-1

eero-t commented 6 months ago

I tested a somewhat older multi-GPU setup, and temperature works with it.

Kernel / HW-related components that the setup had:

I have no idea which driver version is "Agama 775.20" but my user-space driver stacks are built from public releases in: https://github.com/intel/

Both my latest build from a month ago:

And e.g. an older build from last May:

provide temperature data with zello_sysman --temperature.

=> Therefore I would expect any compute-runtime driver version between them (or newer) to work fine as well, as long as you have the required FW & kernel drivers installed. (compute-runtime provides the Sysman backend, which in turn provides the GPU metrics.)

As you're using the Ubuntu intel-i915-dkms package, I assume you installed it from the Intel driver repos: https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps

The Sysman backend binary package in that repo is named intel-level-zero-gpu, and the XPU Manager command line tool package is named xpu-smi.

pbchekin commented 6 months ago

I have no idea which driver version is "Agama 775.20"

This is public rolling release: https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html

pbchekin commented 6 months ago

Looks like the temperature readings do not work with the driver from the public rolling release, but do work with the public LTS release. Closing this issue since it is more likely a driver issue.

eero-t commented 6 months ago

I have no idea which driver version is "Agama 775.20"

This is public rolling release: https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html

It says "stable release" (not "rolling" one)?

The page does list package versions, but only the intel-opencl-icd OpenCL driver package version 23.35.27191.42 corresponds to compute-runtime project release numbers. The L0 / Sysman intel-level-zero-gpu driver package version does use the same 27191 sub-number, so I think it's built from the compute-runtime sources (as expected).

Version 23.35.27191.42 falls between the compute-runtime versions that I had tested, so I would assume the Sysman part is OK (unless a regression that was later fixed in between just happens to be in this release). I.e. the issue would be somewhere lower in the stack (kernel side).
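
The sub-number comparison can be done mechanically. This is a minimal sketch, assuming the build sub-number is always the third dot-separated field, as in 23.35.27191.42; `build_number` and `same_build` are hypothetical helper names, and the layout of other package versions is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the third dot-separated field of a version string,
# e.g. the 27191 in 23.35.27191.42.
build_number() {
    printf '%s\n' "$1" | cut -d. -f3
}

# Hypothetical helper: succeed if two version strings share that field.
same_build() {
    [ "$(build_number "$1")" = "$(build_number "$2")" ]
}

# Usage on a Debian/Ubuntu host (package names as listed in this thread):
#   same_build "$(dpkg-query -W -f='${Version}' intel-opencl-icd)" \
#              "$(dpkg-query -W -f='${Version}' intel-level-zero-gpu)" \
#       && echo "same compute-runtime build number"
```

A match only suggests that both packages were built from the same compute-runtime sources; it says nothing about the kernel side.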

Looks like the temperature readings do not work with the driver from the public rolling release, but do work with the public LTS release. Closing this issue since it is more likely a driver issue.

I'm sure it's not an XPUM issue, but unless you filed a separate driver bug, I think it could be discussed here.

Did you get temperature metrics working with some driver version?

(I'm interested in what was not working, and what fixed it.)

pbchekin commented 6 months ago

It says "stable release" (not "rolling" one)?

This page https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps calls it "rolling stable release".

Did you get temperature metrics working with some driver version?

Yes, as I wrote previously, the temperature metrics work with the LTS driver.

(I'm interested in what was not working, and what fixed it.)

Let's see if the issue will be fixed in the next rolling stable driver release.

eero-t commented 6 months ago

Ah, this has links for them: https://dgpu-docs.intel.com/releases/index.html

Comparing the compute-runtime (Sysman) driver versions in them:

And i915 kernel DKMS versions:

=> LTS has the latest kernel & user-space driver versions...

Did you install the whole stack, including kernel DKMS?

pbchekin commented 6 months ago

Did you install the whole stack, including kernel DKMS?

I hope so. I just followed the installation instructions: apt install -y linux-headers-$(uname -r) flex bison intel-fw-gpu intel-i915-dkms xpu-smi plus level-zero, compute, and media runtimes.

eero-t commented 6 months ago

Ok, so it's unclear whether kernel or user-space side update fixed it.

(Especially as I had it working with components that were all older than ones in rolling stable, which were older than ones in LTS...)

pbchekin commented 6 months ago

Ok, so it's unclear whether kernel or user-space side update fixed it.

(Especially as I had it working with components that were all older than ones in rolling stable, which were older than ones in LTS...)

Yes, but since XPUM works in a container with its own user-space environment (I have installed it as a Kubernetes DaemonSet), the only component on the host that matters is the driver, IMHO.

eero-t commented 6 months ago

Ok, if XPUM is in a container, then indeed only the kernel driver matters. And you updated the drivers only on the host, and did not touch the XPUM container contents?

(In my tests, user-space drivers are built or installed to a container, not host.)

pbchekin commented 6 months ago

And you updated the drivers only on the host, and did not touch the XPUM container contents?

Yes

eero-t commented 6 months ago

Thanks, then the issue was indeed the kernel driver.