intel / xpumanager

MIT License

xpu-smi dump -m 1,2,3,4,5 not reporting temperature. #37

Closed azunixguy closed 1 year ago

azunixguy commented 1 year ago

As a non-root user on an Ubuntu 22.04.2 LTS system, xpu-smi dump -m 1,2,3,4,5 does not report temperatures for the GPU or GPU memory. Refer to the output below.

xpu-smi -v
CLI:
    Version: 1.2.5.20230313
    Build ID: f458af77

Service:
    Version: 1.2.5.20230313
    Build ID: f458af77
    Level Zero Version: 1.8.8

randalls@DUT5169PVC: xpu-smi dump -m 1,2,3,4,5
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
18:10:25.000, 0, 265.02, 1600.00, , , 0.06
18:10:25.000, 1, 270.21, 1600.00, , , 0.06
18:10:26.000, 0, 265.16, 1600.00, , , 0.06
18:10:26.000, 1, 269.84, 1600.00, , , 0.06

fmiao2372 commented 1 year ago

Can you provide the output of dump and stats when run with root privileges?

azunixguy commented 1 year ago

Output with root privilege:

Last login: Mon Apr 10 17:16:45 2023 from 10.23.233.93
root@DUT5169PVC:~# xpu-smi dump -m 1,2,3,4,5
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
01:20:23.000, 0, 264.93, 1600.00, 43.50, 36.00, 0.06
01:20:23.000, 1, 269.64, 1600.00, 38.00, 31.00, 0.06
01:20:24.000, 0, 264.63, 1600.00, 43.50, 36.00, 0.06
01:20:24.000, 1, 269.98, 1600.00, 37.50, 31.00, 0.06
01:20:25.000, 0, 264.86, 1600.00, 43.50, 36.00, 0.06
01:20:25.000, 1, 269.96, 1600.00, 38.00, 31.50, 0.06
01:20:26.000, 0, 264.83, 1600.00, 43.50, 36.00, 0.06
01:20:26.000, 1, 269.69, 1600.00, 38.00, 31.50, 0.06
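For anyone scripting around this, here is a minimal sketch of how empty metric columns could be detected in captured dump output. It assumes the comma-separated layout shown in this thread; the function name `missing_metrics` and the embedded sample are illustrative only and are not part of xpu-smi.

```python
import csv
import io


def missing_metrics(dump_text):
    """Return the column names that have at least one empty value."""
    # skipinitialspace drops the blank that follows each comma in the dump
    reader = csv.reader(io.StringIO(dump_text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    missing = set()
    for row in reader:
        for name, value in zip(header, row):
            if value.strip() == "":
                missing.add(name)
    return missing


# Sample copied from the non-root output reported in this issue
sample = """Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
18:10:25.000, 0, 265.02, 1600.00, , , 0.06
18:10:25.000, 1, 270.21, 1600.00, , , 0.06
"""

for name in sorted(missing_metrics(sample)):
    print("no data for:", name)
```

Run against the non-root sample, this flags the two temperature columns; the root output above would yield an empty set.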

fmiao2372 commented 1 year ago

The permission for temperature is determined by Sysman API. So please make sure you have root permission when dumping GPU temperature. Thanks.

eero-t commented 1 year ago

> The permission for temperature is determined by Sysman API. So please make sure you have root permission when dumping GPU temperature. Thanks.

To be exact, Sysman just queries the data from the kernel and provides it to XPUM. The Linux kernel requires the root user for some of the metrics, the PERFMON (or SYS_ADMIN on old kernels) capability for some other metrics, and some metrics can be read as a normal user without extra capabilities (as long as one has write access to the GPU device).

I think capabilities other than PERFMON (or SYS_ADMIN) can be dropped before running XPUM, so that its process does not have all the power of the root user.
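As a rough illustration of the capability point, the sketch below reads the effective capability mask of the current process from /proc/self/status and reports whether CAP_PERFMON (bit 38) or CAP_SYS_ADMIN (bit 21) is set. The helper names are hypothetical. It only shows what the process holds; which metrics actually need which capability is decided by the kernel and is not spelled out in this thread.

```python
# Capability bit numbers from the Linux UAPI header linux/capability.h
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38  # available since Linux 5.8


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("CapEff line not found")


def has_cap(mask, cap):
    """True if the capability bit is set in the mask."""
    return bool(mask & (1 << cap))


def describe(mask):
    """Report the two capabilities mentioned in this thread."""
    return {
        "CAP_PERFMON": has_cap(mask, CAP_PERFMON),
        "CAP_SYS_ADMIN": has_cap(mask, CAP_SYS_ADMIN),
    }


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as status_file:
            print(describe(parse_cap_eff(status_file.read())))
    except (OSError, ValueError) as err:
        # Non-Linux systems have no /proc/self/status
        print("could not read capabilities:", err)
```

An unprivileged shell would typically report both as False, which matches the empty temperature fields in the non-root dump above.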

(The kernel requires the additional access rights for those metrics for security reasons.)