pavelis closed this issue 3 years ago
Anything in the dmesg when this "freeze" happens?
What happens if you try writing a custom value to /sys/class/drm/card1/device/pp_od_clk_voltage and then committing the changes manually with c?
In general, the amdgpu-pro driver causes many more issues than the in-kernel driver, and most of the time they have nothing to do with amdgpu-clocks. If you really want to understand what is happening, get ready to dive deeper into the amdgpu-pro DKMS sysfs shenanigans.
Anything in the dmesg when this "freeze" happens?
Nothing
What happens if you try writing a custom value to /sys/class/drm/card1/device/pp_od_clk_voltage and then committing the changes manually with c?
If I try to change anything there, I get: Error writing /sys/class/drm/card1/device/pp_od_clk_voltage: Invalid argument
In general, the amdgpu-pro driver causes many more issues than the in-kernel driver, and most of the time they have nothing to do with amdgpu-clocks. If you really want to understand what is happening, get ready to dive deeper into the amdgpu-pro DKMS sysfs shenanigans.
But the in-kernel driver doesn't support compute mode, AFAIK.
If I try change anything there I get: Error writing /sys/class/drm/card1/device/pp_od_clk_voltage: Invalid argument
Care to elaborate? Which particular command(s) do you submit to /sys/class/drm/card1/device/pp_od_clk_voltage when you get this error message? Are you running multiple cards? Are you 100% sure your RX5700XT is card1 and not card0?
100%.
sudo nano /sys/class/drm/card1/device/pp_od_clk_voltage
I got the same result on another unit with Ubuntu Server 20.04.2, amdgpu 20.40 (not pro) and an RX580 (the settings for each card are different). But when I open pp_od_clk_voltage for it (sudo nano /sys/class/drm/card0/device/pp_od_clk_voltage), I immediately get a message: Error writing lock file /sys/class/drm/card0/device/.pp_od_clk_voltage.swp: Permission denied
I had the exact same problem on 20.40, so I installed 20.30 instead. Works just fine on that version.
But when I open pp_od_clk_voltage for it (sudo nano /sys/class/drm/card0/device/pp_od_clk_voltage), I immediately get a message: Error writing lock file /sys/class/drm/card0/device/.pp_od_clk_voltage.swp: Permission denied
What does stat /sys/class/drm/card0/device/pp_od_clk_voltage on that system show?
This issue is about the out-of-tree amdgpu kernel driver; it has nothing to do with amdgpu-clocks itself. I am willing to check the kernel sources that you point me to and try to understand why this is happening, but the only way to know is to dig deeper into the driver code. I can't promise anything...
I have the same "Invalid argument" issue, but pp_od_clk_voltage is actually modified successfully and the GPU (Radeon VII) runs at the modified frequency and voltage. Running it as a service works as well.
My system is 20.04, kernel 5.4.0-58-generic, AMDGPU version 5.6.19.
I changed only two parameters; /etc/default/amdgpu-custom-states.card0 is written as:
OD_SCLK:
1: 1600MHz
OD_VDDC_CURVE:
2: 1600MHz 900mV
@YWangScience please share the full console output when invoking the script.
@pavelis you cannot edit /sys/class/drm/cardX/device/pp_od_clk_voltage with nano or any other editor; that does not make much sense. This file is part of the kernel sysfs interface: you can only read the current driver settings from it or send parameters to the driver through it, for example:
echo "vc 0 800MHz 730mV" > /sys/class/drm/card1/device/pp_od_clk_voltage
echo c > /sys/class/drm/card1/device/pp_od_clk_voltage
You need to determine which particular parameter sent to your card's sysfs file is getting the driver stuck.
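One way to narrow this down is to send the commands one at a time and note which write the driver rejects. The following is only a minimal bash sketch of that idea: `try_od_commands` is a hypothetical helper (it is not part of amdgpu-clocks), and the example values are simply the ones quoted above, not recommendations for any particular card.

```shell
#!/usr/bin/env bash
# Sketch: write OverDrive commands one by one and report which ones fail.
# try_od_commands is a hypothetical helper, not part of amdgpu-clocks.
# Usage: try_od_commands TARGET_FILE "command 1" "command 2" ...
try_od_commands() {
    local target=$1; shift
    local failed=0 cmd
    for cmd in "$@"; do
        # The braces let 2>/dev/null silence both write and redirection errors.
        if { echo "$cmd" > "$target"; } 2>/dev/null; then
            echo "OK:     $cmd"
        else
            echo "FAILED: $cmd"
            failed=$((failed + 1))
        fi
    done
    # The return status is the number of rejected commands (0 means all accepted).
    return "$failed"
}

# Example (run as root; adjust the card number and values for your system):
# try_od_commands /sys/class/drm/card1/device/pp_od_clk_voltage \
#     "vc 0 800MHz 730mV" "c"
```

A command that hangs rather than failing will still block the loop, so this only helps with writes that return an error such as "Invalid argument".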
I have two test systems:
First: Ubuntu Desktop 20.04.2, kernel 5.4.0-54-generic
$ dpkg --list | grep amdgpu
ii amdgpu-core 20.40-1147286 all Core meta package for unified amdgpu driver.
ii amdgpu-dkms 1:5.6.14.224-1147286 all amdgpu driver in DKMS format.
ii amdgpu-dkms-firmware 1:5.6.14.224-1147286 all firmware blobs used by amdgpu driver in DKMS format
ii amdgpu-pin 20.40-1147286 all Meta package to pin a specific amdgpu driver version.
ii amdgpu-pro-core 20.40-1147286 all Core meta package for Pro components of the unified amdgpu driver.
ii amdgpu-pro-pin 20.40-1147286 all Meta package to pin a specific amdgpu pro driver version.
ii clinfo-amdgpu-pro 20.40-1147286 amd64 AMD OpenCL info utility
ii libdrm-amdgpu-amdgpu1:amd64 1:2.4.100-1147286 amd64 Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii libdrm-amdgpu-common 1.0.0-1147286 all List of AMD/ATI cards' device IDs, revision IDs and marketing names
ii libdrm-amdgpu1:amd64 2.4.104+git2101120630.10dd3e~oibaf~f amd64 Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii libdrm2-amdgpu:amd64 1:2.4.100-1147286 amd64 Userspace interface to kernel DRM services -- runtime
ii ocl-icd-libopencl1-amdgpu-pro:amd64 20.40-1147286 amd64 AMD OpenCL ICD Loader library
ii opencl-amdgpu-pro-comgr 20.40-1147286 amd64 non-free AMD OpenCL ICD Loaders
ii opencl-amdgpu-pro-icd 20.40-1147286 amd64 non-free AMD OpenCL ICD Loaders
ii opencl-orca-amdgpu-pro-icd:amd64 20.40-1147286 amd64 non-free AMD OpenCL ICD Loaders
ii xserver-xorg-video-amdgpu 19.1.0-1 amd64 X.Org X server -- AMDGPU display driver
An RX 5700 XT was installed in it.
Second: Ubuntu Server 20.04.2, kernel 5.4.0-54-generic
$ dpkg --list | grep amdgpu
ii amdgpu-core 20.40-1147286 all Core meta package for unified amdgpu driver.
iF amdgpu-dkms 1:5.6.14.224-1147286 all amdgpu driver in DKMS format.
ii amdgpu-dkms-firmware 1:5.6.14.224-1147286 all firmware blobs used by amdgpu driver in DKMS format
ii amdgpu-pin 20.40-1147286 all Meta package to pin a specific amdgpu driver version.
ii amdgpu-pro-core 20.40-1147286 all Core meta package for Pro components of the unified amdgpu driver.
ii amdgpu-pro-pin 20.40-1147286 all Meta package to pin a specific amdgpu pro driver version.
ii clinfo-amdgpu-pro 20.40-1147286 amd64 AMD OpenCL info utility
ii libdrm-amdgpu-amdgpu1:amd64 1:2.4.100-1147286 amd64 Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii libdrm-amdgpu-common 1.0.0-1147286 all List of AMD/ATI cards' device IDs, revision IDs and marketing names
ii libdrm2-amdgpu:amd64 1:2.4.100-1147286 amd64 Userspace interface to kernel DRM services -- runtime
ii ocl-icd-libopencl1-amdgpu-pro:amd64 20.40-1147286 amd64 AMD OpenCL ICD Loader library
ii opencl-amdgpu-pro-comgr 20.40-1147286 amd64 non-free AMD OpenCL ICD Loaders
ii opencl-amdgpu-pro-icd 20.40-1147286 amd64 non-free AMD OpenCL ICD Loaders
ii opencl-orca-amdgpu-pro-icd:amd64 20.40-1147286 amd64 non-free AMD OpenCL ICD Loaders
An RX 580 is installed there. After a series of experiments installing different versions, I installed AMDGPU-Pro 20.40 again and, voilà, the settings were applied!
The difference between these systems is iF in amdgpu-dkms (not configured). Maybe the problem is here?
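A quick way to spot a package in such a state is to filter the `dpkg --list` output for any status other than `ii`. This is a small sketch only; `list_unconfigured` is a hypothetical helper name, and it assumes the usual layout where the first column is the status flag and the second is the package name.

```shell
#!/usr/bin/env bash
# Sketch: read `dpkg --list` output on stdin and print every package whose
# status flag (first column) is anything other than "ii".
# list_unconfigured is a hypothetical helper name.
list_unconfigured() {
    awk 'length($1) <= 3 && $1 ~ /^[a-zA-Z]+$/ && $1 != "ii" { print $1, $2 }'
}

# Example:
# dpkg --list | grep amdgpu | list_unconfigured
```

On the second system above this would be expected to print the `iF amdgpu-dkms` line and nothing else.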
@YWangScience please share the full console output when invoking the script.
ok, here is the output
sudo amdgpu-clocks
Won't write initial state to /tmp/amdgpu-custom-states.card0.initial, it already exists.
Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
SCLK state 0: 808Mhz
SCLK state 1: 1600Mhz
MCLK state 1: 1000Mhz
VDDC Curve state 0: 808Mhz 714mV
VDDC Curve state 1: 1304Mhz 811mV
VDDC Curve state 2: 1600Mhz 900mV
Maximum clocks & voltages:
SCLK clock 2200Mhz
MCLK clock 1200Mhz
Curent power cap: 250W
Verifying user state values at /etc/default/amdgpu-custom-states.card0:
SCLK state 1: 1600MHz
VDDC Curve state 2: 1600MHz 900mV
Committing custom states to /sys/class/drm/card0/device/pp_od_clk_voltage:
/usr/local/bin/amdgpu-clocks: line 147: echo: write error: Invalid argument
ERROR: echo vc 0 808 714 > /sys/class/drm/card0/device/pp_od_clk_voltage
Done
The difference between these systems is iF in amdgpu-dkms (not configured). Maybe the problem is here?
Likely. amdgpu-dkms left unconfigured probably means that the "pro" driver module was not compiled and loaded on boot. You can check it out with modinfo amdgpu | grep "file\|vers\|magic\|desc". I guess this solves your issue then, I will be closing it...
@YWangScience so, this is your issue:
echo vc 0 808 714 > /sys/class/drm/card0/device/pp_od_clk_voltage
Very likely you'd get the same message if you ran that command manually, as root. The error means that the driver/SMU does not accept the specified values, which is weird because these same values are obviously the card's HW defaults. I've seen this before: some cards have inconsistent power, clock & voltage limits set in their PowerPlay table, meaning that the defaults work fine, but if you try to manually set the same voltage as the default, the SMU will reject it, because the default value is either under the min or over the max limit!
If you really care about getting to the bottom of this you can try experimenting with different voltages in the command above, starting with 750 or even 800. Or you could try checking and setting those voltage limits with https://github.com/sibradzic/upp.
But regardless, I wouldn't worry too much, amdgpu-clocks actually works for you pretty well...
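For anyone who does want to experiment, the voltage probing described above could be sketched roughly as follows. `first_accepted_voltage` is a hypothetical helper, the candidate voltages are just the ones mentioned in this comment, and the sketch deliberately never sends the commit command `c`. Treat it as an illustration under those assumptions rather than a tested procedure for any specific card.

```shell
#!/usr/bin/env bash
# Sketch: try a list of voltages for one VDDC curve point and print the first
# value the driver accepts. Nothing is committed: "c" is never written.
# first_accepted_voltage is a hypothetical helper, not part of amdgpu-clocks.
# Usage: first_accepted_voltage TARGET_FILE STATE CLOCK MILLIVOLTS...
first_accepted_voltage() {
    local target=$1 state=$2 clock=$3 mv
    shift 3
    for mv in "$@"; do
        if { echo "vc $state $clock $mv" > "$target"; } 2>/dev/null; then
            echo "$mv"
            return 0
        fi
    done
    return 1   # none of the candidates was accepted
}

# Example (run as root), using the clock from the failing command above:
# first_accepted_voltage /sys/class/drm/card0/device/pp_od_clk_voltage 0 808 750 800
```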
System with ii amdgpu-dkms:
filename: /lib/modules/5.4.0-54-generic/updates/dkms/amdgpu.ko
version: 5.6.16.20.40
description: AMD GPU
srcversion: 533BB7E5866E52F63B9ACCB
vermagic: 5.4.0-54-generic SMP mod_unload
parm: hws_gws_support:Assume MEC2 FW supports GWS barriers (false = rely on FW version check (Default), true = force supported) (bool)
System with iF amdgpu-dkms:
filename: /lib/modules/5.4.0-54-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
description: AMD GPU
srcversion: 533BB7E5866E52F63B9ACCB
vermagic: 5.4.0-54-generic SMP mod_unload
I suppose the solution would be to use amdgpu-pro, but with the in-kernel amdgpu driver.
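The two outputs above differ mainly in the module path, so the check can be reduced to looking at the `filename` field. A minimal sketch follows, assuming the same path layout as on these two systems (`updates/dkms` for the DKMS build, `kernel/drivers` for the in-kernel one); `classify_amdgpu_module` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: classify an amdgpu module path as DKMS-built or in-kernel, based on
# the path layout seen in the two modinfo outputs above (an assumption that
# may not hold on other distributions).
# classify_amdgpu_module is a hypothetical helper name.
classify_amdgpu_module() {
    case $1 in
        */updates/dkms/*)   echo "dkms" ;;
        */kernel/drivers/*) echo "in-kernel" ;;
        *)                  echo "unknown" ;;
    esac
}

# Example:
# classify_amdgpu_module "$(modinfo -F filename amdgpu)"
```

Note that modinfo reports the module file that would be picked up from disk, which is not necessarily the one currently loaded.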
@sibradzic Thank you for your prompt and detailed explanation. As you said, your program works smoothly on my device, so I am not going to tinker with the parameters anymore. Thumbs up for amdgpu-clocks.
I suppose, the solution would be to use amdgpu-pro, but with in-kernel amdgpu driver.
Yes, the upstream amdgpu driver seems to have a much more stable pp_od_clk_voltage (aka OverDrive) interface.
The amdgpu-pro amdgpu driver seems to break it from time to time, but you may get different results with it depending on which underlying kernel version you compile it against. The kernel 5.4 + amdgpu-pro 20.40 combination seems to end up with a broken OverDrive interface, but you could try compiling it against 5.8 or newer if you feel adventurous...
I found a solution that works great for me: 20.04's latest kernel 5.4.0-66 + AMDGPU-Pro 20.45. I will also try the 5.10 kernel with the upstream driver in the future. My current modules:
$ modinfo amdgpu | grep "file\|vers\|magic\|desc"
filename: /lib/modules/5.4.0-66-generic/updates/dkms/amdgpu.ko
version: 5.6.20.20.45
description: AMD GPU
srcversion: 533BB7E5866E52F63B9ACCB
vermagic: 5.4.0-66-generic SMP mod_unload
$ dpkg --list | grep amdgpu-pro
ii amdgpu-pro-core 20.45-1188099 all Core meta package for Pro components of the unified amdgpu driver.
ii amdgpu-pro-pin 20.45-1188099 all Meta package to pin a specific amdgpu pro driver version.
ii amdgpu-pro-rocr-opencl 20.45-1188099 amd64 Meta package to install ROCm OpenCL Pro components.
ii clinfo-amdgpu-pro 20.45-1188099 amd64 AMD OpenCL info utility
ii comgr-amdgpu-pro:amd64 1.7.0-1188099 amd64 Development files for ROCm ROCm Code Object Manager
ii hip-rocr-amdgpu-pro 20.45-1188099 amd64 ROCr HIP Clang Runtime
ii ocl-icd-libopencl1-amdgpu-pro:amd64 20.45-1188099 amd64 AMD OpenCL ICD Loader library
ii opencl-rocr-amdgpu-pro:amd64 20.45-1188099 amd64 ROCr OpenCL Runtime
@sibradzic Thank you for your help and amdgpu-clocks that I continue to use!
I've upgraded the amdgpu-pro driver from 20.20 to 20.40. After that, amdgpu-clocks freezes while committing custom states:
GPU: Gigabyte RX 5700 XT
OS: Ubuntu 20.04.2
Kernel: 5.4.0-54-generic
amdgpu-pro: 20.40-1147286
amdgpu.ppfeaturemask=0xffffffff is set in GRUB