sibradzic / amdgpu-clocks

Simple script to control power states of amdgpu driven GPUs
GNU General Public License v2.0

Can't commit states to /sys/class/drm/card1/device/pp_od_clk_voltage after amdgpu-pro upgrade to 20.40 #28

Closed pavelis closed 3 years ago

pavelis commented 3 years ago

I upgraded the amdgpu-pro driver from 20.20 to 20.40. After that, amdgpu-clocks freezes while committing custom states:

Detecting the state values at /sys/class/drm/card1/device/pp_od_clk_voltage:
  SCLK state 0: 800Mhz
  SCLK state 1: 2039Mhz
  MCLK state 1: 875MHz
  VDDC Curve state 0: 800MHz 706mV
  VDDC Curve state 1: 1419MHz 820mV
  VDDC Curve state 2: 2039MHz 1198mV
  Maximum clocks & voltages:
    SCLK clock 2150Mhz
    MCLK clock 950Mhz
  Curent power cap: 200W
Verifying user state values at /etc/default/amdgpu-custom-states.card1:
  SCLK state 1: 1300MHz
  MCLK state 1: 890MHz
  VDDC Curve state 0: 800MHz @ 730mV
  VDDC Curve state 1: 1300MHz @ 780mV
  VDDC Curve state 2: 2079MHz @ 1188mV
  Force power cap to 140W
  Force performance level to manual
Committing custom states to /sys/class/drm/card1/device/pp_od_clk_voltage:
^C

GPU: Gigabyte RX 5700 XT
OS: Ubuntu 20.04.2
Kernel: 5.4.0-54-generic
amdgpu-pro: 20.40-1147286
amdgpu.ppfeaturemask=0xffffffff is set in GRUB
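
To confirm that the ppfeaturemask setting actually reached the loaded driver, a minimal sketch follows. It assumes the parameter is exposed at /sys/module/amdgpu/parameters/ppfeaturemask and that the OverDrive bit is 0x4000 (PP_OVERDRIVE_MASK in the kernel headers, as far as I know). Neither detail comes from this thread, and od_bit_set is a hypothetical helper name.

```shell
#!/bin/sh
# Return success if the OverDrive bit is set in a ppfeaturemask value.
# ASSUMPTION: 0x4000 is PP_OVERDRIVE_MASK; verify against your kernel headers.
od_bit_set() {
    # $1 may be hex (0xffffffff) or decimal (4294967295)
    [ $(( $1 & 0x4000 )) -ne 0 ]
}

# ASSUMPTION: the amdgpu module exposes its parameter at this path.
param=/sys/module/amdgpu/parameters/ppfeaturemask
if [ -r "$param" ]; then
    value=$(cat "$param")
    if od_bit_set "$value"; then
        echo "OverDrive bit set in ppfeaturemask ($value)"
    else
        echo "OverDrive bit NOT set in ppfeaturemask ($value)"
    fi
fi
```

If the file is missing, the script prints nothing, which most likely means the amdgpu module is not loaded.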

sibradzic commented 3 years ago

Anything in the dmesg when this "freeze" happens? What happens if you try writing any custom value in the /sys/class/drm/card1/device/pp_od_clk_voltage and then committing the changes manually with c?

In general, the amdgpu-pro driver causes many more issues than the in-kernel driver, and most of the time they have nothing to do with amdgpu-clocks. If you really want to understand what is happening, get ready to dive deeper into the amdgpu-pro DKMS sysfs shenanigans.

pavelis commented 3 years ago

Anything in the dmesg when this "freeze" happens?

Nothing

What happens if you try writing any custom value in the /sys/class/drm/card1/device/pp_od_clk_voltage and then committing the changes manually with c?

If I try to change anything there, I get: Error writing /sys/class/drm/card1/device/pp_od_clk_voltage: Invalid argument

In general, the amdgpu-pro driver causes many more issues than the in-kernel driver, and most of the time they have nothing to do with amdgpu-clocks. If you really want to understand what is happening, get ready to dive deeper into the amdgpu-pro DKMS sysfs shenanigans.

But the in-kernel driver doesn't support compute mode, AFAIK.

sibradzic commented 3 years ago

If I try to change anything there, I get: Error writing /sys/class/drm/card1/device/pp_od_clk_voltage: Invalid argument

Care to elaborate? Which particular command(s) do you submit to /sys/class/drm/card1/device/pp_od_clk_voltage when you get that error message? Are you running multiple cards? Are you 100% sure your RX5700XT is card1 and not card0?
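
One way to check which cardN entry belongs to the AMD GPU is to read the PCI vendor ID that sysfs exposes for each card. The sketch below assumes the usual /sys/class/drm/cardN/device/vendor layout and filters on 0x1002, the AMD/ATI PCI vendor ID. list_amd_cards is a hypothetical helper, not part of amdgpu-clocks.

```shell
#!/bin/sh
# Print each DRM card whose PCI vendor ID is AMD/ATI (0x1002).
# The root directory is a parameter so the function can be pointed at a test tree.
list_amd_cards() {
    root=${1:-/sys/class/drm}
    for dir in "$root"/card[0-9]*; do
        name=${dir##*/}
        case $name in *-*) continue ;; esac   # skip connectors such as card0-DP-1
        [ -r "$dir/device/vendor" ] || continue
        vendor=$(cat "$dir/device/vendor")
        [ "$vendor" = "0x1002" ] || continue
        device=$(cat "$dir/device/device" 2>/dev/null)
        echo "$name vendor=$vendor device=$device"
    done
}

list_amd_cards
```

On a machine with an integrated GPU plus a discrete AMD card, only the AMD entry should be listed, which shows whether the custom states file should be named for card0 or card1.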

pavelis commented 3 years ago

If I try to change anything there, I get: Error writing /sys/class/drm/card1/device/pp_od_clk_voltage: Invalid argument

Care to elaborate? Which particular command(s) do you submit to /sys/class/drm/card1/device/pp_od_clk_voltage when you get that error message? Are you running multiple cards? Are you 100% sure your RX5700XT is card1 and not card0?

100%

sudo nano /sys/class/drm/card1/device/pp_od_clk_voltage

I got the same result on another unit with Ubuntu Server 20.04.2, amdgpu 20.40 (not pro) and RX580 (settings for each card are different). But when I open pp_od_clk_voltage for it (sudo nano /sys/class/drm/card0/device/pp_od_clk_voltage), I immediately get a message: Error writing lock file /sys/class/drm/card0/device/.pp_od_clk_voltage.swp: Permission denied

BitwiseMaster commented 3 years ago

I had the exact same problem on 20.40, so I installed 20.30 instead. Works just fine on that version.

sibradzic commented 3 years ago

But when I open pp_od_clk_voltage for it (sudo nano /sys/class/drm/card0/device/pp_od_clk_voltage), I immediately get a message: Error writing lock file /sys/class/drm/card0/device/.pp_od_clk_voltage.swp: Permission denied

  1. What does stat /sys/class/drm/card0/device/pp_od_clk_voltage on that system show?
  2. What does it show for card1 RX5700XT on your other system?
  3. Can you point me to a particular amdgpu-pro DKMS deb package that installed 20.40 driver for you?

This issue is about the out-of-tree amdgpu kernel driver; it has nothing to do with amdgpu-clocks itself. I am willing to check the kernel sources that you point me to and try to understand why this is happening. The only way to find out is to dig deeper into the driver code. I can't promise anything...

YWangScience commented 3 years ago

I have the same issue of "Invalid argument", but pp_od_clk_voltage is actually modified successfully and the GPU (Radeon VII) runs at the modified frequency and voltage. Running it as a service works as well. My system is 20.04, kernel 5.4.0-58-generic, AMDGPU version 5.6.19.

I changed only two parameters. /etc/default/amdgpu-custom-states.card0 is written as:

OD_SCLK:
1: 1600MHz
OD_VDDC_CURVE:
2: 1600MHz   900mV

sibradzic commented 3 years ago

@YWangScience please share the full console output when invoking the script.

sibradzic commented 3 years ago

@pavelis you cannot edit /sys/class/drm/cardX/device/pp_od_clk_voltage with nano or any other editor; that does not make much sense. This file is part of the kernel sysfs interface. You can only read the current driver settings from it or send parameters to the driver through it, for example:

echo "vc 0 800MHz 730mV" > /sys/class/drm/card1/device/pp_od_clk_voltage
echo c > /sys/class/drm/card1/device/pp_od_clk_voltage

You need to determine which particular parameter that you are sending to your card's sysfs file is making the driver stuck.
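
A rough way to narrow this down is to send the commands one at a time and note which write fails or stalls. The sketch below is only an illustration: probe_od_writes is a hypothetical helper, the 5-second limit is arbitrary, and timeout(1) cannot interrupt a process stuck in uninterruptible sleep inside the driver.

```shell
#!/bin/sh
# Send one OverDrive command per write and report which ones the driver rejects.
# Usage: probe_od_writes <sysfs-file> <command>...
probe_od_writes() {
    target=$1
    shift
    for cmd in "$@"; do
        # timeout(1) guards against a write that never returns
        if timeout 5 sh -c 'printf "%s\n" "$1" > "$2"' sh "$cmd" "$target" 2>/dev/null; then
            echo "OK:     $cmd"
        else
            echo "FAILED: $cmd"
        fi
    done
}

# Example (as root), using the values from the custom states file quoted earlier:
# probe_od_writes /sys/class/drm/card1/device/pp_od_clk_voltage \
#     "vc 0 800MHz 730mV" "vc 1 1300MHz 780mV" "vc 2 2079MHz 1188mV" "c"
```

The first command reported as FAILED, or the last one printed before the script hangs, is the one to investigate.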

pavelis commented 3 years ago

I have two test systems:

First: Ubuntu Desktop 20.04.2, kernel 5.4.0-54-generic

$ dpkg --list | grep amdgpu
ii  amdgpu-core                                20.40-1147286                         all          Core meta package for unified amdgpu driver.
ii  amdgpu-dkms                                1:5.6.14.224-1147286                  all          amdgpu driver in DKMS format.
ii  amdgpu-dkms-firmware                       1:5.6.14.224-1147286                  all          firmware blobs used by amdgpu driver in DKMS format
ii  amdgpu-pin                                 20.40-1147286                         all          Meta package to pin a specific amdgpu driver version.
ii  amdgpu-pro-core                            20.40-1147286                         all          Core meta package for Pro components of the unified amdgpu driver.
ii  amdgpu-pro-pin                             20.40-1147286                         all          Meta package to pin a specific amdgpu pro driver version.
ii  clinfo-amdgpu-pro                          20.40-1147286                         amd64        AMD OpenCL info utility
ii  libdrm-amdgpu-amdgpu1:amd64                1:2.4.100-1147286                     amd64        Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii  libdrm-amdgpu-common                       1.0.0-1147286                         all          List of AMD/ATI cards' device IDs, revision IDs and marketing names
ii  libdrm-amdgpu1:amd64                       2.4.104+git2101120630.10dd3e~oibaf~f  amd64        Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii  libdrm2-amdgpu:amd64                       1:2.4.100-1147286                     amd64        Userspace interface to kernel DRM services -- runtime
ii  ocl-icd-libopencl1-amdgpu-pro:amd64        20.40-1147286                         amd64        AMD OpenCL ICD Loader library
ii  opencl-amdgpu-pro-comgr                    20.40-1147286                         amd64        non-free AMD OpenCL ICD Loaders
ii  opencl-amdgpu-pro-icd                      20.40-1147286                         amd64        non-free AMD OpenCL ICD Loaders
ii  opencl-orca-amdgpu-pro-icd:amd64           20.40-1147286                         amd64        non-free AMD OpenCL ICD Loaders
ii  xserver-xorg-video-amdgpu                  19.1.0-1                              amd64        X.Org X server -- AMDGPU display driver

There was an RX 5700 XT installed on it.

Second: Ubuntu Server 20.04.2, kernel 5.4.0-54-generic

$ dpkg --list | grep amdgpu
ii  amdgpu-core                          20.40-1147286                     all          Core meta package for unified amdgpu driver.
iF  amdgpu-dkms                          1:5.6.14.224-1147286              all          amdgpu driver in DKMS format.
ii  amdgpu-dkms-firmware                 1:5.6.14.224-1147286              all          firmware blobs used by amdgpu driver in DKMS format
ii  amdgpu-pin                           20.40-1147286                     all          Meta package to pin a specific amdgpu driver version.
ii  amdgpu-pro-core                      20.40-1147286                     all          Core meta package for Pro components of the unified amdgpu driver.
ii  amdgpu-pro-pin                       20.40-1147286                     all          Meta package to pin a specific amdgpu pro driver version.
ii  clinfo-amdgpu-pro                    20.40-1147286                     amd64        AMD OpenCL info utility
ii  libdrm-amdgpu-amdgpu1:amd64          1:2.4.100-1147286                 amd64        Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii  libdrm-amdgpu-common                 1.0.0-1147286                     all          List of AMD/ATI cards' device IDs, revision IDs and marketing names
ii  libdrm2-amdgpu:amd64                 1:2.4.100-1147286                 amd64        Userspace interface to kernel DRM services -- runtime
ii  ocl-icd-libopencl1-amdgpu-pro:amd64  20.40-1147286                     amd64        AMD OpenCL ICD Loader library
ii  opencl-amdgpu-pro-comgr              20.40-1147286                     amd64        non-free AMD OpenCL ICD Loaders
ii  opencl-amdgpu-pro-icd                20.40-1147286                     amd64        non-free AMD OpenCL ICD Loaders
ii  opencl-orca-amdgpu-pro-icd:amd64     20.40-1147286                     amd64        non-free AMD OpenCL ICD Loaders

An RX 580 is installed there. After a series of experiments with installing different versions, I installed AMDGPU-Pro 20.40 again and, voila, the settings were applied!

The difference between these systems is iF in amdgpu-dkms (not configured). Maybe the problem is here?
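
Spotting such packages by eye is error-prone in long listings, so a small filter can help. In `dpkg --list` output the first letter is the desired action and the second is the current status; to my knowledge `F` stands for half-configured. The sketch below prints every package whose state is not `ii`. not_fully_installed is a hypothetical helper name.

```shell
#!/bin/sh
# Read `dpkg --list` output on stdin and print packages whose state is not "ii".
# The state column is two letters, optionally followed by an error flag.
not_fully_installed() {
    awk '$1 ~ /^[a-z][A-Za-z][A-Za-z]?$/ && $1 != "ii" { print $1, $2 }'
}

# Typical use:
# dpkg --list | grep amdgpu | not_fully_installed
```

On the second system above, this should print `iF amdgpu-dkms` and nothing else.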

YWangScience commented 3 years ago

@YWangScience please share the full console output when invoking the script.

ok, here is the output

sudo amdgpu-clocks

Won't write initial state to /tmp/amdgpu-custom-states.card0.initial, it already exists.
Detecting the state values at /sys/class/drm/card0/device/pp_od_clk_voltage:
  SCLK state 0: 808Mhz
  SCLK state 1: 1600Mhz
  MCLK state 1: 1000Mhz
  VDDC Curve state 0: 808Mhz 714mV
  VDDC Curve state 1: 1304Mhz 811mV
  VDDC Curve state 2: 1600Mhz 900mV
  Maximum clocks & voltages:
    SCLK clock 2200Mhz
    MCLK clock 1200Mhz
  Curent power cap: 250W
Verifying user state values at /etc/default/amdgpu-custom-states.card0:
  SCLK state 1: 1600MHz
  VDDC Curve state 2: 1600MHz 900mV
Committing custom states to /sys/class/drm/card0/device/pp_od_clk_voltage:
/usr/local/bin/amdgpu-clocks: line 147: echo: write error: Invalid argument
ERROR: echo vc 0 808 714 > /sys/class/drm/card0/device/pp_od_clk_voltage
  Done

sibradzic commented 3 years ago

The difference between these systems is iF in amdgpu-dkms (not configured). Maybe the problem is here?

Likely. amdgpu-dkms left unconfigured probably means that the "pro" driver module was not compiled and loaded on boot. You can check it out with modinfo amdgpu | grep "file\|vers\|magic\|desc". I guess this solves your issue then, I will be closing it...
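
The same check can be scripted by looking at where the module file lives. The sketch below assumes the path convention visible in this thread's modinfo output: DKMS builds under updates/dkms, stock modules under kernel/drivers. module_origin is a hypothetical helper. Note that modinfo reports the module file found on disk, which is not necessarily the one currently loaded.

```shell
#!/bin/sh
# Classify a kernel module path as DKMS-built or in-kernel, based on its location.
module_origin() {
    case $1 in
        */updates/dkms/*)   echo "dkms" ;;
        */kernel/drivers/*) echo "in-kernel" ;;
        *)                  echo "unknown" ;;
    esac
}

# Report the origin of the amdgpu module if modinfo can find it.
if command -v modinfo >/dev/null 2>&1; then
    if path=$(modinfo -F filename amdgpu 2>/dev/null); then
        echo "amdgpu: $(module_origin "$path") ($path)"
    fi
fi
```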

sibradzic commented 3 years ago

@YWangScience so, this is your issue:

echo vc 0 808 714 > /sys/class/drm/card0/device/pp_od_clk_voltage

Very likely you'd get the same message if you ran that command manually, as root. The error means that the driver/SMU does not accept the specified values, which is weird because these same values are obviously the card's HW defaults. I've seen this before: some cards have inconsistent power, clock & voltage limits set in their PowerPlay table, meaning that the defaults work fine, but if you try to manually set the same voltage as the default, the SMU will reject it, because the default value is either under the min or over the max limit!

If you really care about getting to the bottom of this you can try experimenting with different voltages in the command above, starting with 750 or even 800. Or you could try checking and setting those voltage limits with https://github.com/sibradzic/upp.
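
The voltage experiment could be scripted roughly as follows. This is a dry-run sketch: vc_candidates is a hypothetical helper that only prints candidate commands, and the 750 to 800 mV range with a 25 mV step is an arbitrary choice based on the starting values suggested above.

```shell
#!/bin/sh
# Print one "vc" command per candidate voltage, from start to stop in fixed steps.
# Usage: vc_candidates <point> <clock-MHz> <start-mV> <stop-mV> <step-mV>
vc_candidates() {
    point=$1 clock=$2 mv=$3 stop=$4 step=$5
    while [ "$mv" -le "$stop" ]; do
        echo "vc $point $clock $mv"
        mv=$((mv + step))
    done
}

# Dry run: only prints the commands; nothing is written to the GPU.
vc_candidates 0 808 750 800 25

# To apply them (as root), stopping at the first value the driver accepts:
# vc_candidates 0 808 750 800 25 | while read -r cmd; do
#     echo "$cmd" > /sys/class/drm/card0/device/pp_od_clk_voltage \
#         && { echo "accepted: $cmd"; break; }
# done
```

The lowest accepted value gives a hint about where the effective minimum voltage limit sits for that curve point.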

But regardless, I wouldn't worry too much, amdgpu-clocks actually works for you pretty well...

pavelis commented 3 years ago

The difference between these systems is iF in amdgpu-dkms (not configured). Maybe the problem is here?

Likely. amdgpu-dkms left unconfigured probably means that the "pro" driver module was not compiled and loaded on boot. You can check it out with modinfo amdgpu | grep "file\|vers\|magic\|desc". I guess this solves your issue then, I will be closing it...

System with ii amdgpu-dkms:

filename:       /lib/modules/5.4.0-54-generic/updates/dkms/amdgpu.ko
version:        5.6.16.20.40
description:    AMD GPU
srcversion:     533BB7E5866E52F63B9ACCB
vermagic:       5.4.0-54-generic SMP mod_unload 
parm:           hws_gws_support:Assume MEC2 FW supports GWS barriers (false = rely on FW version check (Default), true = force supported) (bool)

System with iF amdgpu-dkms:

filename:       /lib/modules/5.4.0-54-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
description:    AMD GPU
srcversion:     533BB7E5866E52F63B9ACCB
vermagic:       5.4.0-54-generic SMP mod_unload

I suppose the solution would be to use amdgpu-pro, but with the in-kernel amdgpu driver.

YWangScience commented 3 years ago

@YWangScience so, this is your issue:

echo vc 0 808 714 > /sys/class/drm/card0/device/pp_od_clk_voltage

Very likely you'd get the same message if you ran that command manually, as root. The error means that the driver/SMU does not accept the specified values, which is weird because these same values are obviously the card's HW defaults. I've seen this before: some cards have inconsistent power, clock & voltage limits set in their PowerPlay table, meaning that the defaults work fine, but if you try to manually set the same voltage as the default, the SMU will reject it, because the default value is either under the min or over the max limit!

If you really care about getting to the bottom of this you can try experimenting with different voltages in the command above, starting with 750 or even 800. Or you could try checking and setting those voltage limits with https://github.com/sibradzic/upp.

But regardless, I wouldn't worry too much, amdgpu-clocks actually works for you pretty well...

@sibradzic Thank you for your prompt and detailed explanation. As you said, your program works smoothly on my device, so I am not going to tinker with the parameters anymore. Thumbs up for amdgpu-clocks.

sibradzic commented 3 years ago

I suppose the solution would be to use amdgpu-pro, but with the in-kernel amdgpu driver.

Yes, the upstream amdgpu driver seems to have a much more stable pp_od_clk_voltage (aka OverDrive) interface. The amdgpu-pro amdgpu driver seems to break it from time to time, but you may get different results with it depending on which underlying kernel version you compile it against. The kernel 5.4 + amdgpu-pro 20.40 combination seems to end up with a broken OverDrive interface, but you could try compiling it against 5.8 or newer if you feel adventurous...

pavelis commented 3 years ago

I found a solution that works great for me: 20.04's latest kernel 5.4.0-66 + AMDGPU-Pro 20.45. I will also try the 5.10 kernel with the upstream driver in the future. My current modules:

$ modinfo amdgpu | grep "file\|vers\|magic\|desc"
filename:       /lib/modules/5.4.0-66-generic/updates/dkms/amdgpu.ko
version:        5.6.20.20.45
description:    AMD GPU
srcversion:     533BB7E5866E52F63B9ACCB
vermagic:       5.4.0-66-generic SMP mod_unload 
$ dpkg --list | grep amdgpu-pro
ii  amdgpu-pro-core                            20.45-1188099                         all          Core meta package for Pro components of the unified amdgpu driver.
ii  amdgpu-pro-pin                             20.45-1188099                         all          Meta package to pin a specific amdgpu pro driver version.
ii  amdgpu-pro-rocr-opencl                     20.45-1188099                         amd64        Meta package to install ROCm OpenCL Pro components.
ii  clinfo-amdgpu-pro                          20.45-1188099                         amd64        AMD OpenCL info utility
ii  comgr-amdgpu-pro:amd64                     1.7.0-1188099                         amd64        Development files for ROCm ROCm Code Object Manager
ii  hip-rocr-amdgpu-pro                        20.45-1188099                         amd64        ROCr HIP Clang Runtime
ii  ocl-icd-libopencl1-amdgpu-pro:amd64        20.45-1188099                         amd64        AMD OpenCL ICD Loader library
ii  opencl-rocr-amdgpu-pro:amd64               20.45-1188099                         amd64        ROCr OpenCL Runtime

@sibradzic Thank you for your help and amdgpu-clocks that I continue to use!