Closed Kenzo95 closed 2 years ago
Ciao Kenzo from Italy :)
It's likely that this issue has nothing to do with upp. If you use the set and get commands against the entries you are setting, you'll see that the value is set correctly, and that it does not mistakenly set some other values without you knowing...
So, blame this behaviour on driver, card firmware or VBIOS. Speaking of which:
- What's the kernel / driver that you are using right now?
- Are you using the mainline driver, amdgpu-pro, or something else?
- Does the same issue happen on all cards?
- Why are you using 7 different upp commands, each followed by sleep 5s?
- What is amd-info and where did you get it from?
- Can you share the output of /sys/class/drm/cardX/device/pp_od_clk_voltage, /sys/class/drm/cardX/device/pp_dpm_sclk and /sys/class/drm/cardX/device/pp_dpm_mclk, before and after changing each PP table value with upp?
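One way to collect the requested before/after output is to copy the three sysfs files into labelled directories and diff them. This is only a sketch: the function name pp_snapshot and the directory layout are made up for illustration, and the device path has to be adjusted to the card in question.

```shell
# Copy the three sysfs files into <out>/<label>/ so two snapshots can be diffed.
# dev:   device directory, e.g. /sys/class/drm/card1/device
# out:   directory that collects all snapshots
# label: name of this snapshot, e.g. "before" or "after"
pp_snapshot() {
  local dev="$1" out="$2" label="$3" f
  mkdir -p "$out/$label" || return 1
  for f in pp_od_clk_voltage pp_dpm_sclk pp_dpm_mclk; do
    # A file that cannot be read is recorded as empty instead of aborting.
    cat "$dev/$f" > "$out/$label/$f" 2>/dev/null || : > "$out/$label/$f"
  done
}

# Usage (as root):
#   pp_snapshot /sys/class/drm/card1/device /tmp/pp before
#   ... change the table with upp ...
#   pp_snapshot /sys/class/drm/card1/device /tmp/pp after
#   diff -ru /tmp/pp/before /tmp/pp/after
```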
1. Linux HiveOS (Ubuntu-based distro) 5.10.0-hiveos #72, AMD driver A20.40 (5.11.1001). This driver is known to be heavily modded for the work these machines do. I've procured an RX 6600 (Hynix) that works up to 950 MHz without problems; depending on the type of memory, some RX cards can reach 1075 MHz (Samsung), and I have seen plenty of RX 6600 XT cards running their memory at up to 1200 MHz. Lastly, I've created an SPPT with MorePowerTool on Windows 10 (latest AMD driver 21.1x.x) that gets the W6600 to work at a 1075 MHz clock, but after installing more than 3 GPUs, all GPUs somehow fall back to memory DPM state 2.
2. You using mainline driver or amdgpu-pro or something? AMD modded driver 20.40 with AMD kernel 5.11.1001, to support the latest Navi 23 GPUs.
Why are you using 7 different upp commands, each followed by sleep 5s?
Because I control all machines from a remote shell and can send multiple commands at once, but if I change too many GPU parameters at once, the GPU risks dropping into a single GFX state at 500 MHz (safe state?). I'm pretty sure some of the pauses are not needed, but why not include them in this testing phase?
The first 4 commands are not strictly needed; they are intended to let the GPU work at the high memory clock. In Windows testing, to reach a state where the GPU doesn't go into the safe state, we had to lower the SoC clock below 1000 MHz and the voltage below 735 mV.
What is amd-info and where did you get it from?
amd-info is a tool preinstalled in HiveOS; it is pretty useful for checking the overall GPU data at any moment.
A little note before reading: I've forced the GFX DPM state to 1. I can unset the parameter without problem if needed.
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/TdcLimit/1=35 --write
Changing smc_pptable.TdcLimit.1 from 18 to 35 at 0x350
Commiting changes to '/sys/class/drm/card1/device/pp_table'.
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz
3: 875Mhz *
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/FreqTableSocclk/1=980 --write
Changing smc_pptable.FreqTableSocclk.1 from 1280 to 980 at 0x570
Commiting changes to '/sys/class/drm/card1/device/pp_table'.
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz
3: 875Mhz *
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/MinVoltageSoc=2720 --write
Changing smc_pptable.MinVoltageSoc from 3224 to 2720 at 0x3a8
Commiting changes to '/sys/class/drm/card1/device/pp_table'.
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz
3: 875Mhz *
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/MaxVoltageSoc=3200 --write
Changing smc_pptable.MaxVoltageSoc from 4200 to 3200 at 0x3ac
Commiting changes to '/sys/class/drm/card1/device/pp_table'.
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz
3: 875Mhz *
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set /power_saving_clock/max/2=910 --write
Changing power_saving_clock.max.2 from 875 to 910 at 0x03e
Commiting changes to '/sys/class/drm/card1/device/pp_table'.
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz
3: 875Mhz *
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/DcModeMaxFreq/2=910 --write
Changing smc_pptable.DcModeMaxFreq.2 from 875 to 910 at 0x62e
Commiting changes to '/sys/class/drm/card1/device/pp_table'.
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz
3: 875Mhz *
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/FreqTableUclk/3=900 --write
Changing smc_pptable.FreqTableUclk.3 from 875 to 900 at 0x584
Commiting changes to '/sys/class/drm/card1/device/pp_table'.
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz *
3: 900Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# echo high > /sys/class/drm/card1/device/power_dpm_force_performance_level
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz *
3: 900Mhz
root@rig9CF1BB:/#
root@rig9CF1BB:/# echo c > /sys/class/drm/card1/device/pp_od_clk_voltage
root@rig9CF1BB:/#
root@rig9CF1BB:/# sleep 5s
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/#
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz *
1: 500Mhz *
AMD driver A20.40 (5.11.1001). This driver is known to be heavily modded for the work these machines do. I've procured an RX 6600 (Hynix) that works up to 950 MHz without problems; depending on the type of memory, some RX cards can reach 1075 MHz (Samsung), and I have seen plenty of RX 6600 XT cards running their memory at up to 1200 MHz.
First of all, none of your problems are related to upp; as far as I can tell this is all due to the kernel and card firmware / BIOS. The non-mainline AMD (pro?) driver is known to have issues with power management, over/under clocking/volting included, especially when compiled against not-so-fresh Linux kernels. The Pro W6600 is a very recent card. If you want to try to improve and test some things, please put one of your cards into a machine running a recent distro with the upstream kernel driver (such as Ubuntu 21.10 + latest stable kernel PPA, or Manjaro using 'open-source' graphics) and try changing the table with upp.
Lastly, I've created an SPPT with MorePowerTool on Windows 10 (latest AMD driver 21.1x.x) that gets the W6600 to work at a 1075 MHz clock, but after installing more than 3 GPUs, all GPUs somehow fall back to memory DPM state 2.
If you still have Windows installed on some partition, you can use upp to read the power-play table exactly as MorePowerTool set it. It may be a good reference for how to set everything correctly with upp in Linux. If one card works as expected, all should work in the same way, unless there is some driver problem, which I cannot help you with...
I control all machines from a remote shell and can send multiple commands at once, but if I change too many GPU parameters at once, the GPU risks dropping into a single GFX state at 500 MHz (safe state?)
There is no need to change power-play parameters one by one. You actually expose the driver to more risk that way, as the driver has to restart power management every time the power-play table is updated, so it is actually safer to update the table just once.
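As a rough sketch of the "update the table once" advice: upp's set command accepts several path=value assignments in a single call (that form appears elsewhere in this thread), so the seven separate writes can be folded into one. The helper name build_upp_cmd is hypothetical, and it only prints the command so it can be reviewed before anything is written.

```shell
# Print ONE upp invocation covering all assignments, instead of one call
# (and one power-management restart) per value.
# $1 is the pp_table path; the remaining arguments are path=value pairs.
build_upp_cmd() {
  local table="$1" kv
  shift
  printf 'upp -p %s set --write' "$table"
  for kv in "$@"; do
    printf ' %s' "$kv"
  done
  printf '\n'
}

# Usage with the values from the original command list:
#   build_upp_cmd /sys/class/drm/card1/device/pp_table \
#     smc_pptable/TdcLimit/1=35 smc_pptable/FreqTableSocclk/1=980 \
#     smc_pptable/MinVoltageSoc=2720 smc_pptable/MaxVoltageSoc=3200 \
#     /power_saving_clock/max/2=910 smc_pptable/DcModeMaxFreq/2=910 \
#     smc_pptable/FreqTableUclk/3=900
```

Printing rather than executing is a deliberate choice here: the resulting line can be checked by eye and then run as root once.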
According to the bottom of your output:
. . .
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz
1: 800Mhz *
2: 950Mhz
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz *
3: 900Mhz
root@rig9CF1BB:/# echo c > /sys/class/drm/card1/device/pp_od_clk_voltage
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz *
1: 500Mhz *
it is obvious that none of the upp commands are causing your issue. The issue is caused by echo c > /sys/class/drm/card1/device/pp_od_clk_voltage, which seems to mess up the card's clock & power management. This is a driver issue; please check the dmesg output for details, and consider reporting this to AMD. Have you checked what happens when you don't run any upp commands at all, but only (followed by reboot):
echo high > /sys/class/drm/card1/device/power_dpm_force_performance_level
echo c > /sys/class/drm/card1/device/pp_od_clk_voltage
?
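To make the comparison between such test runs less error-prone, the pp_dpm_sclk / pp_dpm_mclk output can be checked mechanically. The sketch below assumes the "N: <clock>Mhz [*]" line format shown in the logs in this thread; the function names are made up for illustration.

```shell
# Print the clock (MHz, unit stripped) of every level marked active with '*'.
active_clocks() {
  awk '/\*/ { sub(/[Mm][Hh]z/, "", $2); print $2 }' "$1"
}

# Succeed (exit 0) when all listed levels report the same clock, which is
# the "single state at 500 MHz" symptom described in this issue.
is_collapsed() {
  awk '{ sub(/[Mm][Hh]z/, "", $2); seen[$2] = 1 }
       END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }' "$1"
}

# Usage:
#   active_clocks /sys/class/drm/card1/device/pp_dpm_mclk
#   if is_collapsed /sys/class/drm/card1/device/pp_dpm_sclk; then
#     echo "sclk levels collapsed, check dmesg"
#   fi
```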
Also, if you really want to use the pp_od_clk_voltage sysfs API to control the card, please use the latest mainline kernel driver & latest linux-firmware, and consider using dedicated tools for controlling that interface, such as https://github.com/sibradzic/amdgpu-clocks.
If none of these suggestions are feasible for you, you can always send me one of the cards so I can test everything myself ;)
@Kenzo95 ping
I can reproduce this issue on Navi 22 and Navi 21.
The kernel version is 5.11.22 plus several commits from the 5.12 kernel that open up the pp_od_clk_voltage interface.
I also tried other kernels, including 5.10.y plus the DKMS sources and firmware contained in the official 21.40.1 drivers; none of them let me run Navi 22 and Navi 21 properly at the same time in one PC.
When I trigger a GPU error (for example by wrongly modifying other parts of the pptable, or by using another kernel version), it seems that a GPU protection mechanism is activated, and this mechanism cannot be deactivated by restarting the whole system.
The default pp_dpm_mclk output on Navi 22 is:
0: 96Mhz
1: 456Mhz
2: 675Mhz
3: 1000Mhz *
Once the GPU has triggered the protection mechanism, even if "smc_pptable/FreqTableUclk/3" is only changed from 1000 to 1001, pp_dpm_mclk is forced to 675 and cannot be forced to 1001 with echo "3" > pp_dpm_mclk.
But if I execute the commands in the following order, I can successfully set the correct memory frequency:
echo "m 1 1075" > /sys/class/drm/card$i/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card$i/device/pp_od_clk_voltage
upp -p /sys/class/drm/card$i/device/pp_table set --write smc_pptable/FreqTableGfx/1=1250 smc_pptable/FreqTableUclk/3=1075
echo "r" > /sys/class/drm/card$i/device/pp_od_clk_voltage
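The sequence above uses $i for the card index without defining it. A sketch of how it might be generated per card is below; the function name mem_oc_sequence is hypothetical, and it only prints the four commands in the stated order so they can be reviewed before being run as root.

```shell
# Emit the ordered sequence for one card; nothing is written to sysfs here.
# $1: card index, $2: memory clock in MHz, $3: value for FreqTableGfx/1
mem_oc_sequence() {
  local card="$1" mclk="$2" gfx="$3"
  local dev="/sys/class/drm/card${card}/device"
  printf 'echo "m 1 %s" > %s/pp_od_clk_voltage\n' "$mclk" "$dev"
  printf 'echo "c" > %s/pp_od_clk_voltage\n' "$dev"
  printf 'upp -p %s/pp_table set --write smc_pptable/FreqTableGfx/1=%s smc_pptable/FreqTableUclk/3=%s\n' \
    "$dev" "$gfx" "$mclk"
  printf 'echo "r" > %s/pp_od_clk_voltage\n' "$dev"
}

# Usage: print the sequence for cards 0 and 1 with the values quoted above.
#   for i in 0 1; do mem_oc_sequence "$i" 1075 1250; done
```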
Sorry for not getting back to you for some weeks. I've also run my own tests without success, including hard-flashing some other Navi 23 BIOSes onto that card. I'll try @oiG8Uchi's solution today and let you know the results. If you want to try remote access to one of our machines, you are welcome. Thanks.
After testing @oiG8Uchi's method, the GPU goes into protection anyway, with this result:
=== GPU 0, 03:00.0 Radeon Pro W6600 8176 MB ===
Bios: 113-D5330100-100
Core: 500 MHz 625mV, Mem: 1074 MHz
PerfCtrl: manual, Load: 99%, MemLoad: 21%, Power: 32.0 W, Cap: 65 W
Core: 50°C, HotSpot: 52°C, Mem: 58°C, Fan: 29%, RPM: 1710
Core state: 0, clocks: 500* 500
Mem state: 3, clocks: 96 541 675 1074*
SOC state: 1, clocks: 872 1200*
DCEF state: 1, clocks: 417 960* 1200
F state: 0, clocks: 1551* 1801
PCIE Link speed: n/a, PCIE Link width: n/a
Memory total: 8176.00 MB, used: 4786.97 MB, free: 3389.03 MB, type: Samsung GDDR6
I've also tried the BIOS of an ASRock RX 6600 that can be overclocked to 950 MHz, and the W6600 with the ASRock RX 6600 BIOS boots up successfully. But in any case I can't push the memory clock beyond 875 MHz without the protection kicking in. I've also tested a Sapphire RX 6600 BIOS and an RX 6600 XT BIOS: every time, the GPU works fine as long as the memory clock does not go above 875 MHz, but with just 1 MHz more the protection kicks in. Maybe the protection is in hardware?
Strangely, on Windows (drivers tested: 21.11.x), if I run the SoC at this exact voltage, "731mv", I can overclock the memory clock up to 990 or 1075 MHz. But if I add more than 3 GPUs, the protection kicks in or the memory goes into DPM state 2.
I also have two ASRock RX 6600 (non-XT, Navi 23) cards, but the situation is different from Navi 21 and Navi 22. If the protection mechanism (I guess) is triggered, then after the smc_pptable/FreqTableGfx/1=950 command is executed using upp, the pp_dpm_sclk interface becomes:
0: 500Mhz
1: 945Mhz *
2: 950Mhz
and I can't use the pp_dpm_sclk interface to lock the frequency to 950 MHz. I can also run
echo "s 1 950" > /sys/class/drm/card1/device/pp_od_clk_voltage
echo "m 1 950" > /sys/class/drm/card1/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card1/device/pp_od_clk_voltage
upp -p /sys/class/drm/card1/device/pp_table set --write smc_pptable/FreqTableGfx/1=950 smc_pptable/FreqTableUclk/3=950
echo "r" > /sys/class/drm/card1/device/pp_od_clk_voltage
to restore the sclk of Navi 23 to 950 MHz.
The difference is that this PC uses the 5.10.84 kernel plus the DKMS sources and firmware contained in the official 21.40.1 drivers.
I think HiveOS is not a good fit for your testing. I have not seen how they modify the pptable or change the kernel source code. Maybe you should use a general-purpose Linux distribution plus the latest 5.15 kernel for testing. At least my RX 6600 works well with the 5.15 kernel on my Gentoo Linux, but the RX 6800 XT hits a message saying that it cannot exit BACO state.
I'm an RX 6600 user. After reading what you have done above, I would like to know how you flash the Navi 23 VBIOS; the tools I found don't recognize the card and can't force-flash it.
**Hello, I'm Kenzo from Italy. I'm looking for help with some Navi 23 AMD Radeon Pro W6600 cards.
Every time I try to overclock the memory on this GPU, the GFX clock drops to a single state at 500 MHz. Forcing GFX state 0 to a higher clock makes the GPU stop working, and then a reboot is needed.**
GPU initial Status
command list:
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/TdcLimit/1=35 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/FreqTableSocclk/1=980 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/MinVoltageSoc=2720 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/MaxVoltageSoc=3200 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set /power_saving_clock/max/2=910 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/DcModeMaxFreq/2=910 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/FreqTableUclk/3=900 --write
sleep 5s
echo high > /sys/class/drm/card1/device/power_dpm_force_performance_level
sleep 5s
echo c > /sys/class/drm/card1/device/pp_od_clk_voltage
A sleep is necessary between one command and the next when writing the PP table, or the GPU automatically goes into the locked state. When the GPU is locked, I need to reboot to return to the previous state.
GPU status at Locked state:
=== GPU 1, 06:00.0 Radeon Pro W6600 8176 MB ===
Bios: 113-D5330100-100
Core: 500 MHz 675mV, Mem: 900 MHz
PerfCtrl: high, Load: 99%, MemLoad: 36%, Power: 42.0 W, Cap: 100 W
Core: 55°C, HotSpot: 58°C, Mem: 64°C, Fan: 23%, RPM: 1368
Core state: 0, clocks: 500* 500
Mem state: 3, clocks: 96 541 675 900*
SOC state: 1, clocks: 872 960*
DCEF state: 1, clocks: 417 685* 1200
F state: 0, clocks: 1551* 1801
PCIE Link speed: n/a, PCIE Link width: n/a
Memory total: 8176.00 MB, used: 4790.97 MB, free: 3385.03 MB, type: Samsung GDDR6
Stock PP_table of Radeon Pro W6600 attached as file.txt PP_table_w6600.txt
I hope you can provide some type of help to solve the situation. Thanks Kenzo.