sibradzic / upp

A tool for parsing, dumping and modifying data in Radeon PowerPlay tables
GNU General Public License v3.0

AMD Navi 23 Radeon PRO W6600: GFX clock locks to 500 MHz after trying to overclock memory #25

Closed Kenzo95 closed 2 years ago

Kenzo95 commented 3 years ago

**Hello, I'm Kenzo from Italy. I'm looking for help with some Navi 23 AMD Radeon Pro W6600 cards.

Every time I try to overclock the memory on this GPU, the GFX clock drops to a single state at 500 MHz. Forcing GFX state 0 to a higher clock makes the GPU stop working, and a reboot is then needed.**

Initial GPU status:

root@rig9CF1BB:/# amd-info                                                                                                                                                                                                                                                                                                                                                                                                            
Thu Nov 18 22:15:25 EET 2021                                                                                                                                                                                                                                                                                                                                                                                                          

=== GPU 0, 03:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 725 MHz 675mV, Mem: 875 MHz
  PerfCtrl: manual, Load: 99%, MemLoad: 45%, Power: 48.0 W, Cap: 100 W
  Core: 57°C, HotSpot: 58°C, Mem: 66°C, Fan: 23%, RPM: 1368
  Core state: 1, clocks: 500 725* 950                                                                                                                                                                                                                                                                                                                                                                                                 
  Mem  state: 3, clocks: 96 541 675 875*
  SOC  state: 1, clocks: 872 1200*
  DCEF state: 1, clocks: 417 685* 1200                                                                                                                                                                                                                                                                                                                                                                                                
  F    state: 0, clocks: 1551* 1801                                                                                                                                                                                                                                                                                                                                                                                                   
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4790.97 MB, free: 3385.03 MB, type: Samsung GDDR6

=== GPU 1, 06:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 800 MHz 675mV, Mem: 875 MHz
  PerfCtrl: manual, Load: 99%, MemLoad: 45%, Power: 47.0 W, Cap: 100 W
  Core: 55°C, HotSpot: 58°C, Mem: 66°C, Fan: 23%, RPM: 1368
  Core state: 1, clocks: 500 800* 950                                                                                                                                                                                                                                                                                                                                                                                                 
  Mem  state: 3, clocks: 96 541 675 875*
  SOC  state: 1, clocks: 872 1200*
  DCEF state: 1, clocks: 417 685* 1200                                                                                                                                                                                                                                                                                                                                                                                                
  F    state: 0, clocks: 1551* 1801                                                                                                                                                                                                                                                                                                                                                                                                   
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4790.97 MB, free: 3385.03 MB, type: Samsung GDDR6

=== GPU 2, 09:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 800 MHz 675mV, Mem: 875 MHz
  PerfCtrl: manual, Load: 99%, MemLoad: 45%, Power: 47.0 W, Cap: 100 W
  Core: 58°C, HotSpot: 62°C, Mem: 68°C, Fan: 23%, RPM: 1368
  Core state: 1, clocks: 500 800* 950                                                                                                                                                                                                                                                                                                                                                                                                 
  Mem  state: 3, clocks: 96 541 675 875*
  SOC  state: 1, clocks: 872 1200*
  DCEF state: 1, clocks: 417 685* 1200                                                                                                                                                                                                                                                                                                                                                                                                
  F    state: 0, clocks: 1551* 1801                                                                                                                                                                                                                                                                                                                                                                                                   
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4790.97 MB, free: 3385.03 MB, type: Samsung GDDR6

=== GPU 3, 0c:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 775 MHz 675mV, Mem: 875 MHz
  PerfCtrl: manual, Load: 99%, MemLoad: 45%, Power: 47.0 W, Cap: 100 W
  Core: 57°C, HotSpot: 60°C, Mem: 66°C, Fan: 23%, RPM: 1368
  Core state: 1, clocks: 500 775* 950                                                                                                                                                                                                                                                                                                                                                                                                 
  Mem  state: 3, clocks: 96 541 675 875*
  SOC  state: 1, clocks: 872 1200*
  DCEF state: 1, clocks: 417 685* 1200                                                                                                                                                                                                                                                                                                                                                                                                
  F    state: 0, clocks: 1551* 1801                                                                                                                                                                                                                                                                                                                                                                                                   
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4790.97 MB, free: 3385.03 MB, type: Samsung GDDR6

=== GPU 4, 0f:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 825 MHz 675mV, Mem: 875 MHz
  PerfCtrl: manual, Load: 99%, MemLoad: 45%, Power: 47.0 W, Cap: 100 W
  Core: 58°C, HotSpot: 61°C, Mem: 68°C, Fan: 23%, RPM: 1368
  Core state: 1, clocks: 500 825* 950                                                                                                                                                                                                                                                                                                                                                                                                 
  Mem  state: 3, clocks: 96 541 675 875*
  SOC  state: 1, clocks: 872 1200*
  DCEF state: 1, clocks: 417 685* 1200                                                                                                                                                                                                                                                                                                                                                                                                
  F    state: 0, clocks: 1551* 1801                                                                                                                                                                                                                                                                                                                                                                                                   
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4790.97 MB, free: 3385.03 MB, type: Samsung GDDR6

=== GPU 5, 12:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 800 MHz 675mV, Mem: 875 MHz
  PerfCtrl: manual, Load: 99%, MemLoad: 45%, Power: 47.0 W, Cap: 100 W
  Core: 59°C, HotSpot: 61°C, Mem: 66°C, Fan: 23%, RPM: 1368
  Core state: 1, clocks: 500 800* 950                                                                                                                                                                                                                                                                                                                                                                                                 
  Mem  state: 3, clocks: 96 541 675 875*
  SOC  state: 1, clocks: 872 1200*
  DCEF state: 1, clocks: 417 685* 1200                                                                                                                                                                                                                                                                                                                                                                                                
  F    state: 0, clocks: 1551* 1801                                                                                                                                                                                                                                                                                                                                                                                                   
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4790.96 MB, free: 3385.04 MB, type: Samsung GDDR6

Command list:

upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/TdcLimit/1=35 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/FreqTableSocclk/1=980 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/MinVoltageSoc=2720 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/MaxVoltageSoc=3200 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set /power_saving_clock/max/2=910 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/DcModeMaxFreq/2=910 --write
sleep 5s
upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/FreqTableUclk/3=900 --write
sleep 5s
echo high > /sys/class/drm/card1/device/power_dpm_force_performance_level
sleep 5s
echo c > /sys/class/drm/card1/device/pp_od_clk_voltage

A sleep is needed between one command and the next when writing the PP table; otherwise the GPU automatically goes into the locked state. Once the GPU is locked, I need to reboot to return to the previous state.
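For comparison with the one-write-per-value sequence above, here is a minimal sketch that passes all assignments to a single `upp set` call. It assumes the installed upp version accepts several `path=value` assignments in one invocation, which should mean the table is written once instead of seven times. `PP_TABLE`, `DRY_RUN` and `apply_settings` are illustrative names, not part of upp, and the script only prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: apply several PowerPlay table assignments with one upp call.
# PP_TABLE, DRY_RUN and apply_settings are illustrative names, not part of upp.
PP_TABLE="${PP_TABLE:-/sys/class/drm/card1/device/pp_table}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run upp

apply_settings() {
    # Every argument is one path=value assignment for "upp set".
    if [ "$DRY_RUN" = "1" ]; then
        echo "upp -p $PP_TABLE set $* --write"
    else
        upp -p "$PP_TABLE" set "$@" --write
    fi
}

# The same seven assignments as in the command list, in the same order.
apply_settings \
    smc_pptable/TdcLimit/1=35 \
    smc_pptable/FreqTableSocclk/1=980 \
    smc_pptable/MinVoltageSoc=2720 \
    smc_pptable/MaxVoltageSoc=3200 \
    /power_saving_clock/max/2=910 \
    smc_pptable/DcModeMaxFreq/2=910 \
    smc_pptable/FreqTableUclk/3=900
```

Whether a single write avoids the locked state on this card is untested here; the dry-run default is there so the generated command can be checked before anything touches the GPU.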

GPU status in the locked state:

=== GPU 1, 06:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 500 MHz 675mV, Mem: 900 MHz
  PerfCtrl: high, Load: 99%, MemLoad: 36%, Power: 42.0 W, Cap: 100 W
  Core: 55°C, HotSpot: 58°C, Mem: 64°C, Fan: 23%, RPM: 1368
  Core state: 0, clocks: 500 500
  Mem  state: 3, clocks: 96 541 675 900
  SOC  state: 1, clocks: 872 960
  DCEF state: 1, clocks: 417 685 1200
  F    state: 0, clocks: 1551* 1801
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4790.97 MB, free: 3385.03 MB, type: Samsung GDDR6
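With 20 cards across several rigs, it may help to detect the locked state from a script rather than by reading amd-info output. The sketch below assumes the amdgpu `pp_dpm_sclk` format of one level per line, with a trailing `*` on the active level. `active_level` is an illustrative helper name, and the sample input is typed in by hand rather than read from a card.

```shell
#!/bin/sh
# Sketch: report the active level from pp_dpm_sclk-style text on stdin.
# active_level is an illustrative helper name.
active_level() {
    # The active line carries a trailing "*", e.g. "1: 800Mhz *".
    # Print the level index (without the colon) and its clock.
    awk '/\*/ { sub(":", "", $1); print $1, $2 }'
}

# Hand-written sample in the pp_dpm_sclk format.
printf '0: 500Mhz\n1: 800Mhz *\n2: 950Mhz\n' | active_level
```

On a real card the input would be `cat /sys/class/drm/card1/device/pp_dpm_sclk`; a result of level 0 at 500Mhz would match the locked state shown above.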

The stock PP table of the Radeon Pro W6600 is attached as a text file: PP_table_w6600.txt

I hope you can provide some help to solve this. Thanks, Kenzo.

sibradzic commented 3 years ago

Ciao Kenzo from Italy :)

This issue likely has nothing to do with upp. If you use the set and get commands against the entries you are setting, you'll see that each value is set correctly, and that upp is not mistakenly setting some other values without you knowing. So blame this behaviour on the driver, card firmware or VBIOS. Speaking of which:

  1. What's the kernel / driver that you are using right now?
  2. You using mainline driver or amdgpu-pro or something?
  3. Does the same issue happen on all cards?
  4. Why are you using 7 different upp commands, each followed by sleep 5s?
  5. What is amd-info and where did you get it from?
  6. Can you share output of /sys/class/drm/cardX/device/pp_od_clk_voltage, /sys/class/drm/cardX/device/pp_dpm_sclk and /sys/class/drm/cardX/device/pp_dpm_mclk, before and after changing each PP table value with upp?

Kenzo95 commented 3 years ago

  1. Linux HiveOS (Ubuntu-based distro), kernel 5.10.0-hiveos #72, AMD driver A20.40 (5.11.1001). This driver is known to be heavily modded for the work these machines do. I have procured an RX 6600 (Hynix) that runs at up to 950 MHz without problems; depending on the memory type, some RX cards can reach up to 1075 MHz (Samsung), and I have seen plenty of RX 6600 XT cards running memory at up to 1200 MHz. Finally, I created an SPPT with MorePowerTool on Windows 10 with the latest AMD driver (21.1x.x) that gets the W6600 to run at a 1075 MHz clock, but after installing more than 3 GPUs, all of them somehow fall back to memory DPM state 2.
  2. You using mainline driver or amdgpu-pro or something? AMD modded driver 20.40 with AMD kernel 5.11.1001, to support the latest Navi 23 GPUs.
  3. Does the same issue happen on all cards? Yes. I now have 20 W6600 cards installed in 4 different machines: 3 with an H510 Pro BTC+ and an i3 10100f, 1 with an X470 Gaming Plus Max and a Ryzen 3600.
  4. Why are you using 7 different upp commands, each followed by sleep 5s? Because I control all machines from a remote shell and can send multiple commands at once, but if I change too many parameters on the GPU at once, the GPU risks dropping to the single GFX state at 500 MHz (safe state?). I'm pretty sure some of the pauses are not needed, but why not include them in this testing phase? The first 4 commands are not strictly needed; they are meant to let the GPU work at the high memory clock. That is because in testing on Windows, to reach a state where the GPU doesn't go into the safe state, we had to lower the SoC clock below 1000 MHz and the voltage below 735 mV.
  5. What is amd-info and where did you get it from? amd-info is a tool preinstalled in HiveOS; it is pretty useful for checking the overall GPU data at any moment.
  6. Can you share output of /sys/class/drm/cardX/device/pp_od_clk_voltage, /sys/class/drm/cardX/device/pp_dpm_sclk and /sys/class/drm/cardX/device/pp_dpm_mclk, before and after changing each PP table value with upp?
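The answer about the separate upp commands quotes a SoC voltage target in mV, while the commands set raw values such as MinVoltageSoc=2720 and MaxVoltageSoc=3200. As a worked conversion, the sketch below assumes these fields are stored in quarter-millivolt steps (the "mV (Q2)" fixed-point encoding noted in the amdgpu SMU driver headers), so the raw value divided by 4 gives millivolts. `q2_to_mv` is an illustrative helper name.

```shell
#!/bin/sh
# Sketch: convert raw SoC voltage fields to millivolts.
# Assumption: the fields are stored in 0.25 mV steps (Q2 fixed point).
# q2_to_mv is an illustrative helper name.
q2_to_mv() {
    awk -v raw="$1" 'BEGIN { printf "%.2f\n", raw / 4 }'
}

q2_to_mv 3224   # stock MinVoltageSoc reported by upp -> 806.00
q2_to_mv 2720   # new MinVoltageSoc                   -> 680.00
q2_to_mv 3200   # new MaxVoltageSoc                   -> 800.00
```

Under that assumption, MaxVoltageSoc=3200 corresponds to 800 mV, which is above the 735 mV figure mentioned for the Windows tests.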

A little note before reading: I've forced the GFX DPM state to 1; I can unset it without problems if needed.
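The log that follows repeats the same three `cat` calls after every write. A small helper along these lines could collect them in one step; `snapshot` and `SYSFS_ROOT` are illustrative names, and `SYSFS_ROOT` exists only so the helper can be pointed at a test directory instead of the real sysfs tree.

```shell
#!/bin/sh
# Sketch: dump the three requested sysfs files for one card.
# snapshot and SYSFS_ROOT are illustrative names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/class/drm}"

snapshot() {
    card="$1"   # e.g. card1
    for f in pp_od_clk_voltage pp_dpm_sclk pp_dpm_mclk; do
        echo "== $card/$f =="
        # Print a marker instead of failing when a file cannot be read.
        cat "$SYSFS_ROOT/$card/device/$f" 2>/dev/null || echo "(unreadable)"
    done
}
```

Running `snapshot card1 > before.txt` before a upp write and `snapshot card1 > after.txt` after it, then `diff before.txt after.txt`, would show which clock levels changed as a result of that write.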

root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/TdcLimit/1=35 --write                                                                                                                                                                                                                         
Changing smc_pptable.TdcLimit.1 from 18 to 35 at 0x350                                                                                                                                                                                                                                                                      
Commiting changes to '/sys/class/drm/card1/device/pp_table'.                                                                                                                                                                                                                                                                
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:                                                                                                                                                                                                                                                                                                           
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                                                                                                                                                                               
0: 500Mhz                                                                                                                                                                                                                                                                                                                   
1: 800Mhz *                                                                                                                                                                                                                                                                                                                 
2: 950Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk                                                                                                                                                                                                                                                               
0: 96Mhz                                                                                                                                                                                                                                                                                                                    
1: 541Mhz                                                                                                                                                                                                                                                                                                                   
2: 675Mhz                                                                                                                                                                                                                                                                                                                   
3: 875Mhz *
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/FreqTableSocclk/1=980 --write                                                                                                                                                                                                                 
Changing smc_pptable.FreqTableSocclk.1 from 1280 to 980 at 0x570                                                                                                                                                                                                                                                            
Commiting changes to '/sys/class/drm/card1/device/pp_table'.                                                                                                                                                                                                                                                                
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage                                                                                                                                                                                                                                                         
OD_VDDGFX_OFFSET:
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                                                                                                                                                                               
0: 500Mhz                                                                                                                                                                                                                                                                                                                   
1: 800Mhz *                                                                                                                                                                                                                                                                                                                 
2: 950Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk                                                                                                                                                                                                                                                               
0: 96Mhz                                                                                                                                                                                                                                                                                                                    
1: 541Mhz                                                                                                                                                                                                                                                                                                                   
2: 675Mhz                                                                                                                                                                                                                                                                                                                   
3: 875Mhz *  
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/MinVoltageSoc=2720 --write                                                                                                                                                                                                                    
Changing smc_pptable.MinVoltageSoc from 3224 to 2720 at 0x3a8                                                                                                                                                                                                                                                               
Commiting changes to '/sys/class/drm/card1/device/pp_table'.                                                                                                                                                                                                                                                                
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage                                                                                                                                                                                                                                                         
OD_VDDGFX_OFFSET:                                                                                                                                                                                                                                                                                                           
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz                                                                                                                                                                                                                                                                                                                   
1: 800Mhz *                                                                                                                                                                                                                                                                                                                 
2: 950Mhz   
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk                                                                                                                                                                                                                                                               
0: 96Mhz                                                                                                                                                                                                                                                                                                                    
1: 541Mhz                                                                                                                                                                                                                                                                                                                   
2: 675Mhz                                                                                                                                                                                                                                                                                                                   
3: 875Mhz *     
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/MaxVoltageSoc=3200 --write                                                                                                                                                                                                                    
Changing smc_pptable.MaxVoltageSoc from 4200 to 3200 at 0x3ac                                                                                                                                                                                                                                                               
Commiting changes to '/sys/class/drm/card1/device/pp_table'.                                                                                                                                                                                                                                                                
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:                                                                                                                                                                                                                                                                                                           
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                                                                                                                                                                               
0: 500Mhz                                                                                                                                                                                                                                                                                                                   
1: 800Mhz *                                                                                                                                                                                                                                                                                                                 
2: 950Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk                                                                                                                                                                                                                                                               
0: 96Mhz                                                                                                                                                                                                                                                                                                                    
1: 541Mhz                                                                                                                                                                                                                                                                                                                   
2: 675Mhz                                                                                                                                                                                                                                                                                                                   
3: 875Mhz *  
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set /power_saving_clock/max/2=910 --write                                                                                                                                                                                                                     
Changing power_saving_clock.max.2 from 875 to 910 at 0x03e                                                                                                                                                                                                                                                                  
Commiting changes to '/sys/class/drm/card1/device/pp_table'.                                                                                                                                                                                                                                                                
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage                                                                                                                                                                                                                                                         
OD_VDDGFX_OFFSET:                                                                                                                                                                                                                                                                                                           
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                                                                                                                                                                               
0: 500Mhz                                                                                                                                                                                                                                                                                                                   
1: 800Mhz *                                                                                                                                                                                                                                                                                                                 
2: 950Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk                                                                                                                                                                                                                                                               
0: 96Mhz                                                                                                                                                                                                                                                                                                                    
1: 541Mhz                                                                                                                                                                                                                                                                                                                   
2: 675Mhz                                                                                                                                                                                                                                                                                                                   
3: 875Mhz *                                                                                                                                                                                                                                                                                                                 
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/DcModeMaxFreq/2=910 --write                                                                                                                                                                                                                   
Changing smc_pptable.DcModeMaxFreq.2 from 875 to 910 at 0x62e                                                                                                                                                                                                                                                               
Commiting changes to '/sys/class/drm/card1/device/pp_table'.                                                                                                                                                                                                                                                                
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage                                                                                                                                                                                                                                                         
OD_VDDGFX_OFFSET:                                                                                                                                                                                                                                                                                                           
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/# 
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                                                                                                                                                                               
0: 500Mhz                                                                                                                                                                                                                                                                                                                   
1: 800Mhz *                                                                                                                                                                                                                                                                                                                 
2: 950Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk                                                                                                                                                                                                                                                               
0: 96Mhz                                                                                                                                                                                                                                                                                                                    
1: 541Mhz                                                                                                                                                                                                                                                                                                                   
2: 675Mhz                                                                                                                                                                                                                                                                                                                   
3: 875Mhz *
root@rig9CF1BB:/# upp -p /sys/class/drm/card1/device/pp_table set smc_pptable/FreqTableUclk/3=900 --write                                                                                                                                                                                                                   
Changing smc_pptable.FreqTableUclk.3 from 875 to 900 at 0x584                                                                                                                                                                                                                                                               
Commiting changes to '/sys/class/drm/card1/device/pp_table'.                                                                                                                                                                                                                                                                
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage                                                                                                                                                                                                                                                         
OD_VDDGFX_OFFSET:                                                                                                                                                                                                                                                                                                           
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                                                                                                                                                                               
0: 500Mhz                                                                                                                                                                                                                                                                                                                   
1: 800Mhz *                                                                                                                                                                                                                                                                                                                 
2: 950Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk                                                                                                                                                                                                                                                               
0: 96Mhz                                                                                                                                                                                                                                                                                                                    
1: 541Mhz                                                                                                                                                                                                                                                                                                                   
2: 675Mhz *                                                                                                                                                                                                                                                                                                                 
3: 900Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# echo high > /sys/class/drm/card1/device/power_dpm_force_performance_level                                                                                                                                                                                                                                 
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage                                                                                                                                                                                                                                                         
OD_VDDGFX_OFFSET:                                                                                                                                                                                                                                                                                                           
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                                                                                                                                                                               
0: 500Mhz                                                                                                                                                                                                                                                                                                                   
1: 800Mhz *                                                                                                                                                                                                                                                                                                                 
2: 950Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk                                                                                                                                                                                                                                                               
0: 96Mhz                                                                                                                                                                                                                                                                                                                    
1: 541Mhz                                                                                                                                                                                                                                                                                                                   
2: 675Mhz *                                                                                                                                                                                                                                                                                                                 
3: 900Mhz                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# echo c > /sys/class/drm/card1/device/pp_od_clk_voltage                                                                                                                                                                                                                                                    
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# sleep 5s                                                                                                                                                                                                                                                                                                  
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:                                                                                                                                                                                                                                                                                                           
0mV                                                                                                                                                                                                                                                                                                                         
OD_RANGE:                                                                                                                                                                                                                                                                                                                   
root@rig9CF1BB:/#                                                                                                                                                                                                                                                                                                           
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                                                                                                                                                                               
0: 500Mhz *                                                                                                                                                                                                                                                                                                                 
1: 500Mhz * 
sibradzic commented 3 years ago

AMD Driver A20.40 (5.11.1001): this driver is known to be heavily modded for the work these machines do. I've procured an RX 6600 (Hynix) that works up to 950 MHz without problems. Depending on the type of memory, some RX cards can reach 1075 MHz (Samsung), and I have seen plenty of RX 6600 XT cards working with memory up to 1200 MHz.

First of all, none of your problems are related to upp; as far as I can tell this is all due to the kernel and the card firmware / BIOS. The non-mainline AMD (pro?) driver is known to have issues with power management, over/under clocking/volting included, especially when compiled against not-so-fresh Linux kernels. The Pro W6600 is a very recent card. If you want to improve and test some things, please stick one of your cards into a machine running a recent distro and an upstream kernel driver (such as Ubuntu 21.10 + latest stable kernel PPA, or Manjaro using 'open-source' graphics) and try changing the table with upp.

Finally, I created an SPPT with MorePowerTool on Windows 10 (latest AMD driver 21.1x.x) that lets the W6600 reach a clock of 1075 MHz, but after installing more than 3 GPUs, all GPUs somehow fall back to memory DPM state 2.

If you still have Windows installed on some partition, you can use upp to read the power-play table exactly as MorePowerTool set it. It may be a good reference for how to set everything correctly with upp in Linux. If one card works as expected, all should work in the same way, unless there is some driver problem, which I cannot help you with...

I control all machines from a remote shell and can send multiple commands at once, but if I change too many GPU parameters at the same time, the GPU risks dropping into the single GFX state at 500 MHz (safe state?).

There is no need to change power-play parameters one by one. You actually expose the driver to more risk that way, as the driver has to restart power management every time the power-play table is updated, so it is safer to update the table just once.
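A minimal sketch of the "update the table just once" idea: collect every change and pass them all to a single `upp ... set --write` call, which is the invocation form used elsewhere in this thread. The helper name `build_upp_cmd` is hypothetical, and the keys and values in the example are only illustrative; the function prints the command instead of running it so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) one upp invocation that
# applies all requested changes in a single table write.
# Usage: build_upp_cmd <pp_table path> <key=value>...
build_upp_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: build_upp_cmd <pp_table> <key=value>..." >&2
        return 1
    fi
    pp_table=$1
    shift
    printf 'upp -p %s set --write' "$pp_table"
    for pair in "$@"; do
        printf ' %s' "$pair"
    done
    printf '\n'
}

# Illustrative values only; pick keys and clocks that suit your card.
build_upp_cmd /sys/class/drm/card1/device/pp_table \
    smc_pptable/FreqTableGfx/1=950 \
    smc_pptable/FreqTableUclk/3=950
```

Printing the command first keeps the risky write as a separate, deliberate step.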

according to the bottom of your output:

 . . .
root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk                                                                                                              
0: 500Mhz
1: 800Mhz *
2: 950Mhz

root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_mclk
0: 96Mhz
1: 541Mhz
2: 675Mhz *
3: 900Mhz

root@rig9CF1BB:/# echo c > /sys/class/drm/card1/device/pp_od_clk_voltage

root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:

root@rig9CF1BB:/# cat /sys/class/drm/card1/device/pp_dpm_sclk
0: 500Mhz *
1: 500Mhz *

it is obvious that none of the upp commands are causing your issue. The issue is caused by echo c > /sys/class/drm/card1/device/pp_od_clk_voltage, which seems to mess up the card's clock & power management. This is a driver issue, please check the dmesg output for details, and consider reporting this to AMD. Have you checked what happens when you don't run any upp commands at all, but only (followed by reboot):

echo high > /sys/class/drm/card1/device/power_dpm_force_performance_level
echo c > /sys/class/drm/card1/device/pp_od_clk_voltage

?
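To make that check repeatable, here is a rough sketch that reports whether the GFX DPM table has collapsed to a single clock, as in the `0: 500Mhz *` / `1: 500Mhz *` output quoted above. The function name `dpm_collapsed` is hypothetical, and the parsing assumes the `N: <freq>Mhz` line format shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when a pp_dpm_* file lists two or
# more states that all share the same clock, e.g. "0: 500Mhz *" / "1: 500Mhz *".
dpm_collapsed() {
    awk '
        /^[0-9]+:/ {
            freq = $2
            sub(/[Mm][Hh][Zz]$/, "", freq)   # "500Mhz" -> "500"
            seen[freq] = 1
            n++
        }
        END {
            distinct = 0
            for (f in seen) distinct++
            # Collapsed: several states listed, but only one distinct clock.
            exit !(n > 1 && distinct == 1)
        }
    ' "$1"
}

# Only runs on a machine that actually exposes this sysfs file.
f=/sys/class/drm/card1/device/pp_dpm_sclk
if [ -r "$f" ] && dpm_collapsed "$f"; then
    echo "GFX DPM states collapsed to a single clock; check dmesg for amdgpu errors"
fi
```

Running it once after the two `echo` commands alone, and once after the upp changes, shows which step triggers the collapse.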

Also, if you really want to use the pp_od_clk_voltage sysfs API to control the card, please use the latest mainline kernel driver and the latest linux-firmware, and consider using a dedicated tool for controlling that interface, such as https://github.com/sibradzic/amdgpu-clocks.

If none of the suggestions are feasible for you, you can always send me one of the cards so I can test everything myself ;)

sibradzic commented 2 years ago

@Kenzo95 ping

oiG8Uchi commented 2 years ago

I can reproduce this issue on Navi 22 and Navi 21. The kernel version is 5.11.22 plus several commits from the 5.12 kernel to open the pp_od_clk_voltage interface. The other kernel versions I tried include 5.10.y plus the DKMS sources and firmware contained in the official 21.40.1 drivers; none of them lets me make the Navi 22 and Navi 21 work well at the same time on one PC.

When I trigger a GPU error (for example by wrongly modifying other parts of the pptable, or by using another kernel version), it seems that a GPU protection mechanism is activated, and this mechanism cannot be deactivated by restarting the whole system.

The default pp_dpm_mclk interface on Navi 22 is:

0: 96Mhz
1: 456Mhz
2: 675Mhz
3: 1000Mhz *

If the GPU triggers the protection mechanism, then even if "smc_pptable/FreqTableUclk/3" is changed from 1000 to 1001, pp_dpm_mclk is forced to 675 and cannot be forced to 1001 with:

echo "3" > pp_dpm_mclk

But if I execute the commands in the following order, I can successfully set the correct memory frequency:

echo "m 1 1075" > /sys/class/drm/card$i/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card$i/device/pp_od_clk_voltage
upp -p /sys/class/drm/card$i/device/pp_table set --write smc_pptable/FreqTableGfx/1=1250 smc_pptable/FreqTableUclk/3=1075
echo "r" > /sys/class/drm/card$i/device/pp_od_clk_voltage
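The command order described in this comment can be sketched as a small dry-run helper, parameterised by card index and target clocks. The function name `print_mem_oc_sequence` is hypothetical; it only prints the four commands in order and does not touch sysfs, so the output can be inspected before anything is piped to a root shell.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the four-step sequence for one card.
# Usage: print_mem_oc_sequence <card index> <gfx MHz> <uclk MHz>
print_mem_oc_sequence() {
    dev="/sys/class/drm/card$1/device"
    gfx=$2
    uclk=$3
    # 1. request the memory clock through the overdrive interface
    printf 'echo "m 1 %s" > %s/pp_od_clk_voltage\n' "$uclk" "$dev"
    # 2. commit the overdrive change
    printf 'echo "c" > %s/pp_od_clk_voltage\n' "$dev"
    # 3. write both table entries in a single upp call
    printf 'upp -p %s/pp_table set --write smc_pptable/FreqTableGfx/1=%s smc_pptable/FreqTableUclk/3=%s\n' \
        "$dev" "$gfx" "$uclk"
    # 4. reset the overdrive interface
    printf 'echo "r" > %s/pp_od_clk_voltage\n' "$dev"
}

# Values from the comment above, for card1.
print_mem_oc_sequence 1 1250 1075
```

Whether this order helps on a given card and driver is untested here; later replies in this thread report that it did not help on the W6600.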

Kenzo95 commented 2 years ago

Sorry for not getting back to you for some weeks. I've also run my own tests without success, trying to hard-flash some other Navi 23 BIOSes onto that card. I'll try @oiG8Uchi's solution today and let you know the results. If you want to try remote access to one of our machines, you are welcome. Thanks.

Kenzo95 commented 2 years ago

After testing @oiG8Uchi's method, the GPU goes into protection anyway, with this result:

=== GPU 0, 03:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 500 MHz 625mV, Mem: 1074 MHz
  PerfCtrl: manual, Load: 99%, MemLoad: 21%, Power: 32.0 W, Cap: 65 W
  Core: 50°C, HotSpot: 52°C, Mem: 58°C, Fan: 29%, RPM: 1710
  Core state: 0, clocks: 500* 500
  Mem  state: 3, clocks: 96 541 675 1074*
  SOC  state: 1, clocks: 872 1200*
  DCEF state: 1, clocks: 417 960* 1200
  F    state: 0, clocks: 1551* 1801
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4786.97 MB, free: 3389.03 MB, type: Samsung GDDR6

I've also tried the BIOS of an ASRock RX 6600 that can be overclocked to 950 MHz, and the W6600 with the ASRock RX 6600 BIOS boots up successfully. But in any case I can't overclock the memory clock beyond 875 MHz without the protection kicking in. I've also tested a Sapphire RX 6600 BIOS and an RX 6600 XT BIOS: every time, the GPU works fine as long as the memory clock stays at 875 MHz or below, but with just 1 MHz more the protection kicks in. Maybe the protection is in hardware?

Strangely, on Windows (drivers tested: 21.11.x), if I run the SOC clock at exactly this voltage, "731mv", I can overclock the memory clock up to 990 or 1075 MHz. But if I add more than 3 GPUs, the protection kicks in or the memory drops to DPM state 2.

oiG8Uchi commented 2 years ago

I also have two ASRock RX 6600 (non-XT, Navi 23) cards, but the situation is different from Navi 21 and Navi 22. If the protection mechanism (as I guess it to be) is triggered, then after the smc_pptable/FreqTableGfx/1=950 command is executed using upp, the pp_dpm_sclk interface becomes:

0: 500Mhz
1: 945Mhz *
2: 950Mhz

and I can't use the pp_dpm_sclk interface to lock the frequency to 950 MHz. I can also run:

echo "s 1 950" > /sys/class/drm/card1/device/pp_od_clk_voltage
echo "m 1 950" > /sys/class/drm/card1/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card1/device/pp_od_clk_voltage
upp -p /sys/class/drm/card1/device/pp_table set --write smc_pptable/FreqTableGfx/1=950 smc_pptable/FreqTableUclk/3=950
echo "r" > /sys/class/drm/card1/device/pp_od_clk_voltage

to restore the sclk of Navi 23 to 950 MHz. The difference is that this PC uses the 5.10.84 kernel plus the DKMS sources and firmware contained in the official 21.40.1 drivers.

oiG8Uchi commented 2 years ago

I think HiveOS may not be appropriate for your testing. I have not looked at how they modify the pptable or change the kernel source code. Maybe you should test with a general-purpose Linux distribution plus the latest 5.15 kernel. At least my RX 6600 works well with the 5.15 kernel on my Gentoo Linux, although my RX 6800 XT hits a message saying it cannot exit BACO status.

asqw6677 commented 2 years ago

After testing @oiG8Uchi's method, the GPU goes into protection anyway, with this result:

=== GPU 0, 03:00.0 Radeon Pro W6600 8176 MB ===
  Bios: 113-D5330100-100
  Core: 500 MHz 625mV, Mem: 1074 MHz
  PerfCtrl: manual, Load: 99%, MemLoad: 21%, Power: 32.0 W, Cap: 65 W
  Core: 50°C, HotSpot: 52°C, Mem: 58°C, Fan: 29%, RPM: 1710
  Core state: 0, clocks: 500* 500
  Mem  state: 3, clocks: 96 541 675 1074*
  SOC  state: 1, clocks: 872 1200*
  DCEF state: 1, clocks: 417 960* 1200
  F    state: 0, clocks: 1551* 1801
  PCIE Link speed: n/a, PCIE Link width: n/a
  Memory total: 8176.00 MB, used: 4786.97 MB, free: 3389.03 MB, type: Samsung GDDR6

I've also tried the BIOS of an ASRock RX 6600 that can be overclocked to 950 MHz, and the W6600 with the ASRock RX 6600 BIOS boots up successfully. But in any case I can't overclock the memory clock beyond 875 MHz without the protection kicking in. I've also tested a Sapphire RX 6600 BIOS and an RX 6600 XT BIOS: every time, the GPU works fine as long as the memory clock stays at 875 MHz or below, but with just 1 MHz more the protection kicks in. Maybe the protection is in hardware?

Strangely, on Windows (drivers tested: 21.11.x), if I run the SOC clock at exactly this voltage, "731mv", I can overclock the memory clock up to 990 or 1075 MHz. But if I add more than 3 GPUs, the protection kicks in or the memory drops to DPM state 2.

I'm an RX 6600 user. Having read what you have done above, I would like to know how you flashed the Navi 23 VBIOS; the tools I found either don't recognize the card or can't force-flash it.