hattedsquirrel / ryzen_monitor

Monitor power information of Ryzen processors via the PM table of the SMU
GNU Affero General Public License v3.0

Incompatible with Ryzen 9 3950X #2

Open KeithMyers opened 3 years ago

KeithMyers commented 3 years ago

Just tried this new driver and monitor in advance for a friend who is getting a Ryzen 5950X.

I ran it against my Ryzen 3950X and got this error message.

keith@Serenity:~/Downloads/ryzen_monitor/src$ sudo ./ryzen_monitor
[sudo] password for keith: 
rd_buf: 0.1.1
**SMU Driver Version Incompatible With Library Version**
keith@Serenity:~/Downloads/ryzen_monitor/src$ 

hattedsquirrel commented 3 years ago

Thanks for the info. Apparently ryzen_smu was updated to a new version 5 days ago. I'll have to look into that later. We currently expect to see v0.1.0, but your driver reports 0.1.1.
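The "Incompatible" message comes down to a version comparison between the driver and the tool. Here is a minimal sketch of such a check from the shell. It assumes the driver publishes its version string in a file named `version` under /sys/kernel/ryzen_smu_drv; that file name, the `SMU_DIR` variable and the `check_smu_version` helper are illustrative assumptions, not code from ryzen_monitor.

```bash
# Sketch only: compare the loaded ryzen_smu version with the versions a tool accepts.
# Assumption: the version string lives in "$SMU_DIR/version".
SMU_DIR="${SMU_DIR:-/sys/kernel/ryzen_smu_drv}"

# Usage: check_smu_version ACCEPTED_VERSION...
check_smu_version() {
    local found accepted
    if [ ! -r "$SMU_DIR/version" ]; then
        echo "cannot read $SMU_DIR/version (is ryzen_smu loaded?)" >&2
        return 2
    fi
    # Drop the trailing newline or NUL so the comparison is exact.
    found=$(tr -d '\000\n' < "$SMU_DIR/version")
    for accepted in "$@"; do
        if [ "$found" = "$accepted" ]; then
            echo "ryzen_smu $found is supported"
            return 0
        fi
    done
    echo "ryzen_smu $found is not one of: $*" >&2
    return 1
}
```

Calling `check_smu_version 0.1.0 0.1.1` would accept either of the two driver versions mentioned in this thread.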

Regarding your CPUs: The 3950X will not work out of the box right now. You'd have to create a pm_table mapping first. The 5950X, on the other hand, should just work (given that the SMU driver version matches). I tested with the 5900X, which is essentially the same chip but with 4 cores permanently disabled.

KeithMyers commented 3 years ago

Will continue to watch this repo for updates. Thanks for the quick reply.

hattedsquirrel commented 3 years ago

I just checked in an update that also works with ryzen_smu v0.1.1. You should now probably get a message saying that the table version is not supported.

If you are willing to provide pm_table dumps I can take a look and see how easy it is to guess the changes compared to the existing 3700X table.

You can create dumps by running the following script in bash (make sure you have read access to /sys/kernel/ryzen_smu_drv/pm_table and /sys/kernel/ryzen_smu_drv/pm_table_version first):

# Record the PM table version as a hex string.
xxd -p /sys/kernel/ryzen_smu_drv/pm_table_version > dump_pm_version
# Let the system settle, then capture an idle dump.
sleep 5
cat /sys/kernel/ryzen_smu_drv/pm_table > dump_idle.bin

# One busy thread: dump right after the load starts (a) and again 5 s later (b).
yes > /dev/null &
sleep 0.5
cat /sys/kernel/ryzen_smu_drv/pm_table > dump_1Ta.bin
sleep 5
cat /sys/kernel/ryzen_smu_drv/pm_table > dump_1Tb.bin

# Two busy threads.
yes > /dev/null &
sleep 0.5
cat /sys/kernel/ryzen_smu_drv/pm_table > dump_2Ta.bin
sleep 5
cat /sys/kernel/ryzen_smu_drv/pm_table > dump_2Tb.bin

# 30 more busy threads, for 32 in total.
for i in {1..30}; do (yes > /dev/null &); done
sleep 0.5
cat /sys/kernel/ryzen_smu_drv/pm_table > dump_32Ta.bin
sleep 5
cat /sys/kernel/ryzen_smu_drv/pm_table > dump_32Tb.bin

# Stop all load generators.
killall yes

Then pack all dump_* files and attach the archive. Thanks.
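One possible way to do the packing step is sketched below. The `pack_dumps` helper and the default archive name are made up for illustration; any archiver would do just as well.

```bash
# Sketch only: bundle every dump_* file in a directory into one gzip-compressed tarball.
# Usage: pack_dumps [DIR] [ARCHIVE]   (helper name and defaults are illustrative)
pack_dumps() {
    local dir="${1:-.}" out="${2:-pm_dump.tar.gz}" f
    set --                      # reuse the positional parameters as the file list
    for f in "$dir"/dump_*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        if [ -e "$f" ]; then
            set -- "$@" "${f##*/}"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "no dump_* files found in $dir" >&2
        return 1
    fi
    # -C stores bare file names instead of directory prefixes.
    tar -czf "$out" -C "$dir" "$@" || return 1
    echo "packed $# files into $out"
}
```

Running `pack_dumps` in the directory where the script above was executed should leave a single archive that can be attached to the issue.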

KeithMyers commented 3 years ago

OK, here is the archive of the dump* files. pm_dump.zip

hattedsquirrel commented 3 years ago

Could you test this patch? https://hattedsquirrel.net/downloads/ryzen_3950x-01.patch

KeithMyers commented 3 years ago

Your patch file is corrupted at the end.

keith@Serenity:~/Downloads/ryzen_monitor/src$ patch < ryzen_3950x-01.patch
patching file pm_tables.c
patching file pm_tables.h
patching file ryzen_monitor.c
Hunk #1 succeeded at 275 (offset -2 lines).
Hunk #2 succeeded at 303 with fuzz 2 (offset -2 lines).
Hunk #3 succeeded at 314 (offset -4 lines).
Hunk #4 FAILED at 474.
Hunk #5 FAILED at 490.
2 out of 5 hunks FAILED -- saving rejects to file ryzen_monitor.c.rej

hattedsquirrel commented 3 years ago

Pull the newest commits, then try again. I checked in some changes yesterday. Sorry about not mentioning that.
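Failed hunks like the ones above usually mean the patch was made against a newer checkout. A dry run can catch that before any file is modified. The sketch below relies on GNU patch's `--dry-run` option; the `apply_patch_checked` helper name is invented for illustration.

```bash
# Sketch only: test a patch with a dry run before touching the working tree.
# Usage: apply_patch_checked PATCHFILE   (run from the directory the patch targets)
apply_patch_checked() {
    local patchfile="$1"
    # --dry-run (GNU patch) reports what would happen without changing files.
    if patch --dry-run < "$patchfile" > /dev/null 2>&1; then
        patch < "$patchfile"
    else
        echo "$patchfile does not apply cleanly; update the checkout (git pull) and retry" >&2
        return 1
    fi
}
```

Note that a dry run still passes when hunks apply with an offset or fuzz, as hunks #1 to #3 did above; it only stops on hunks that fail outright.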

KeithMyers commented 3 years ago

Ok, much better. Works now.

╭───────────────────────────────────────────────┬────────────────────────────────────────────────╮
│ CPU Model                                     │ AMD Ryzen 9 3950X 16-Core Processor            │
│ Processor Code Name                           │ Matisse                                        │
│ Cores                                         │ 16                                             │
│ Core CCDs                                     │ 2                                              │
│ Core CCXs                                     │ 4                                              │
│ Cores Per CCX                                 │ 4                                              │
│ SMU FW Version                                │ v46.67.0                                       │
│ MP1 IF Version                                │ v11                                            │
╰───────────────────────────────────────────────┴────────────────────────────────────────────────╯
╭─────────┬────────────┬──────────┬─────────┬──────────┬─────────────┬─────────────┬─────────────╮
│ Core 0  │   4300 MHz |  5.843 W | 1.275 V |  66.72 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 1  │   4300 MHz |  6.275 W | 1.275 V |  72.93 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 2  │   4300 MHz |  5.881 W | 1.275 V |  67.52 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 3  │   4300 MHz |  6.287 W | 1.275 V |  73.19 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 4  │   4300 MHz |  5.805 W | 1.275 V |  66.22 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 5  │   4300 MHz |  6.225 W | 1.275 V |  72.45 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 6  │   4300 MHz |  5.481 W | 1.275 V |  65.69 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 7  │   4300 MHz |  5.775 W | 1.275 V |  71.73 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 8  │   4275 MHz |  5.385 W | 1.275 V |  70.92 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 9  │   4275 MHz |  4.990 W | 1.275 V |  64.93 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 10 │   4275 MHz |  5.665 W | 1.275 V |  70.95 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 11 │   4275 MHz |  5.058 W | 1.275 V |  63.72 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 12 │   4275 MHz |  5.639 W | 1.275 V |  72.14 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 13 │   4275 MHz |  5.797 W | 1.275 V |  68.86 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 14 │   4275 MHz |  5.814 W | 1.275 V |  73.11 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
│ Core 15 │   4275 MHz |  5.864 W | 1.275 V |  69.81 C | C0: 100.0 % | C1:   0.0 % | C6:   0.0 % │
╰─────────┴────────────┴──────────┴─────────┴──────────┴─────────────┴─────────────┴─────────────╯
╭── Core Statistics (Calculated) ───────────────┬────────────────────────────────────────────────╮
│ Highest Effective Core Frequency              │ 4300 MHz                                       │
│ Highest Core Temperature                      │ 73.19 C                                        │
│ Highest Core Voltage                          │ 1.275 V                                        │
│ Average Core Voltage                          │ 0.000 V                                        │
│ Average Core CC6                              │ 0.00 %                                         │
│ Total Core Power Sum                          │ 91.7840 W                                      │
├── Reported by SMU ────────────────────────────┼────────────────────────────────────────────────┤
│ Peak Core Voltage                             │ 1.275 V                                        │
│ Package CC6                                   │ 0.00 %                                         │
╰───────────────────────────────────────────────┴────────────────────────────────────────────────╯
╭── Electrical & Thermal Constraints ───────────┬────────────────────────────────────────────────╮
│ Peak Temperature                              │ 75.50 C                                        │
│ SoC Temperature                               │ 37.55 C                                        │
│ Voltage from Core VRM                         │ 1.100 V | 1.442 V | 76.27 %                    │
│ PPT                                           │ 174.971 W | 142 W | 123.22 %                   │
│ TDC Value                                     │ 113.832 A | 95 A | 119.82 %                    │
│ TDC Actual                                    │ 90.914 A | 95 A | 95.70 %                      │
│ EDC                                           │ 139.999 A | 140 A | 100.00 %                   │
│ THM                                           │ 74.18 C | 95 C | 78.09 %                       │
│ FIT                                           │ 0 | 258 | 0.01 %                               │
╰───────────────────────────────────────────────┴────────────────────────────────────────────────╯
╭── Memory Interface ───────────────────────────┬────────────────────────────────────────────────╮
│ Coupled Mode                                  │ ON                                             │
│ Fabric Clock (Average)                        │ 1800 MHz                                       │
│ Fabric Clock                                  │ 1800 MHz                                       │
│ Uncore Clock                                  │ 1800 MHz                                       │
│ Memory Clock                                  │ 1800 MHz                                       │
│ cLDO_VDDM                                     │ 0.9504 V                                       │
│ cLDO_VDDP                                     │ 0.9002 V                                       │
│ cLDO_VDDG                                     │ 1.0477 V                                       │
╰───────────────────────────────────────────────┴────────────────────────────────────────────────╯
╭── Power Consumption ──────────────────────────┬────────────────────────────────────────────────╮
│ Total Core Power Sum                          │ 91.7840 W                                      │
│ VDDCR_SOC Power                               │ 19.3636 W                                      │
│ GMI2_VDDG Power                               │ 8.7156 W                                       │
│ L3 Logic Power                                │ 0.517 W + 0.5365 W                             │
│ L3 Logic Power                                │ + 0.386 W + 0.3332 W = 1.7727 W                │
│ L3 VDDM Power                                 │ 0.350 W + 0.3510 W                             │
│ L3 VDDM Power                                 │ + 0.369 W + 0.3652 W = 1.4350 W                │
│                                               │                                                │
│ VDDIO_MEM Power                               │ 8.6723 W                                       │
│ IOD_VDDIO_MEM Power                           │ 0.0000 W                                       │
│ DDR_VDDP Power                                │ 5.1823 W                                       │
│ VDD18 Power                                   │ 0.8000 W                                       │
│                                               │                                                │
│ Calculated Thermal Output                     │ 137.7255 W                                     │
├── Additional Reports ─────────────────────────┼────────────────────────────────────────────────┤
│ SoC Power (SVI2)                              │ 1.094 V | 17.704 A | 19.364 W                  │
│ Core Power (SVI2)                             │ 1.275 V | 113.817 A | 145.117 W                │
│ Core Power (SMU)                              │ 145.117 W                                      │
│ Socket Power (SMU)                            │ 174.9525 W                                     │
│ Package Power (SMU)                           │ nan W                                          │
╰───────────────────────────────────────────────┴────────────────────────────────────────────────╯

hattedsquirrel commented 3 years ago

Okay, cool. Thanks for the help and the screenshot. It also pointed out a bug in the calculation of "Average Core Voltage", which I have now fixed. I'll push all changes online now.

KeithMyers commented 3 years ago

Ok, I'll pull the newest commit and test it for the missing average voltage value.

Was reading through the commit and noticed that you are limiting the application only to Ryzen parts.

Ever consider adding Epyc parts? You are hard coding a core limit of 16. My Epyc 7402P has 24 cores.

Would be nice to have the application usable on Epyc parts also.

KeithMyers commented 3 years ago

All good. Average Core Voltage is now populated with an actual value.

hattedsquirrel commented 3 years ago

The only reason Epyc isn't supported right now is that I don't know anything about those parts. The first step would be to find out which SMN registers to read and to see whether they differ from the Ryzen series. Those registers are read to find out how many CCDs there are and which cores are disabled. If you are brave enough, you can build and run the attached util and paste its output. (It also depends on the ryzen_smu kernel driver.) Maybe the registers look similar enough to the Ryzen series. smn_debug.tar.gz
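For readers who want to poke at a single register without building the util, here is a rough shell sketch. It assumes that ryzen_smu exposes an `smn` file under /sys/kernel/ryzen_smu_drv, that writing a 4-byte little-endian address to it selects a register, and that reading the file back returns the 32-bit value. Check that behaviour against the ryzen_smu documentation before relying on it. The `SMN_FILE` variable and both helper names are illustrative.

```bash
# Sketch only: read one 32-bit SMN register through ryzen_smu.
# Assumption: writing a 4-byte little-endian address to the "smn" file selects a
# register, and reading the file back returns its 32-bit value.
SMN_FILE="${SMN_FILE:-/sys/kernel/ryzen_smu_drv/smn}"

# Print a printf format that emits a 32-bit value as four little-endian bytes.
le32_format() {
    local v=$(( $1 ))
    printf '\\%03o\\%03o\\%03o\\%03o' \
        $(( v & 255 )) $(( (v >> 8) & 255 )) $(( (v >> 16) & 255 )) $(( (v >> 24) & 255 ))
}

# Usage: smn_read 0x05d218   (needs root on real hardware)
smn_read() {
    local addr=$(( $1 ))
    printf "$(le32_format "$addr")" > "$SMN_FILE" || return 1
    # od prints the four bytes in file order; reverse them to show the value.
    set -- $(od -An -tx1 -N4 "$SMN_FILE")
    [ "$#" -eq 4 ] || return 1
    printf 'read %06x: %s%s%s%s\n' "$addr" "$4" "$3" "$2" "$1"
}
```

The output line deliberately mirrors the `read ADDR: VALUE` format printed by smn_debug, so results from both can be compared side by side.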

KeithMyers commented 3 years ago

I'll give it a shot. Glad to help developers with hardware testing.

KeithMyers commented 3 years ago

Here is the smn_debug output from my AMD Epyc 7402P cpu.

ryzen_smu version string: 0.1.1
fam: 0x17
model: 0x31
logical_cores: 48
threads_per_core: 2
read 05d218: 02850a14, ret = OK
read 05d228: 95400000, ret = OK
read 05d258: 00000000, ret = OK
read 05d21c: 09120a14, ret = OK
read 05d22c: 0000002a, ret = OK
read 05d25c: 24401e81, ret = OK
read 30081800: 00000000, ret = OK
read 30081d98: 00000000, ret = OK
read 31081800: ffffffff, ret = OK
read 31081d98: ffffffff, ret = OK
read 32081800: ffffffff, ret = OK
read 32081d98: ffffffff, ret = OK
read 33081800: ffffffff, ret = OK
read 33081d98: ffffffff, ret = OK
read 34081800: 00000000, ret = OK
read 34081d98: 00000000, ret = OK
read 35081800: ffffffff, ret = OK
read 35081d98: ffffffff, ret = OK
read 36081800: ffffffff, ret = OK
read 36081d98: ffffffff, ret = OK
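Output like this is easier to compare across machines once it is tabulated. The sketch below pulls the `read ADDR: VALUE` pairs out of pasted smn_debug output and flags all-ones values for a closer look. What an all-ones read means on Epyc is not established in this thread, so the flag is only a pointer. The helper name is illustrative.

```bash
# Sketch only: tabulate "read ADDR: VALUE" pairs from smn_debug output on stdin.
# Values of ffffffff are flagged; their meaning on Epyc is an open question here.
summarize_smn_reads() {
    grep -oE 'read [0-9a-f]+: [0-9a-f]+' |
    awk '{
        addr = $2; sub(/:$/, "", addr)          # strip the trailing colon
        flag = ($3 == "ffffffff") ? "  <- all ones" : ""
        printf "%-10s %s%s\n", addr, $3, flag
        total++; if (flag != "") ones++
    }
    END { printf "%d reads, %d all ones\n", total, ones }'
}
```

Usage would be along the lines of `./smn_debug | summarize_smn_reads`, or feeding it a saved copy of the output. It works whether the reads are on separate lines or run together on one line.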

KeithMyers commented 3 years ago

Gave ryzen_monitor a what-the-hell shot on the Epyc.

ryzen_smu version string: 0.1.1
PM Tables are not supported on this platform.

hattedsquirrel commented 3 years ago

Oh, that's unfortunate. The error message means that ryzen_smu doesn't know how to read the PM table from the SMU yet. I looked into the code, and the reason seems to be that they don't know which function number to call. Maybe you could reach out to them and see if they can get it going with your help. Once ryzen_smu can read the PM table, I'm positive we can get things working on my side as well.

KeithMyers commented 3 years ago

I will do that. Zenpower module works with my 7402P. Zen Monitor also. But it does not work on the 7502 or 7642 with the higher core counts.

level1wendell commented 3 years ago

I can provide remote access to Epyc Rome and Milan if that's useful. Also, how do I contribute $ to fund further work here? (You should sign up for GitHub Sponsors?)

hattedsquirrel commented 3 years ago

Can you check if ryzen_smu provides /sys/kernel/ryzen_smu_drv/pm_table and /sys/kernel/ryzen_smu_drv/pm_table_version on your machines? This underlying support needs to be in place before we can start implementing support on our end.
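A quick way to answer that from a shell is sketched below. It assumes pm_table_version holds a single little-endian 32-bit integer, which is how the dump script earlier in the thread treats it when piping it through xxd. The `check_pm_table` helper name is illustrative.

```bash
# Sketch only: check that the driver exposes the two PM table files and print the
# table version. Assumption: pm_table_version holds one little-endian 32-bit integer.
# Usage: check_pm_table [DIR]
check_pm_table() {
    local dir="${1:-/sys/kernel/ryzen_smu_drv}" f
    for f in pm_table pm_table_version; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing or unreadable: $dir/$f" >&2
            return 1
        fi
    done
    # Reverse the four bytes so the version prints as a conventional hex number.
    set -- $(od -An -tx1 -N4 "$dir/pm_table_version")
    [ "$#" -eq 4 ] || return 1
    echo "PM table version: 0x$4$3$2$1"
}
```

If either file is missing, the function names it and returns non-zero, which matches the "PM Tables are not supported on this platform" situation reported for the 7402P.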

patrickschur commented 3 years ago

@level1wendell I would like to have access to an Epyc server. How can I reach you?

level1wendell commented 3 years ago

Email is probably the best bet. Wendell at Level1Techs dot com

On a 7742, something catastrophic happens when loading ryzen_smu with kernel 5.11 from the pve repo (Proxmox). The kernel thinks every PCIe device wants vfio-pci as its driver, and there is other nondeterministic behavior. Never seen anything like that!

Distro of choice? I'll also prep the OS image for you, and we can do this on a dedicated machine where I can swap in both Rome and Milan parts.

Don't feel rushed; the hardware will be at your disposal whenever you need it, for however long it's needed to help further the project.

It seems close on these parts.


patrickschur commented 3 years ago

@level1wendell You got an email. ;)