irusanov / ZenStates-Core

ZenStates-Core
GNU General Public License v3.0

Just curious #14

Closed PJVol closed 8 months ago

PJVol commented 8 months ago

Hi! Can you tell me whether there is anything special I should know about AM5 as a whole, or in particular about the way Zen 4 responds to SMU requests or other SMN interaction? I just tried to build and test an app on a remote user's PC, and my app crashed the PC right at startup, followed by a reboot. It does so with both versions of your lib, 1.6.7 and 1.69. The same build still works flawlessly on Vermeer and Cezanne platforms. Unfortunately, I still lack access to AM5. Thanks!

irusanov commented 8 months ago

Hi, IMO there's nothing fundamentally different. If there are changes in the SMN protocol, then I haven't noticed them.

There are the usual changes in the command IDs and registers. Here are some differences off the top of my head:

I don't know what your app is trying to do, a simple initialization of the "library" should work as I'm using it in ZenTimings and seems to be working for all SKUs, but it's not 100% tested, so there is a high chance some command ID and/or required parameter is wrong. ZenTimings is using a very small subset of available commands to identify the cpu and also transfer and refresh the power metrics table.

There are many things not working correctly in the DLL, especially core map, fuses addresses, some timings, voltages. It's too overwhelming and kind of impossible for me to keep up with all the changes with every new SKU, especially when I don't have a physical CPU/APU.

PJVol commented 8 months ago

Well, my subset is far less. Only

But I wonder about HSMP as a possible cause, which had come to my mind before. I used it to get/set the boost limit on Vermeer. Based on your source, only the GetSupported() function might have hinted at that. Besides, HSMP is described the same way as for Milan in "PPR for AMD Family 19h Model 11h Rev. B1 Processors Vol. 3 of 6".

Anyway, the app seems to crash very early; there are no write operations until the user starts to interact, except maybe writes to the SMN Index/Data registers from within the DLL. So the first thing I'll try is to build without HSMP access and see. Just curious: should the app crash the PC otherwise? ))

Btw, the core map and fuses seem accurate according to the debug report in ZT, although in the user log I saw this (not sure if this is normal for the 7800X3D, as ZT doesn't report the active/total CCD count):

CpuName:           AMD Ryzen 7 7800X3D 8-Core Processor
CodeName:          Raphael
CpuId:             00A60F12
BaseModel:         1
ExtendedModel:     96
Model:             97
Stepping:          2
FusedCoreCount:    8   <=========
PhysicalCoreCount: 16  <=========
NodesPerProcessor: 1
Threads:           16
SMT:               True
SmuVersion:        84.79.223
SmuTableVersion:   00540104
...

And in the other log, from the 7950X3D, the power table seemed incomplete to me, as if its length were determined incorrectly, but I'm not 100% sure of this.
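As a cross-check of the report above, the model fields follow from the raw CpuId value via the standard x86 CPUID signature layout (Fn0000_0001 EAX). A minimal sketch; the `CpuSignature` class and its member names are made up for illustration and are not part of ZenStates-Core:

```csharp
using System;

// Illustrative helper, not part of ZenStates-Core: decodes the CPUID
// Fn0000_0001 EAX signature using the standard x86 bit layout.
public static class CpuSignature
{
    public static uint Stepping(uint eax) => eax & 0xF;
    public static uint BaseModel(uint eax) => (eax >> 4) & 0xF;
    public static uint BaseFamily(uint eax) => (eax >> 8) & 0xF;

    // Kept shifted left by 4 so it can be added to BaseModel directly;
    // this matches the "ExtendedModel: 96" line in the report.
    public static uint ExtendedModel(uint eax) => ((eax >> 16) & 0xF) << 4;
    public static uint ExtendedFamily(uint eax) => (eax >> 20) & 0xFF;

    public static uint Model(uint eax) => ExtendedModel(eax) + BaseModel(eax);

    // The sum is valid when BaseFamily is 0xF, as it is for this CpuId.
    public static uint Family(uint eax) => BaseFamily(eax) + ExtendedFamily(eax);
}
```

For `0x00A60F12` this gives stepping 2, base model 1, extended model 96, model 97 and family 25 (19h), which agrees with the log and with the "Family 19h" PPR mentioned earlier.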

irusanov commented 8 months ago

Length of the tables is certainly not correct, I have the same length for all 7xxx table versions, but they obviously vary based on core count and other specifics.

HSMP is still available on server counterparts, but seems to be disabled on AM5. At least it was when I tested.

CCD map/count and physical core count are probably inaccurate. I don't have a X3D to test and don't have many reports anyway. Would need to work on this, but that's not a priority as it is not needed for ZenTimings.

As for the crash, it seems to be some wrong command/argument, but I can't tell if trying to access HSMP would do that. SendSmuCommand should ignore the command if the mailbox is not-defined or the cmd is not defined and in Raphael's case it is not.

Could it be the core mask for the margin is incorrect and tries to access the disabled second CCD (if there's one)? But you said it crashes before any user interaction. You can also try the CO function in the debug app to see if it applies the values correctly.

The dll initialization reads CPUID functions, tries to read the fuses, but if ZT starts then the basic initialization works, so it must be something the app tries that crashes it, the question is what...

You could try to use the specific mailbox functions instead of the generic one and see if it makes any difference.

Unfortunately, I don't have access to my Raphael system atm, so I can't test.

PS: I think people were successfully using the debug tool to set CO values for their X3D's, so the command IDs seem to be the same as on normal Raphael. Maybe the PPT/TDC/EDC/THM commands fail and cause the reboot?

PJVol commented 8 months ago

I don't have access to any Raphael systems at all to test; there are just some owners who are willing to participate.

Anyway, you still have access to my repo, and maybe (if you don't mind rather messy code) you can spot some suspicious code there with a fresh pair of eyes )

P.S. Keep in mind that the app crashes (according to one user) before any interaction occurs. I can't be sure whether the PPT, etc. commands caused the crash, or even whether they are correct, since I got them from your lib.

irusanov commented 8 months ago

I could run it on my laptop with 6800HS after commenting out the code for unsupported CPU. At least it does not crash. The only suspects in your current code that send commands are GetBoostLimit() and cpu.GetPBOScalar() in the method to update current limits. Maybe comment them out and try without them. I can see the HSMP mailbox addresses for Zen4 in the DLL and I don't remember why. I can't try at the moment if it is not active at all or it is active, but the commands get rejected. Perhaps that is the culprit and it is an oversight from my side.

I'd suggest trying the debug tool and send the same command(s) manually from the SMU tab.

The other thing to try in your code is to use the cpu.SendHsmpCommand instead of the generic SendSmuCommand. It won't execute if the Hsmp mailbox is not supported (interface version is 0). I should think about adding more checks to protect against unwanted cmd execution.

PS: I might be able to test on my own system next week.
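The kind of extra check mentioned above could look roughly like the sketch below. All names (`Mailbox`, `GuardedSender`, the `send` delegate) are hypothetical and do not reflect the actual ZenStates-Core API; the only assumption taken from this thread is that unknown command IDs and mailbox addresses are stored as 0.

```csharp
using System;

// Hypothetical types for illustration only; not the ZenStates-Core API.
public sealed class Mailbox
{
    public uint MsgAddress;   // 0 means "not defined for this CPU"
    public uint RspAddress;
    public uint ArgAddress;

    public bool IsDefined =>
        MsgAddress != 0 && RspAddress != 0 && ArgAddress != 0;
}

public static class GuardedSender
{
    // Returns false without calling the sender when the mailbox or the
    // command ID is unknown; otherwise forwards to the supplied sender.
    public static bool TrySend(Mailbox mailbox, uint commandId, uint[] args,
                               Func<Mailbox, uint, uint[], bool> send)
    {
        if (send == null) throw new ArgumentNullException(nameof(send));
        if (mailbox == null || !mailbox.IsDefined) return false;
        if (commandId == 0) return false;
        return send(mailbox, commandId, args ?? new uint[0]);
    }
}
```

The point of the design is that a zeroed command ID or mailbox never reaches the hardware path, so "unknown" is treated as "do nothing" rather than "try and see".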

PJVol commented 8 months ago

I've added a check for Raphael when reading the PBO scalar and max boost, so there's no HSMP access on start anymore. Committed all changes made.

And yet, the build has crashed the user's 7800X3D PC as before..

irusanov commented 8 months ago

Do other tools using the DLL work as expected, or do they crash as well?

Some ideas:

To crash like that it must be some wrong register write or command. Fuses are only read and not written, so I think it is highly unlikely to be the cause, but I think we need to figure out if the problem comes from the DLL itself or something that the app does afterwards. If it is the DLL, then I need to dig in and try to figure out what could be wrong, however that will be hard without an actual X3D system.

PS: Could you try to use the RSMU commands for PSM Margins on Raphael instead of MP1 (in GetPsmMargin method)? Not sure if I have tested the MP1 commands inherited from Zen3, especially on X3D SKUs. I use the RSMU command in PBO tab of the debug tool. I think that 0x48 command might be crashing it.

PJVol commented 8 months ago

Unlikely it's the Cpu instance initialization. The very first app launch was stopped at the "unsupported cpu", when I forgot to add Raphael in "if" statement, lol, meaning the Cpu() constructor has been invoked already.

As for MP1, yeah, I also just noticed that you inherit the PSM command IDs from Zen3, but for example the SetPPT IDs differ between Zen3 and Zen4, i.e. 0x3D and 0x3E, which fits your assumption well. I'm going to get rid of MP1 until the correct IDs are figured out.

irusanov commented 8 months ago

I have set them all to 0 for Zen4 until the correct commands are found, just in case. I'm planning to rework a large portion of it at some point, just not sure when. Current goal is to move all memory-related readings from ZT to the core dll.

PJVol commented 8 months ago

Just finished the necessary changes, along with some clean-up, committed and built. We're awaiting results )

Update: Nah... Crashed again, but this time after 2-3 seconds. The previous builds were crashing the PC instantly. The app window never shows up.

PJVol commented 8 months ago

I wonder, if I log to a file, will it survive the reboot? I mean, the file is not closed at that moment, so its buffer is not flushed.

mann1x commented 8 months ago

@PJVol Does CPUDoc crash as well? I remember I did something at the beginning, but I don't recall anymore what...

The last part of the log will be lost in case of a crash, or maybe corrupted. You need to set up logging to the cloud, if possible, to avoid that. There are C# libraries for OneDrive, which seems to be the most convenient and reliable option for this job.
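A local alternative worth trying alongside a cloud logger is to push every log line through to disk as it is written. The sketch below uses `FileOptions.WriteThrough` and `FileStream.Flush(true)` from the .NET base class library; the `CrashLog` class name is made up, and this cannot guarantee that the very last line survives a hard reset.

```csharp
using System;
using System.IO;
using System.Text;

// Illustrative helper: each line is written and flushed to disk immediately,
// so lines logged before a sudden reboot are more likely to survive.
public sealed class CrashLog : IDisposable
{
    private readonly FileStream _stream;

    public CrashLog(string path)
    {
        _stream = new FileStream(path, FileMode.Append, FileAccess.Write,
                                 FileShare.Read, 4096, FileOptions.WriteThrough);
    }

    public void WriteLine(string message)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(message + Environment.NewLine);
        _stream.Write(bytes, 0, bytes.Length);
        _stream.Flush(true); // true = also ask the OS to flush its file buffers
    }

    public void Dispose() => _stream.Dispose();
}
```

Writing one line before each SMU or SMN access would then show the last step that was reached before the reboot.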

PJVol commented 8 months ago

You know what? :) Just received feedback from another user who tried the same build as the first one.

Hey!
Appears to be working fine except for setting max boost.
- It does not reboot the system :)
- it reads all values correctly
- it sets all values as requested except for the max boost value.

mann1x commented 8 months ago

Do they have the same CPU?

I remember now the issue was that some commands are different on AM5 and there was one which was causing some CPU to set the PPT to 0. Others didn't, the PPT stayed unchanged. PPT to zero causes an immediate reboot on some and a super sluggish system for others.

mann1x commented 8 months ago

And I had to set all HSMP commands to 0x0 cause even trying it would cause the PPT to go to 0. I was probing something, on HSMP or SMU, and just that, trying an unsupported command, was triggering it.

PJVol commented 8 months ago

@mann1x Hi! How are things? ) The first user has a 7800X3D.

mann1x commented 8 months ago

Hey, I'm tired :) Spending the time I have free to close my build right now. It's taking too many years, too much even for my low standards!

So it could be that something on the 7800X3D, which is innocuous on the other CPU, sets the PPT to 0.

PJVol commented 8 months ago

> I remember now the issue was that some commands are different on AM5 and there was one which was causing some CPU to set the PPT to 0. Others didn't, the PPT stayed unchanged.

The thing is that the app crashes before any set command is issued, unless this PPT thing you mentioned is caused by something else. It crashed the 7800X3D owner's PC even when I temporarily disabled HSMP access. The app itself doesn't set anything without user interaction; it just reads the limits from the PowerPlay table plus the scalar via RSMU, and reads the VCO margins, also via RSMU.

ZenTimings 1.31.1192 Debug Report
Core Version: 1.69.0

######################################################
System Info
######################################################
OS:                Microsoft Windows 11 Pro
CpuName:           AMD Ryzen 7 7800X3D 8-Core Processor
CodeName:          Raphael
CpuId:             00A60F12
BaseModel:         1
ExtendedModel:     96
Model:             97
Stepping:          2
FusedCoreCount:    8
PhysicalCoreCount: 16
NodesPerProcessor: 1
Threads:           16
SMT:               True
MbVendor:          ASUSTeK COMPUTER INC.
MbName:            ROG STRIX B650E-E GAMING WIFI
BiosVersion:       2413
SmuVersion:        84.79.223
SmuTableVersion:   00540104
PatchLevel:        0A601206
DRAM Base Address: 0000000074E4F000

ZenTimings 1.31.1192 Debug Report
Core Version: 1.69.0

######################################################
System Info
######################################################
OS:                Microsoft Windows 11 Pro
CpuName:           AMD Ryzen 9 7950X3D 16-Core Processor
CodeName:          Raphael
CpuId:             00A60F12
BaseModel:         1
ExtendedModel:     96
Model:             97
Stepping:          2
FusedCoreCount:    16
PhysicalCoreCount: 16
NodesPerProcessor: 1
Threads:           32
SMT:               True
MbVendor:          ASUSTeK COMPUTER INC.
MbName:            ROG STRIX B650E-I GAMING WIFI
BiosVersion:       2204
SmuVersion:        84.79.223
SmuTableVersion:   00540004
PatchLevel:        0A601206
DRAM Base Address: 0000000074E59000

PJVol commented 8 months ago

@mann1x Btw, does the 7800x3d always have two CCDs, one of which is disabled?

mann1x commented 8 months ago

Not sure but seems many of them are. Same for the 7600X and 7800X, many are dual CCD and maybe using the 2nd instead of 1st.

irusanov commented 8 months ago

Is it possible that the X3D has the second CCD enabled and the CPU crashes if you're trying to get CO values with a core mask for first CCD? I believe people used my debug tool to get and set CO values for 7800X3D, but as there are different CPUs it might trigger something on this particular sample.

You should be able to see which cores are active in the PMT.

PJVol commented 8 months ago

> You should be able to see which cores are active in the PMT.

Right, that would work if the SMU returned a PMT of the same version as for a "real" two-CCD CPU. At least that was the case with Zen3, but it seems not to be so with Zen4: SmuTableVersion 00540104 (7800X3D) vs. SmuTableVersion 00540004 (7950X3D).

Maybe there's another way to figure it out, so here they are: Debug_Report_28482946.4572816.txt Debug_Report_7950X3D.txt

irusanov commented 8 months ago

I have released the SMUDebugTool 1.36 in its current state, using the latest dev DLL. Is it possible to check if it loads properly and if yes - what is detected on the PBO tab?

https://github.com/irusanov/SMUDebugTool/releases

Don't use Debug Report button. That lame mechanism to detect mailboxes does not work properly on these new SKUs, I might remove it altogether.

PJVol commented 8 months ago

You mean ask the user to run your tool and send a screenshot of PBO tab?

mann1x commented 8 months ago

Why do you need to look at the PMT to understand which one is the actual CCD? Isn't it enough to read the CCD fuses?

PJVol commented 8 months ago

Ivan's dll doesn't report active/fused ccd, I do (if app has survived launch, lol)

mann1x commented 8 months ago

Well, there's no specific function but you can read it.

If you look at CPUDoc code, I make a map of the CCDs with the fused/unfused cores.

Let me look at it.

PJVol commented 8 months ago

@irusanov Not the Raphael, but still, something you might wanna know

(image)

irusanov commented 8 months ago

> You mean ask the user to run your tool and send a screenshot of PBO tab?

Yes, if possible. Mostly to see if it crashes the PC the same way.

> @irusanov Not the Raphael, but still, something you might wanna know

Most probably a mask problem, which I need to work on. I have acquired a CZN recently (5300G)

PJVol commented 8 months ago

Btw, do you think it's worth asking another 7800x3d owner to run my app, just to rule out some unforeseen platform specific impact? Just in case, I think I got one )

mann1x commented 8 months ago

I have to resume my work, I'm forgetting too much stuff...

@irusanov I added the ApicId mapping to my version of the DLL in the past to properly handle changing the CO; maybe you can adopt it. From what I recall, without it, when there's a fused core not at the end of the CCD but at the beginning or in the middle, the change will fail. The mapping between virtual and physical cores is not linear and must be done via the ApicId mapping to be accurate.
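To illustrate why the mapping is not linear: if a per-CCD fuse byte marks disabled cores with set bits, the n-th logical core is the n-th clear bit, not simply bit n. The sketch below assumes 8 cores per CCD and a "bit set = core disabled" convention; both the `CoreMap` name and that convention are illustrative assumptions, and this is not the ApicId-based mapping itself.

```csharp
using System;
using System.Collections.Generic;

// Illustrative helper, not taken from ZenStates-Core or CPUDoc.
public static class CoreMap
{
    // Returns the physical core indices (0..7) that are enabled, in order;
    // index i of the result is the physical index of logical core i.
    public static int[] EnabledPhysicalCores(byte disableMap)
    {
        var enabled = new List<int>();
        for (int core = 0; core < 8; core++)
        {
            if ((disableMap & (1 << core)) == 0)
            {
                enabled.Add(core);
            }
        }
        return enabled.ToArray();
    }
}
```

With a disable map of `0x05` (physical cores 0 and 2 fused off), logical core 0 is physical core 1 and logical core 1 is physical core 3, so addressing cores by their logical index would target the wrong ones.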

PJVol commented 8 months ago

@irusanov @mann1x UPDATE!!! App crashes PC with 7600 cpu

It would be nice if ZT reported the fuse PCI range.

mann1x commented 8 months ago

@PJVol Maybe you can have him make a quick test replacing the ZenStates DLL with the one in CPUDoc. There shouldn't be any major difference that breaks it.

PJVol commented 8 months ago

@mann1x If you have in mind the code that determines fused cores/CCDs, my app doesn't use ZenStates for this purpose. Something isn't right, probably in my code or in some MMIO SMN registers. I just need an MMIO dump of fuses 0x5D300 - 0x5D3FF and of Parameter Blocks 0x30081CD0 and 0x32081CD0, if these are correct )

mann1x commented 8 months ago

Ah I didn't get it, thought you adopted the DLL for everything.

Then maybe look at dev branch here:

https://github.com/mann1x/ZenStates-Core/blob/2a57e9242dd32d7b2a06962d62c89a896882d23a/Cpu.cs#L250

It's slightly different and not very well optimized, but it works. It also does the ApicId mapping.

This is the function specific to AM5 that detects the disabled cores via CO:

https://github.com/mann1x/ZenStates-Core/blob/2a57e9242dd32d7b2a06962d62c89a896882d23a/Cpu.cs#L649

PJVol commented 8 months ago

Thanks, I will look into it

irusanov commented 8 months ago

@PJVol 0x30081CD0 is correct, other fuses are, I think, 0x5D3BC and 0x5D3C0. PS: Your code seems to use correct addresses.

In my code I don't actually read the fuse of an inactive CCD, but I doubt a simple read would crash the CPU?

PJVol commented 8 months ago

One step closer to solving it.

PJVol commented 8 months ago

I think I figured out the issue.

@irusanov Do you have any idea why ccd_fuse1 [31:30] is always 0 on Zen4, but not on Zen3?

Are there no more disabled CCDs in Zen4?

@mann1x How did you know that a 6/8-core Zen4 CPU has one downbinned CCD?

Btw, I've looked into your code that determines disabled cores. It seems you make use of the method bequeathed by ancient manuscripts: blind brute-forcing? )
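For anyone following along, "ccd_fuse1 [31:30]" refers to the top two bits of a 32-bit register value. A generic bit-field extraction looks like the sketch below; the `Bits` helper name is illustrative, and the sketch makes no claim about what those bits mean on a given CPU.

```csharp
using System;

// Illustrative helper for reading a bit range out of a 32-bit register value.
public static class Bits
{
    // Extracts bits [high:low] (inclusive) from a 32-bit value.
    public static uint Field(uint value, int high, int low)
    {
        if (low < 0 || high > 31 || high < low)
        {
            throw new ArgumentOutOfRangeException(nameof(high));
        }
        int width = high - low + 1;
        uint mask = width >= 32 ? 0xFFFFFFFFu : (1u << width) - 1u;
        return (value >> low) & mask;
    }
}
```

`Bits.Field(raw, 31, 30)` yields a value from 0 to 3, so "always 0" means both of the top bits read as clear.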

irusanov commented 8 months ago

> I think I figured out the issue.

Curious to know what it is and whether I need to fix something in the DLL.

> @irusanov Have you got an idea why ccd_fuse1 [31:30] is always 0 in Zen4, and not in Zen3 ?
>
> There's no more disabled CCDs in Zen4?

No idea, might be related to the issues they had with Zen3 which had the second CCD active?

PS: That CO-probing method is kind of a neat workaround, though I don't know how reliable it will be in the long term.

mann1x commented 8 months ago

Curious as well, what it was !?

I'm going with my memory which is bugged so... hope it's correct. You know there is a downbinned CCD if the core fusemap is not on the first byte of the CCD fusemap. I don't think you can know it for sure from the map if there's a 2nd CCD which is not used.

But there are other ways to understand it. The first and most common, if you know the PMT, is to check the L3 cache temperature. You will always get the L3 metrics from the disabled CCD. If you combine it with the fusemap you can determine exactly which and where.

Another trick I think works is the CO (if the CPU supports it); just like the search for the disabled cores with Zen4. The cores on the disabled CCD will give back a result (I think always 0) instead of error.

The trick works pretty well but it's not very elegant... to be honest I did a version much more neat, slick and compact. It didn't work... After too many failures I went back to the original and gave up :P

PJVol commented 8 months ago

@irusanov I'm not 100% sure yet. I need time and some assistance. Can you tell me why a "Console Application" fails to run when built with ZenStates? I want to build a simple app, with the least possible code, that users can test on their PCs.

PS C:\Users\Vol\source\repos\PBO test\bin\Debug\netcoreapp3.1> .\'PBO test.exe'
Hello!
Unhandled exception. System.MissingMethodException: Method not found: 'System.Threading.Mutex System.Threading.Mutex.OpenExisting(System.String, System.Security.AccessControl.MutexRights)'.
   at OpenHardwareMonitor.Hardware.Ring0.Open()
   at ZenStates.Core.Cpu..ctor()
   at PBO_test.App..ctor() in C:\Users\Vol\source\repos\PBO test\Program.cs:line 18
   at PBO_test.Program.Main(String[] args) in C:\Users\Vol\source\repos\PBO test\Program.cs:line 11
PS C:\Users\Vol\source\repos\PBO test\bin\Debug\netcoreapp3.1>

"Windows Application" is ok, but console output doesn't work in this case. App is simple:

using System;
using ZenStates.Core;

namespace PBO_test
{
    internal class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello!");
            App app = new App();
            app.Start();
        }
    }

    public class App
    {
        public Cpu cpu = new Cpu();
        public App() { }
        public void Start()
        {
            string[] args = Environment.GetCommandLineArgs();
            if (args.Length < 2) {
                Console.WriteLine("No CMD given. Bye!");
            } else {
                string cmd = args[1];
                Console.WriteLine($"CPU: {cpu.info.cpuName} - CMD: {cmd}");
            }
        }
    }
}

irusanov commented 8 months ago

@PJVol The DLL is compiled for 3 framework targets and those for netcoreapp 3.1 and netstandard 2.1 are basically untested. There are some conditions in the code to account for the different supported features. The mentioned method is one of them.

It should use this condition, but I'm not quite sure why it doesn't in the case of a console application. Are you referencing the netcoreapp version of the DLL? I believe that might be your issue as the netcoreapp3.1 version should be compiled with MutexAcl instead of Mutex. Unless I have missed some instance in the code, but I think it won't compile at all in that case.

#if NETCOREAPP || NETSTANDARD
                    pciBusMutex = MutexAcl.OpenExisting(pciMutexName, MutexRights.Synchronize);
#else
                    pciBusMutex = Mutex.OpenExisting(pciMutexName, MutexRights.Synchronize);
#endif

PS: I'm using the .net2.0 version in my apps, so if you have grabbed it from there, then that's causing your issue. You have to get the full release zip from GitHub, build from source, or I can attach the latest build. Whatever works for you, let me know.

PS2: @mann1x was the catalyst to actually build for different targets, IIRC. Before that, I was only building for dotNet 2.x

mann1x commented 8 months ago

Weird, I have my console test App and works without any issue. Mine is a .NET 6 console app. How did you configure the project?

PJVol commented 8 months ago

@irusanov idk... there's no net2.0 in the target frameworks. Only these. When the build target is .NET Core 2.0 or 2.1, a .dll is produced instead of an .exe, lol. .NET Core 3.1 and higher gives an .exe.

(image)

mann1x commented 8 months ago

Yeah I think Ivan is right, set .NET 6 or 7 and use the full release download.

PJVol commented 8 months ago

Just did it with .NET 7, to no avail. Did you manage to build a console app with ZenStates?

mann1x commented 8 months ago

Yes I have an empty project where I have the stuff to test. Didn't have to do anything special, as I remember. I can share it if you want. I'm using my version of the DLL but it shouldn't matter.

PJVol commented 8 months ago

I chose "Console App" initially; maybe I should have selected "Console App (.NET Framework)"?

mann1x commented 8 months ago

Yes indeed!
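When it is unclear which runtime a test executable actually runs on, printing it at startup can help. `RuntimeInformation.FrameworkDescription` is available on .NET Core / .NET 5+ and on .NET Framework 4.7.1 or later; the `RuntimeCheck` class name is made up for this sketch.

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative helper: reports which .NET runtime the process is running on.
public static class RuntimeCheck
{
    // Returns a string that starts with ".NET Framework", ".NET Core" or ".NET".
    public static string Describe()
    {
        return RuntimeInformation.FrameworkDescription;
    }

    public static bool IsNetFramework()
    {
        return Describe().StartsWith(".NET Framework", StringComparison.Ordinal);
    }
}
```

Calling `Console.WriteLine(RuntimeCheck.Describe());` as the first line of `Main` shows whether the project template produced a .NET Framework or a .NET Core style application, which in turn determines which build of the DLL should be referenced.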