sp00n / corecycler

Script to test single core stability, e.g. for PBO & Curve Optimizer on AMD Ryzen or overclocking/undervolting on Intel processors

script-corecycler.ps1: CPU usage check fails with "FFTSize=All" & "numberOfThreads=2" #33

Closed Lincutt closed 1 year ago

Lincutt commented 1 year ago

When "FFTSize=All" and "numberOfThreads=2" are set in config.ini, the CPU usage check in script-corecycler.ps1 fails and stops the stress test. Resetting to "numberOfThreads=1" works fine.

The attached file contains my config and error log: errlog.zip

sp00n commented 1 year ago

Hm. What happens if you change the FFT size or switch from AVX2 to e.g. SSE? You can also check the CPU utilization for prime95.exe with the Task Manager. For an 8 core / 16 threads machine like yours it should be 1/8th of 100%, so 12.5% CPU utilization for prime95.exe when using 2 threads. But it's only seeing a max of 4%.

You can also open up Prime95, it's minimized to the tray bar automatically, but you can restore the window. Maybe there's some initialization bug and it doesn't start correctly, although it seems to work just fine for me.
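
The 12.5% figure follows from dividing the number of stress test threads by the number of logical cores. A minimal sketch of that arithmetic, where `Get-ExpectedCpuPercent` is a hypothetical helper used only for illustration and not a function from the CoreCycler script:

```powershell
# Hypothetical helper (not part of CoreCycler): expected total CPU utilization
# in percent for a process that keeps $Threads logical cores fully busy.
function Get-ExpectedCpuPercent {
    param(
        [int] $Threads,
        [int] $LogicalCores = [Environment]::ProcessorCount
    )
    return [Math]::Round(100 * $Threads / $LogicalCores, 2)
}

Get-ExpectedCpuPercent -Threads 2 -LogicalCores 16   # 12.5
Get-ExpectedCpuPercent -Threads 1 -LogicalCores 16   # 6.25
```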

Lincutt commented 1 year ago

> Hm. What happens if you change the FFT size or switch from AVX2 to e.g. SSE?

Same result as with AVX2.

> You can also check the CPU utilization for prime95.exe with the Task Manager. For an 8 core / 16 threads machine like yours it should be 1/8th of 100%, so 12.5% CPU utilization for prime95.exe when using 2 threads. But it's only seeing a max of 4%.

It shows around 20%: https://i.imgur.com/KHdBNG7.jpg

> You can also open up Prime95, it's minimized to the tray bar automatically, but you can restore the window.

Opening up Prime95 makes my laptop laggy... I tried lowering its priority to normal and restoring the window. Prime95 is still running without error, but CoreCycler says the CPU usage was too low.

> Maybe there's some initialization bug and it doesn't start correctly, although it seems to work just fine for me.

Maybe this is a bug in Windows 11's new Task Manager?

For now I have deleted lines 3334-3404 to make it work on my laptop.

sp00n commented 1 year ago

That is weird. You can remove the CPU utilization check, but then the only error detection left is Prime95 writing an error to its log file. That does not always happen, though: if the error was so critical that Prime95 itself crashed, nothing gets logged.

The CPU check relies on the Windows Performance Counters, and these seem to break quite easily. You can try to run enable_performance_counter.bat in the /tools directory as an Administrator, although I haven't tested it on Windows 11 yet and don't know if all the paths or commands are still the same there (I assume so though).

You can also try to access the performance counter with this PowerShell command, after having started Prime95:

    Get-Counter "\\name_of_pc\process(prime95)\% processor time"

where name_of_pc is the name of your computer. The exact path should also be visible in the CC log file.

Lincutt commented 1 year ago

> That is weird. You can remove the CPU utilization check, but then the only error detection left is Prime95 writing an error to its log file. That does not always happen, though: if the error was so critical that Prime95 itself crashed, nothing gets logged.

> The CPU check relies on the Windows Performance Counters, and these seem to break quite easily. You can try to run enable_performance_counter.bat in the /tools directory as an Administrator, although I haven't tested it on Windows 11 yet and don't know if all the paths or commands are still the same there (I assume so though).

I ran enable_performance_counter.bat and it reported success, but CC still shows the low CPU usage message 🥲

> You can also try to access the performance counter with this PowerShell command, after having started Prime95: Get-Counter "\\name_of_pc\process(prime95)\% processor time", where name_of_pc is the name of your computer. The exact path should also be visible in the CC log file.

https://i.imgur.com/oPUue2d.png The CounterSamples values are around 99~200 and change every time.
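
For context on these numbers: the raw `% Processor Time` value of a process counter is summed over all logical cores, which is why the original line quoted further down in this thread divides it by `$numLogicalCores`. A rough sketch of that normalization for a machine with 16 logical cores, where `Convert-RawCounterToPercent` is a hypothetical name used only for illustration:

```powershell
# Hypothetical helper (not part of CoreCycler): convert the raw per-process
# "% Processor Time" value into a share of the whole CPU.
function Convert-RawCounterToPercent {
    param(
        [double] $RawValue,
        [int]    $LogicalCores = [Environment]::ProcessorCount
    )
    return [Math]::Round($RawValue / $LogicalCores, 2)
}

Convert-RawCounterToPercent -RawValue 200 -LogicalCores 16   # 12.5
Convert-RawCounterToPercent -RawValue 99  -LogicalCores 16   # 6.19
```

A raw value of 200 therefore corresponds to the expected 12.5%, while a raw value of 99 is only about half of that.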

https://i.imgur.com/xpHEfqV.png After removing the CPU check, the usage seems to fluctuate.

https://i.imgur.com/8OH5269.png

Lincutt commented 1 year ago

After setting TortureHyperthreading=0 to use separate threads in Prime95 v30.8b16, the CPU usage check behaves properly again (no more fluctuation!) 🤩

https://i.imgur.com/ytwt7dv.png

I think TortureHyperthreading=1 means Prime95 assigns one collaborative job to 2 threads, while TortureHyperthreading=0 means Prime95 always assigns 2 separate jobs to 2 threads.

So if we want to calculate the correct CPU usage with TortureHyperthreading=1 (and a target FFT size bigger than 240K), maybe we should get both threads' usage and sum them up?

sp00n commented 1 year ago

To get two threads with TortureHyperthreading=0 you have to change the NumThreads and WorkerThreads setting as well. See line 2275 and following. Both these settings should then either be 1 or 2, depending on the selected number of threads.

This is the "old" behavior before Prime version 30.7, but you can basically copy the setting for 30.6 to 30.7 now. That's what will be in the next release anyway:

    # Prime 30.6 and before:
    if ($isPrime95_30_6) {
        Add-Content $configFile1 ('CpuNumHyperthreads=' + $settings.General.numberOfThreads)       # If this is not set, Prime95 will create two worker threads in 30.6
        Add-Content $configFile1 ('WorkerThreads='      + $settings.General.numberOfThreads)
    }

    # Prime 30.7 and above:
    if ($isPrime95_30_7) {
        # If this is not set, Prime95 will create #numCores worker threads in 30.7+
        Add-Content $configFile1 ('NumThreads='    + $settings.General.numberOfThreads)
        Add-Content $configFile1 ('WorkerThreads=' + $settings.General.numberOfThreads)

        # If we're using TortureHyperthreading in prime.txt, this needs to stay at 1, even if we're using 2 threads
        # TortureHyperthreading introduces inconsistencies with the log format for two threads though, so we won't use it
        # Add-Content $configFile1 ('NumThreads=1')
        # Add-Content $configFile1 ('WorkerThreads=1')
    }

Lincutt commented 1 year ago

Replacing every occurrence of

    $processCPUPercentage = [Math]::Round(((Get-Counter $processCounterPathTime -ErrorAction Ignore).CounterSamples.CookedValue) / $numLogicalCores, 2)

with

    $processCPUPercentage = 0

    # Add up the scaled utilization of each logical processor listed in $cpuNumbersArray
    foreach ($i in $cpuNumbersArray) {
        $threadInfo = Get-CimInstance -Query ('SELECT * FROM Win32_PerfFormattedData_Counters_ProcessorInformation WHERE Name="0,' + $i + '"')
        $processCPUPercentage += [Math]::Round($threadInfo.PercentProcessorUtility * $expectedUsagePerCore / $threadInfo.PercentProcessorPerformance, 2)
    }

and replacing $thisProcessCPUPercentage with $processCPUPercentage fixes this bug for any FFTSize, numberOfThreads 1 or 2, and TortureHyperthreading 0 or 1.

The attached file is script-corecycler.ps1 with my mod, FYI 🙂 my-script-corecycler.zip

Lincutt commented 1 year ago

> That is weird. You can remove the CPU utilization check, but then the only error detection left is Prime95 writing an error to its log file. That does not always happen, though: if the error was so critical that Prime95 itself crashed, nothing gets logged.

I've found a way to reproduce this bug with v0.9.1.0 of script-corecycler.ps1. The steps are listed below.

In config.ini:

  1. set numberOfThreads=2
  2. set runtimePerCore=auto
  3. set FFTSize=256-32768
  4. set logLevel=3

Then check the CPU usage messages shown in the verbose output.

sp00n commented 1 year ago

Your test scenario runs fine on my machine. What CPU do you have?


> $threadInfo = Get-CimInstance -Query ('SELECT * FROM Win32_PerfFormattedData_Counters_ProcessorInformation WHERE Name="0,' + $i + '"')
> $processCPUPercentage += [Math]::Round($threadInfo.PercentProcessorUtility * $expectedUsagePerCore / $threadInfo.PercentProcessorPerformance, 2)

This code would check the total usage of a core, not only that of Prime95 (or whichever stress test program is selected). So it could generate false results if another program is using the same core.

It could be a cache problem with the Get-Counter calls... How are you starting the script, with the .bat file or calling it directly within a PowerShell terminal?

If you start it from within a terminal window, the PowerShell instance will only have the Counters available for those processes that were present when it initially started. I just experienced this when I started the script from within a PowerShell terminal window and switched to YCruncher, then I received 0% processor usage as well. After aborting the script, running a Get-Counter from the PS window, all while having YCruncher still running in the background, and then restarting the CoreCycler script from within the terminal window, it could read the processor usage correctly.

Starting the script with the batch file does not show this type of behavior, there it should work (unless the Counters themselves are borked). I might try to add an additional check for that.
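
Such an additional check could look roughly like the following. This is only a sketch and not the check that ended up in CoreCycler: `Test-ProcessCounter` is a hypothetical helper, the counter path assumes an English Windows installation, and it assumes that the stale counter case surfaces as a `Get-Counter` error.

```powershell
# Hypothetical helper (not part of CoreCycler): returns $true if the
# "% Processor Time" counter of the given process can be read from this session.
function Test-ProcessCounter {
    param([string] $ProcessName)

    # Assumption: English counter names; localized systems use different names
    $counterPath = "\Process($ProcessName)\% Processor Time"

    try {
        # -ErrorAction Stop turns a missing counter instance into a catchable error
        $null = Get-Counter -Counter $counterPath -ErrorAction Stop
        return $true
    }
    catch {
        return $false
    }
}

if (-not (Test-ProcessCounter -ProcessName 'prime95')) {
    Write-Warning 'The performance counter for prime95 is not readable from this PowerShell session.'
}
```

A check like this only detects the situation; restarting the script from a fresh PowerShell instance, as described above, would still be needed to get working counters.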

Lincutt commented 1 year ago

> Your test scenario runs fine on my machine. What CPU do you have?

The CPU is a 5900HX in my laptop, running Windows 11 22H2.

> $threadInfo = Get-CimInstance -Query ('SELECT * FROM Win32_PerfFormattedData_Counters_ProcessorInformation WHERE Name="0,' + $i + '"')
> $processCPUPercentage += [Math]::Round($threadInfo.PercentProcessorUtility * $expectedUsagePerCore / $threadInfo.PercentProcessorPerformance, 2)

> This code would check the total usage of a core, not only that of Prime95 (or whichever stress test program is selected). So it could generate false results if another program is using the same core.

Yes, I know. But Get-Counter doesn't work properly on my laptop.

> It could be a cache problem with the Get-Counter calls... How are you starting the script, with the .bat file or calling it directly within a PowerShell terminal?

I run script-corecycler.ps1 by right-clicking it and selecting "Run with PowerShell".

> If you start it from within a terminal window, the PowerShell instance will only have the Counters available for those processes that were present when it initially started. I just experienced this when I started the script from within a PowerShell terminal window and switched to YCruncher, then I received 0% processor usage as well. After aborting the script, running a Get-Counter from the PS window, all while having YCruncher still running in the background, and then restarting the CoreCycler script from within the terminal window, it could read the processor usage correctly.

> Starting the script with the batch file does not show this type of behavior, there it should work (unless the Counters themselves are borked). I might try to add an additional check for that.

I tried running "Run CoreCycler.bat" by double-clicking it, but the result is the same as described above... 🥲 I also found something weird in my Task Manager: Prime95's CPU usage differs between the Processes page and the Details page.

The Processes page shows Prime95 using around 20%: https://i.imgur.com/AxLn7Cq.png

But the Details page shows it using only 0-5%, fluctuating: https://i.imgur.com/WOG4voQ.png (I changed $maxChecks to 10 to keep it from closing Prime95 too quickly.)

sp00n commented 1 year ago

I wonder what happens if you change to YCruncher?

It may be a laptop specific thing in combination with W11. Unfortunately I have no way of testing this.

Lincutt commented 1 year ago

> I wonder what happens if you change to YCruncher?

> It may be a laptop specific thing in combination with W11. Unfortunately I have no way of testing this.

YCruncher seems to work well. https://i.imgur.com/cWIGBO2.png https://i.imgur.com/o8ZWeBw.png

sp00n commented 1 year ago

This 19-20% in the Task Manager is really weird and outright wrong. It should be 12.5%, as the process details also show. It seems that the Task Manager is using two different methods to determine the CPU utilization for the overview and the details page: https://aaron-margosis.medium.com/task-managers-cpu-numbers-are-all-but-meaningless-2d165b421e43

(I myself am using Process Explorer, which uses the same method as the details page.)
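
One plausible explanation for the gap between the two pages, following the linked article: the overview appears to be based on a "processor utility" style value that scales with the current clock relative to the base clock, while the details page is based on plain processor time. A rough sketch of that relationship, where the helper name and the 160% performance figure are assumptions for illustration and not values measured in this thread:

```powershell
# Hypothetical helper (not part of CoreCycler): approximate the plain
# "processor time" share from a frequency-scaled "processor utility" value.
function Convert-UtilityToTimePercent {
    param(
        [double] $UtilityPercent,      # frequency-scaled value, as on the overview page
        [double] $PerformancePercent   # "% Processor Performance", 100 = base clock
    )
    return [Math]::Round($UtilityPercent * 100 / $PerformancePercent, 2)
}

# Assumed example: a core boosting to 160% of its base clock
Convert-UtilityToTimePercent -UtilityPercent 20 -PerformancePercent 160   # 12.5
```

This is the same kind of scaling that the modified script earlier in this thread applies when it divides PercentProcessorUtility by PercentProcessorPerformance.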

I've added a new tag for version 0.9.1.1 (no release yet), which disables the TortureHyperthreading setting, maybe it'll work for you: https://github.com/sp00n/corecycler/releases/tag/v0.9.1.1

sp00n commented 1 year ago

@Lincutt Have you been able to test the new version 0.9.1.1 with your laptop yet?

Lincutt commented 1 year ago

> @Lincutt Have you been able to test the new version 0.9.1.1 with your laptop yet?

Yes, I've tested it and it works fine on my laptop. But as I said before, turning off TortureHyperthreading means Prime95 always runs separate jobs on multiple threads, e.g. two logical cores for two 1024K FFTs. So that's predictable (for me).

I think TortureHyperthreading is a new feature that can test system stability in a different way. It lets Prime95 use multiple threads for a single collaborative job, e.g. two logical cores for one 1024K FFT. No offense, but in my opinion we should not just turn TortureHyperthreading off to work around this.

sp00n commented 1 year ago

I actually turned off the setting to fix the log parsing issue; that it now works on your system is basically just a side effect. ;)

But I've opened a thread in the Prime95 forum regarding the setting, let's see if it does offer an improvement in the stability testing. https://www.mersenneforum.org/showthread.php?t=28058

sp00n commented 1 year ago

The response of the lead programmer:

If you are using OS tools to restrict prime95 to run on one specific core, then your two examples are equivalent.

So apparently there is no difference in terms of stress testing between the old and the new way, and I don't have to adjust the log recognition to support TortureHyperthreading. :)

Lincutt commented 1 year ago

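> The response of the lead programmer:

> If you are using OS tools to restrict prime95 to run on one specific core, then your two examples are equivalent.

> So apparently there is no difference in terms of stress testing between the old and the new way, and I don't have to adjust the log recognition to support TortureHyperthreading. :)

Great! If so, then simply rolling back to the old mechanism without TortureHyperthreading is the best way. 🙂 Anyway, I just installed Windows 10 on the same laptop and ran v0.9.1.0 on it. Everything works fine without the "CPU usage too low" error, so I think it's a problem with my system or with Windows 11. 🤣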