sp00n / corecycler

Script to test single core stability, e.g. for PBO & Curve Optimizer on AMD Ryzen or overclocking/undervolting on Intel processors

script does not detect rounding error #1

Closed erazortt closed 3 years ago

erazortt commented 3 years ago

When p95 errors out on a rounding error (see below), the script does not detect it, even though p95 has stopped. No errors are shown in the shell. The script continues as if everything is all right, even though nothing is being computed after this point! There is an entry only in the p95 logs; the CoreCycler log does not show it.

Rounding error: FATAL ERROR: Rounding was 0.5, expected less than 0.4 Hardware failure detected, consult stress.txt file.

This is the only error my CPU throws, so the tool can currently only be used while also monitoring CPU activity manually.

sp00n commented 3 years ago

Normally it should detect the error and put out a corresponding message in the shell: [image]

The script periodically checks the CPU usage of the Prime95 process, and if the measured usage is below the expected value, it checks the results_xxx.txt generated by Prime95 (the path to which is shown at the start of the script), and parses any error messages in the last couple of rows.
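The check described above can be sketched roughly as follows. This is only an illustration in Python, not CoreCycler's actual code (which is a PowerShell script); the function names, the error patterns, the `tail` size, and the usage threshold are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical patterns, based only on the error text quoted in this thread.
ERROR_PATTERNS = [
    re.compile(r"FATAL ERROR", re.IGNORECASE),
    re.compile(r"Hardware failure detected", re.IGNORECASE),
]


def find_errors_in_tail(lines, tail=5):
    """Return the lines among the last `tail` entries that look like errors."""
    recent = lines[-tail:]
    return [ln.strip() for ln in recent
            if any(p.search(ln) for p in ERROR_PATTERNS)]


def check_worker(cpu_usage_percent, expected_percent, results_path, tail=5):
    """Classify one periodic check as 'ok', 'error', or 'stalled'.

    cpu_usage_percent: measured usage of the stress test process (assumed input)
    expected_percent:  minimum usage expected while a worker is running
    results_path:      path to the results file written by the stress test
    """
    if cpu_usage_percent >= expected_percent:
        return ("ok", [])
    path = Path(results_path)
    # Usage is too low: look at the end of the results file for a reason.
    lines = path.read_text(errors="replace").splitlines() if path.exists() else []
    errors = find_errors_in_tail(lines, tail)
    return ("error" if errors else "stalled", errors)


if __name__ == "__main__":
    # The error message quoted earlier in this thread, used as sample input.
    sample = ["FATAL ERROR: Rounding was 0.5, expected less than 0.4 "
              "Hardware failure detected, consult stress.txt file."]
    print(find_errors_in_tail(sample))
```

The point of the 'stalled' state is the case reported here: the process usage has dropped, but no matching error line was found, so the drop should still be reported rather than silently ignored.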

erazortt commented 3 years ago

I have never gotten this error message in the shell. The shell remains completely silent. This is what it looks like: [image]

See how the processor activity in Task Manager drops, since p95 is now inactive.

sp00n commented 3 years ago

The error check happens in 30 second intervals. Going by your screenshot, you may have reacted too soon. Did you wait a bit after the screenshot or did you immediately close the terminal window?

erazortt commented 3 years ago

Before I figured out what was going on, I only had the shell open. By chance I had the core temperature open in hwi and wondered why it was low even though the test had already been running for one hour. Only then did I see the red p95 icon. So no, there were no error messages even after 1 hour, while the shell kept printing its usual messages as if everything was fine.

erazortt commented 3 years ago

Setting restartPrimeForEachCore = 1 helps somewhat: at least Prime95 is restarted and tests the other cores, which it would not do with restartPrimeForEachCore = 0.

sp00n commented 3 years ago

Can you try to check what happens if you manually stop the worker thread or close Prime95 while the script is still running? It should be detected and restart Prime95, and a message should be displayed in the shell (although no specific Prime95 error will be displayed as in my screenshot above).

sp00n commented 3 years ago

Meh, I cannot replicate this ._.

image

erazortt commented 3 years ago

Yes, when I close p95 it is registered.

erazortt commented 3 years ago

What I'm using is a fresh Windows 10 20H2 installation.

erazortt commented 3 years ago

And in fact, since there is pretty much nothing on it, I could create a new user account and grant you RDP access to it if you want. Then you could debug this on the live system.

sp00n commented 3 years ago

Can you check with the new 0.7.8.5 version? I reworked the error detection a bit, hopefully for the better.

erazortt commented 3 years ago

yup, working now!