Possible regression: 1 Prime95 error causes all further tests to fail now.

EamonNerbonne commented 11 months ago

First off: thanks for the highly useful program!

I've noticed what may be a regression between 0.9.3.0 and 0.9.4.2. Previously, when stress-testing systems running on the edge of stability, corecycler would report Prime95 detected errors in a core, and then continue. This meant it was possible to run corecycler unattended and collect all failing results after a period of time. In 0.9.4.2, when one run of Prime95 for one core fails, all subsequent runs for all other cores also fail.

I'm not sure this is a regression however, because enough changed between 0.9.3.0 and 0.9.4.2 that I needed to update the config file. In particular, I was using a short core cycling time of 10s to additionally test core load transitions, and that conflicts with the new delayFirstErrorCheck. It's possible the config changes are responsible.

I tried looking through the git history of script-corecycler.ps1, but I wasn't able to find any obvious smoking guns, though https://github.com/sp00n/corecycler/commit/fc9b055f551a57d22830d347f99cf6825d8fa986#diff-df219cdfda5262627131d1e8c19fc5dc20aaea162a39bb0fd008282f3ab7ca7c seemed plausible.

Does this bug ring any bells, or is it more likely a mis-config? For completeness, here's the diff for my config, though I don't expect that to contain any surprises config.ini diff.

sp00n commented 11 months ago

The new debug settings like delayFirstErrorCheck don't really affect/change the testing process if left untouched. They just allow you to change the previously hardcoded values, while the defaults for these haven't changed.

There's a bug in 0.9.4.2 though, where the delayFirstErrorCheck is set to 30 by default, whereas the previously hardcoded value was 0 (and not 3 like in your config).

Can you upload the log files for such a failed run?

I've also released an alpha version in the meantime, maybe this one works better for you: https://github.com/sp00n/corecycler/releases/tag/v0.9.5.0alpha2

EamonNerbonne commented 11 months ago

I intentionally misconfigured cores 3 & 4 in my bios, and ran the latest master commit. As expected, core 3 immediately failed, however errors continued on for all subsequent cores. The logfile is attached: CoreCycler_2023-08-03_22-08-52_PRIME95_AVX2.log

EamonNerbonne commented 11 months ago

I can also confirm 7f72721df2212f2c10ede257da6624fbc3eb4b45 still works, so I'm pretty sure whatever problem there is is related to fc9b055f551a57d22830d347f99cf6825d8fa986

EamonNerbonne commented 11 months ago

Right so this is indeed related to delayFirstErrorCheck. I had reduced it to 3 from 30 because I thought the value in 0.9.4.2 was intentional, but when delayFirstErrorCheck > runtimePerCore, then corecycler seems to test for 0 seconds per core.

Perhaps any non-zero delayFirstErrorCheck triggers this behavior?

sp00n commented 11 months ago

Due to the short runtime it registers the first error as another new error. I remember that I did cache the last error to compare any possible new error against it to prevent this from happening, it seems I somehow bodged this.

sp00n commented 11 months ago

Okay, I had ditched the cached error logs in favor of a getting only new log entries. This never happened though for very short runtimePerCore values and when delayFirstErrorCheck was set, as the whole "tick" code was never processed (which included getting new log entries).

So either set the delayFirstErrorCheck to 0 or change this code:

Old:

if ($tickInterval -ge 1) {

    # We need a special treatment for a custom delayFirstErrorCheck value
    if ($delayFirstErrorCheck) {
        $cpuCheckIterations = [Math]::Floor(($runtimePerCore - $delayFirstErrorCheck) / $tickInterval)
    }
    else {
        $cpuCheckIterations = [Math]::Floor($runtimePerCore / $tickInterval)
    }
}

if ($delayFirstErrorCheck) {
    $runtimeRemaining = $runtimePerCore - $delayFirstErrorCheck - ($cpuCheckIterations * $tickInterval)
}
else {
    $runtimeRemaining = $runtimePerCore - ($cpuCheckIterations * $tickInterval)
}

New:

if ($tickInterval -ge 1) {

    # We need a special treatment for a custom delayFirstErrorCheck value
    if ($delayFirstErrorCheck) {
        $cpuCheckIterations = [Math]::Floor(($runtimePerCore - $delayFirstErrorCheck) / $tickInterval)
    }
    else {
        $cpuCheckIterations = [Math]::Floor($runtimePerCore / $tickInterval)
    }

    # We want at least one tick to happen
    $cpuCheckIterations = [Math]::Max(1, $cpuCheckIterations)
}

if ($delayFirstErrorCheck) {
    $runtimeRemaining = $runtimePerCore - $delayFirstErrorCheck - ($cpuCheckIterations * $tickInterval)
}
else {
    $runtimeRemaining = $runtimePerCore - ($cpuCheckIterations * $tickInterval)
}

$runtimeRemaining = [Math]::Max(0, $runtimeRemaining)

EamonNerbonne commented 11 months ago

Yeah, as far as my personal use case goes this is all completely resolved by settings delayFirstErrorCheck to zero. If your suggested fix is all it takes to avoid others hitting both the issues with short runtimePerCore, that might help others to avoid falling into this trap.

If I read the code correctly, this will both fix the delayFirstErrorCheck > runtimePerCore results in zero runtime issue and the runtimePerCore > delayFirstErrorCheck causes 1 error to cause all further core test to fail spuriously?

Anyhow - thanks for the help and corecycler! If you have any specific things you want me to test, give a shout; but this resolution is fine for me.

sp00n commented 11 months ago

Yeah, it should fix both issues. I've also added some additional changes down the line, since right now the delayFirstErrorCheck does actually increase the runtime per core if it is very short. If it was e.g. set to delay for 30s and the runtime was 10s, it would first do the first tick (which is by default also 10 seconds) and then do the check if the first error check should be delayed, which adds another 30 seconds. In the next release this shouldn't happen anymore.

But it's a debug setting anyway, and with the correct default setting (cough) not many people should run into it anyway. There are reasons why these settings are in the debug section, I just messed up the default config during testing.

sp00n / corecycler

Possible regression: 1 Prime95 error causes all further tests to fail now. #51