.NET crashes with exit code 80131506 and kills powershell.exe & CoreCycler

Grzywax commented 10 months ago

Hi, after two weeks of tuning I have my curve tuned well enough that CoreCycler can run for a long time without errors. Since then, it regularly just stops. No error, nothing. It can happen at iteration 14, 17 or 18... usually after at least 10 iterations, and I have never passed 20 iterations in one run. What could be the reason? Are unstable cores still causing problems?

Here is the log:

Resuming the stress test process
- Resumed: True
- 11:33:57 - Getting new log file entries
- Getting new log entries starting at position 151591 / Line 5744
- The new log file entries:
- - [Line 5745] Self-test 14336K passed!
- New file position: 151617 / Line 5745
- 11:33:57 - Checking CPU usage: 3.12%
- 11:33:58 - Tick 5 of max 36
- Remaining max runtime: 312s
- 11:34:07 - Suspending the stress test process for 1000 milliseconds
- Suspended: True
- Resuming the stress test process
- Resumed: True
- 11:34:09 - Getting new log file entries
- Getting new log entries starting at position 151617 / Line 5745
- The new log file entries:
- - [Line 5746] Self-test 15360K passed!
- New file position: 151643 / Line 5746
- 11:34:09 - Checking CPU usage: 3.12%
- 11:34:10 - Tick 6 of max 36
- Remaining max runtime: 300s
- 11:34:19 - Suspending the stress test process for 1000 milliseconds
- Suspended: True
- Resuming the stress test process

Here is a screenshot:

And here's the Prime95 log from around the time CoreCycler stopped:

Self-test 8960K passed!
[Thu Aug 31 11:29:22 2023]
Self-test 14336K passed!
Self-test 9216K passed!
Self-test 9600K passed!
Self-test 10240K passed!
[Thu Aug 31 11:30:36 2023]
Self-test 10752K passed!
Self-test 11200K passed!
Self-test 15360K passed!
Self-test 11520K passed!
[Thu Aug 31 11:31:57 2023]
Self-test 12288K passed!
Self-test 12800K passed!
Self-test 13440K passed!
[Thu Aug 31 11:33:09 2023]
Self-test 13824K passed!
Self-test 16000K passed!
Self-test 14336K passed!
Self-test 15360K passed!
[Thu Aug 31 11:34:30 2023]
Self-test 16000K passed!
Self-test 16384K passed!
Self-test 17920K passed!
Self-test 16384K passed!
[Thu Aug 31 11:35:40 2023]

sp00n commented 10 months ago

At first I thought you meant it freezes, but it seems the script just exits? That's a first, I've never seen this being reported. 😮 Is there any common denominator in the log files, e.g. it always happens when resuming the stress test program, etc.? Does the Windows Event Log maybe say something about this?
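One way to look for such a common denominator is to compare the last step recorded in each log file. The following is only a minimal sketch: the function names, the `logs` path and the `*.log` pattern are assumptions made for illustration and are not part of CoreCycler, and it assumes the log files can be read as plain text.

```python
from collections import Counter
from pathlib import Path


def last_step(log_text: str) -> str:
    """Return the last non-empty log line without list markers or timestamp."""
    # Strip whitespace and leading "-" / "+" markers, then drop empty lines
    lines = [
        cleaned
        for cleaned in (line.strip().lstrip("-+ ") for line in log_text.splitlines())
        if cleaned
    ]
    if not lines:
        return ""
    head, sep, tail = lines[-1].partition(" - ")
    # Drop a leading "HH:MM:SS" timestamp if the line has one
    if sep and len(head) == 8 and head.count(":") == 2:
        return tail
    return lines[-1]


def count_last_steps(log_dir: str, pattern: str = "*.log") -> Counter:
    """Count how often each final step occurs across all matching log files."""
    return Counter(
        last_step(path.read_text(errors="replace"))
        for path in Path(log_dir).glob(pattern)
    )


if __name__ == "__main__":
    # "logs" is a placeholder path; point it at the directory with the log files
    for step, count in count_last_steps("logs").most_common():
        print(f"{count:3d}  {step}")
```

If most runs end on the same step, that step is a reasonable place to start looking. Logs from runs that were stopped manually would need to be excluded by hand.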

Grzywax commented 10 months ago

Yes, my previous issue was with "freezing", but since I deactivated Quick Edit it hasn't happened again.

But this one is different. I'm not able to run CoreCycler unattended for more than a day and a half. My theory is that some of the "best" cores still have a curve value that is too high, so they clock too high in ultra-light tasks. When CoreCycler suspends the stress test, the best core is able to clock high, and if that core is running the CoreCycler script at that moment, it glitches and the script exits.

I did not check the Windows logs (no time for that deep dive ;] ).

But next time it happens I can post the log fragment where it stopped, to see if it's the same step.

sp00n commented 10 months ago

If you haven't deleted the old log files, they should still be in the logs directory.

It's been a long time since I've let CoreCycler run for so long, it was still in the early days of writing it, so I could think of various issues for these long runtimes. Maybe a buffer overflow somewhere, maybe the log files become too large, maybe there's a memory leak... therefore checking the Windows Event Log might be useful to determine what's going on there, maybe there's some entry in there pointing in the right direction.

Grzywax commented 10 months ago

So today it happened again.

Here is the output:

Here is the log:

14:06:27 - Getting new log file entries
Getting new log entries starting at position 120928 / Line 4431
The new log file entries:
- [Line 4432] Self-test 16384K passed!
New file position: 120954 / Line 4432
14:06:27 - Checking CPU usage: 3.12%
14:06:28 - Tick 21 of max 36
Remaining max runtime: 120s
14:06:37 - Suspending the stress test process for 1000 milliseconds
Suspended: True
Resuming the stress test process
Resumed: True
14:06:39 - Getting new log file entries
No file size change for the log file
14:06:39 - Checking CPU usage: 3.12%
14:06:40 - Tick 22 of max 36
Remaining max runtime: 108s
14:06:49 - Suspending the stress test process for 1000 milliseconds
Suspended: True
Resuming the stress test process
Resumed: True
14:06:51 - Getting new log file entries
Getting new log entries starting at position 120954 / Line 4432
The new log file entries:
- [Line 4433] [Fri Sep 1 14:06:41 2023]
- [Line 4434] Self-test 18432K passed!
New file position: 121008 / Line 4434
14:06:51 - Checking CPU usage: 3.12%
14:06:52 - Tick 23 of max 36
Remaining max runtime: 96s
14:07:01 - Suspending the stress test process for 1000 milliseconds

And in the Windows log I got this:

01/09/2023 14:07:01
Faulting application name: powershell.exe, version: 10.0.19041.2913, time stamp: 0xcb0b8e31
Faulting module name: clr.dll, version: 4.8.9167.0, time stamp: 0x648f6bcc
Exception code: 0xc0000005
Fault offset: 0x000000000009d843
Faulting process id: 0x1864
Faulting application start time: 0x01d9dc29fa19d58f
Faulting application path: C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe
Faulting module path: C:\Windows\Microsoft.NET\Framework64\v4.0.30319\clr.dll
Report Id: efda66dd-b2e7-4684-bc0d-dea89c5d6be7
Faulting package full name:
Faulting package-relative application ID:
Application, Application Error, ID: 1000, level: Error

and this one:

01/09/2023 14:07:01
Application: powershell.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FFD4252D843 (00007FFD42490000) with exit code 80131506.
Source: .NET Runtime, Event ID: 1023
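The two entries appear to describe the same crash. As a quick check, subtracting the address in parentheses from the IP in the .NET Runtime entry reproduces the fault offset from the Application Error entry, which suggests the value in parentheses is the load address of clr.dll. The sketch below only does that arithmetic with the values quoted above; the function name is made up for illustration.

```python
def offset_in_module(ip_hex: str, base_hex: str) -> int:
    """Offset of the faulting instruction relative to the module's load address."""
    return int(ip_hex, 16) - int(base_hex, 16)


ip = "00007FFD4252D843"    # "at IP ..." from the .NET Runtime entry (ID 1023)
base = "00007FFD42490000"  # value in parentheses in the same entry
fault_offset = int("0x000000000009d843", 16)  # "Fault offset" from the ID 1000 entry

offset = offset_in_module(ip, base)
print(hex(offset))             # 0x9d843
print(offset == fault_offset)  # True
```

The exit code 80131506 is the HRESULT 0x80131506, which .NET defines as COR_E_EXECUTIONENGINE, i.e. a fatal error inside the runtime itself rather than an exception thrown by the script.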

At the same second there was a service logon:

Special privileges assigned to new logon.

Subject:
Security ID: SYSTEM
Account Name: SYSTEM
Account Domain: NT AUTHORITY
Logon ID: 0x3E7

Privileges:
SeAssignPrimaryTokenPrivilege
SeTcbPrivilege
SeSecurityPrivilege
SeTakeOwnershipPrivilege
SeLoadDriverPrivilege
SeBackupPrivilege
SeRestorePrivilege
SeDebugPrivilege
SeAuditPrivilege
SeSystemEnvironmentPrivilege
SeImpersonatePrivilege
SeDelegateSessionUserImpersonatePrivilege

I also get these errors sometimes. It is 100% not the cores being stressed that fail, but something else... It lasts from a second to a minute and can mark one or several cores in a row as bad:

Grzywax commented 10 months ago

I would say it could be some kind of leak. It looks like an overflow of some sort.

sp00n commented 10 months ago

https://stackoverflow.com/questions/4367664/application-crashes-with-internal-error-in-the-net-runtime

One response there suggests that it might be a hardware issue, so it could be related to overclocking (of the CPU or the RAM or both). Other responses say that, at least at some point, it was a bug in the .NET Framework itself. Maybe you could try to update your .NET installation and see if that helps.

You could also try to disable the suspend functionality by setting suspendPeriodically = 0 in your config. It will potentially decrease the usefulness of the stress test by removing the simulated "load switches", but I've noticed that there is a potential memory leak when calling this function, so for long runtimes it may actually hit a limit where something breaks.

If you decide to test either of these, please let me know the results.

Grzywax commented 10 months ago

While tuning the CPU I have the RAM at the JEDEC setting of 4800 (these are 6000 CL30 sticks), so that's not it. The CPU, sure, it's in the middle of tuning ;]

Yesterday I updated the .NET packages, but today it stopped again with the same errors:

CoreCycler log:

             + 11:28:52 - Checking CPU usage: 3.12%
             + 
             + 11:28:53 - Tick 11 of max 36
             +            Remaining max runtime: 240s
             + 11:29:02 - Suspending the stress test process for 1000 milliseconds
             +            Suspended: True
             +            Resuming the stress test process
             +            Resumed: True
             + 11:29:04 - Getting new log file entries
             +            No file size change for the log file
             + 11:29:04 - Checking CPU usage: 3.12%
             + 
             + 11:29:05 - Tick 12 of max 36
             +            Remaining max runtime: 228s
             + 11:29:14 - Suspending the stress test process for 1000 milliseconds

Win:

Application: powershell.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FFAF31DCE9E (00007FFAF3140000) with exit code 80131506.

Win:

Faulting application name: powershell.exe, version: 10.0.19041.2913, time stamp: 0xcb0b8e31
Faulting module name: clr.dll, version: 4.8.9181.0, time stamp: 0x64b85478
Exception code: 0xc0000005
Fault offset: 0x000000000009ce9e
Faulting process id: 0x1ce8
Faulting application start time: 0x01d9dd64af7bee5c
Faulting application path: C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe
Faulting module path: C:\Windows\Microsoft.NET\Framework64\v4.0.30319\clr.dll
Report Id: 2c63bc0d-a169-4e03-a9b7-fe2ee2731fe9
Faulting package full name:
Faulting package-relative application ID:
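To compare this entry with the one from before the .NET update, the relevant fields can be pulled out of the event text. This is a rough sketch with a hypothetical `parse_fault` helper; the two strings are shortened copies of the Application Error entries quoted in this thread.

```python
import re

# Fields of interest in an "Application Error" entry
PATTERNS = {
    "module": r"Faulting module name: ([^,\s]+)",
    "module_version": r"Faulting module name: [^,]+, version: ([\d.]+)",
    "exception_code": r"Exception code: (0x[0-9a-fA-F]+)",
    "fault_offset": r"Fault offset: (0x[0-9a-fA-F]+)",
}


def parse_fault(event_text: str) -> dict:
    """Extract selected fields from the text of an Application Error entry."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, event_text)
        result[key] = match.group(1) if match else None
    return result


before_update = parse_fault(
    "Faulting module name: clr.dll, version: 4.8.9167.0, time stamp: 0x648f6bcc "
    "Exception code: 0xc0000005 Fault offset: 0x000000000009d843"
)
after_update = parse_fault(
    "Faulting module name: clr.dll, version: 4.8.9181.0, time stamp: 0x64b85478 "
    "Exception code: 0xc0000005 Fault offset: 0x000000000009ce9e"
)

# Print only the fields that differ between the two crashes
for key in PATTERNS:
    if before_update[key] != after_update[key]:
        print(f"{key}: {before_update[key]} -> {after_update[key]}")
```

The clr.dll version differs between the two entries, so the update did take effect, while the faulting module and the exception code are the same. Because the builds differ, the two fault offsets cannot be compared directly.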

sp00n commented 3 weeks ago

Closing, as to me this looks like a hardware / overclock / undervolt issue. If you or anyone else sees this happening, let me know. But I guess I have no way to debug this without access to the faulting system.

sp00n commented 1 week ago

I've now had that .NET crash with exit code 80131506 as well, which simply kills the powershell.exe and therefore also CoreCycler. Also while running Prime95. I have no real angle to investigate this though, so let's see if it happens again.

sp00n commented 1 week ago

I'm not making much progress. I thought I had solved it by refactoring some of my code, and it ran fine for over 7 hours where before it would stop within an hour, but then it returned. I'm also not ruling out it's connected to the Remote Desktop Connection I'm using. Or it's indeed an unstable undervolt, although I haven't seen an error message for quite some time now.

I managed to get a crash dump when it happened, but I'm not really good at understanding it. The crash seems to be due to a Garbage Collector error where it tries to read memory from address 0 (i.e. null).

ExceptionAddress: 00007ffd3b357fba (clr!WKS::gc_heap::find_first_object+0x00000000000000ea)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000001
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 0000000000000000
Attempt to read from address 0000000000000000

Or, put differently: in powershell.exe.5240.dmp the assembly instruction at clr!WKS::gc_heap::find_first_object+ea in C:\Windows\Microsoft.NET\Framework64\v4.0.30319\clr.dll from Microsoft Corporation has caused an access violation exception (0xC0000005) when trying to read from memory location 0x00000000 on thread [7]

My experience with crash dumps and debugging with crash dumps is close to zero, so I'm not sure how to proceed from here.
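For reference, the two parameters in the exception record above follow the documented layout for access violations: the first is the access type (0 for a read, 1 for a write, 8 for a data execution prevention violation) and the second is the address that was accessed. The helper below is a small illustrative sketch, not part of any debugger, that decodes them.

```python
from typing import Sequence

STATUS_ACCESS_VIOLATION = 0xC0000005
ACCESS_TYPES = {0: "read", 1: "write", 8: "execute (DEP)"}


def describe_access_violation(code: int, params: Sequence[int]) -> str:
    """Describe an exception record given its code and numeric parameters."""
    if code != STATUS_ACCESS_VIOLATION or len(params) < 2:
        return f"exception 0x{code:08X}"
    access = ACCESS_TYPES.get(params[0], "unknown access")
    return f"access violation: {access} at 0x{params[1]:016X}"


# Values from the crash dump: ExceptionCode c0000005, Parameter[0] = 0, Parameter[1] = 0
print(describe_access_violation(0xC0000005, [0, 0]))
# access violation: read at 0x0000000000000000
```

This matches the debugger's own summary: a read from address 0, i.e. a null pointer dereference inside the garbage collector.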

sp00n / corecycler

.NET crashes with exit code 80131506 and kills powershell.exe & CoreCycler #56