snoplus / orca

Git repository tracking the main Orca svn for SNO+ development

Frequent ORCA crashes at CRSU when restarting/resyncing #503

Open mnirkko opened 6 years ago

mnirkko commented 6 years ago

We have commissioned and tested the remote control room at Sussex (CRSU). Unfortunately, many ORCA crashes occurred when attempting to restart/resync runs. It seems they all had the same type of exception:

  Exception Type:        EXC_BAD_INSTRUCTION (SIGILL)
  Exception Codes:       0x0000000000000001, 0x0000000000000000
  Exception Note:        EXC_CORPSE_NOTIFY

  Termination Signal:    Illegal instruction: 4
  Termination Reason:    Namespace SIGNAL, Code 0x4
  Terminating Process:   exc handler [0]

  Application Specific Information:
  *** Terminating app due to uncaught exception 'NSGenericException', reason: '-[NSAlert runModal] may only be invoked from the main thread. Behavior on other threads is undefined.

We also weren't able to resync runs, apparently because the run file wasn't opened in time (we have a latency of 100-105 ms), and we ended up with some default settings and no current PMT information. The builder instructed us to restart the run immediately. Doing so would usually result in the above crash.
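
For anyone reading along: the exception message above says that a modal alert was requested from a background thread. The usual remedy is to hand the alert over to the main thread instead of calling it directly (in Cocoa this is typically done with `dispatch_async(dispatch_get_main_queue(), ...)`). Below is a minimal, language-neutral sketch of that pattern in Python. It is not Orca code and not necessarily what any fix does; `UiLoop`, `show_alert`, `post` and `resync_worker` are hypothetical names made up for illustration.

```python
import queue
import threading


class UiLoop:
    """Hypothetical stand-in for an application's main (UI) thread run loop."""

    def __init__(self):
        self.thread = threading.current_thread()  # thread that owns the UI
        self.tasks = queue.Queue()
        self.shown = []  # alerts that were "displayed"

    def show_alert(self, message):
        # Mimics the rule from the crash log: alerts only on the owning thread.
        if threading.current_thread() is not self.thread:
            raise RuntimeError("alert may only be shown from the UI thread")
        self.shown.append(message)

    def post(self, func, *args):
        # Thread-safe: queue work to be executed later on the UI thread.
        self.tasks.put((func, args))

    def run_until_stopped(self):
        # Runs on the UI thread; executes posted tasks until the sentinel.
        while True:
            func, args = self.tasks.get()
            if func is None:
                return
            func(*args)


def resync_worker(ui, errors):
    try:
        ui.show_alert("called directly from worker")  # wrong: off the UI thread
    except RuntimeError as exc:
        errors.append(str(exc))
    ui.post(ui.show_alert, "resync failed")  # right: hand over to the UI thread
    ui.post(None)  # sentinel: stop the loop


def demo():
    ui = UiLoop()
    errors = []
    worker = threading.Thread(target=resync_worker, args=(ui, errors))
    worker.start()
    ui.run_until_stopped()
    worker.join()
    return ui.shown, errors
```

Calling `demo()` returns the alerts that were shown on the owning thread together with the error raised by the direct call from the worker. The point of the sketch is only that the worker never touches the UI itself; it queues the request and lets the owning thread run it.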

mnirkko commented 6 years ago

It looks like #504 fixes the above crash. However, the underlying reason for the RESYNC not working has not been found yet. We attempted to connect to the teststand and start a run there. While some issues were seen here and there, these were sorted by initialising the hardware correctly and changing the server names. However, it was not possible to interact with the teststand remotely in Orca: there was no crash or error message, but Orca would simply not react. In Xcode, the following errors were shown:

2018-05-01 14:39:00.412401-0400 Orca[76172:13829179] *** Assertion failure in -[NSCustomImageRep encodeWithCoder:], /BuildRoot/Library/Caches/com.apple.xbs/Sources/AppKit/AppKit-1561.40.112/AppKit.subproj/NSCustomImageRep.m:193
2018-05-01 14:39:00.417235-0400 Orca[76172:13829179] [General] An uncaught exception was raised
2018-05-01 14:39:00.417267-0400 Orca[76172:13829179] [General] Attempt to encode a delegate-based NSCustomImageRep with no delegate (was the weakly held delegate deallocated?).
2018-05-01 14:41:29.538375-0400 Orca[76172:13829179] *** Assertion failure in -[NSCustomImageRep encodeWithCoder:], /BuildRoot/Library/Caches/com.apple.xbs/Sources/AppKit/AppKit-1561.40.112/AppKit.subproj/NSCustomImageRep.m:193
2018-05-01 14:41:29.538530-0400 Orca[76172:13829179] [General] Attempt to encode a delegate-based NSCustomImageRep with no delegate (was the weakly held delegate deallocated?).

This looks like an issue affecting the teststand only, as we did not observe this kind of behaviour when sending commands to the detector 10 days ago. Another, less likely, option is that this is a new issue caused by PR #504. Finally, we are locally running macOS 10.13.4, which is one update ahead of most machines, which run 10.13.3.