slaclab / pysmurf


epics crashes when using three slots simultaneously to operate detectors #713

Closed dpdutcher closed 2 years ago

dpdutcher commented 2 years ago

Describe the bug

In Princeton testing of the SO detector modules, operating three slots simultaneously causes EPICS to crash on one of the boards, and that board then needs to be hammered before it becomes operational again.

Here, "operating" means taking either IV curves or bias-step data using sodetlib functions. This problem has not occurred when operating two slots simultaneously, but it regularly occurs when operating three. The crash is not immediate: it has happened after one hour, two hours, or 12 hours. It is not always the same slot that crashes. The other two slots remain operational after the third one has crashed.

Additional details

Machine is smurf-srv15. Relevant log files for such a crash that happened today are:

Beginning of traceback from ocs-pysmurf docker:

2022-03-15T18:32:49+0000 Unexpected problem with CA circuit to server "localhost:5064" was "Connection reset by peer" - disconnecting
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "localhost:5064"
    Source File: ../cac.cpp line 1223
    Current Time: Tue Mar 15 2022 18:32:49.911947848
..................................................................
2022-03-15T18:33:49+0000 ufm_biasstep_sodetlib.py: cannot connect to smurf_server_s3:AMCc:FpgaTopLevel:AppTop:AppCore:RtmCryoDet:SpiCryo:write
2022-03-15T18:34:49+0000 ufm_biasstep_sodetlib.py: cannot connect to smurf_server_s3:AMCc:FpgaTopLevel:AppTop:AppCore:RtmCryoDet:SpiCryo:write
2022-03-15T18:34:54+0000 ufm_biasstep_sodetlib.py: cannot connect to smurf_server_s3:AMCc:FpgaTopLevel:AppTop:AppCore:RtmCryoDet:SpiCryo:read

msilvafe commented 2 years ago

@dpdutcher what was the solution for this?

dpdutcher commented 2 years ago

The interim solution was to run S._caput('smurf_server_s2:AMCc:SmurfProcessor:SOStream:BuilderEncode', 1) for each slot (the example here is for slot 2). The longer-term solution was that Jack made that setting the default in an update to the smurf-streamer. v0.4.0 certainly has the fix.
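
For anyone stuck on an older smurf-streamer, the per-slot workaround can be sketched roughly as below. This is only a sketch under assumptions: the helper names `builder_encode_pv` and `enable_builder_encode` are made up for illustration, the slot numbers in the usage note are placeholders, and the `smurf_server_s<slot>` prefix pattern is inferred from the PV names in this thread rather than from any documentation. The `caput` argument is expected to behave like `S._caput(pvname, value)`.

```python
# Hypothetical helpers; not part of pysmurf or sodetlib.

def builder_encode_pv(slot):
    """Build the BuilderEncode PV name for one slot.

    Assumes the EPICS prefix follows the 'smurf_server_s<slot>' pattern
    seen in the PV names quoted in this issue.
    """
    return f"smurf_server_s{slot}:AMCc:SmurfProcessor:SOStream:BuilderEncode"


def enable_builder_encode(caput, slots):
    """Set BuilderEncode to 1 on every slot in `slots`.

    `caput` is any callable taking (pv_name, value), for example S._caput.
    Returns the list of PV names that were written, in order.
    """
    written = []
    for slot in slots:
        pv = builder_encode_pv(slot)
        caput(pv, 1)
        written.append(pv)
    return written
```

With a pysmurf control object `S` already set up, this would be called as `enable_builder_encode(S._caput, [2, 3, 4])`, substituting whichever slots are actually in use. On smurf-streamer v0.4.0 or later it should not be needed, since the setting is the default there.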