This is seeming very similar to https://github.com/slaclab/pysmurf/issues/713 , which was "solved" by an update to the smurf-streamer, but perhaps I marked it as closed prematurely. Anecdotally, I've once again seen epics crashing when operating three slots for an overnight dataset, while running two slots works fine.
Just for reference, it seems like this generated a core dump (core_1660699519_python3_11091_11979_1001_1000) with the following backtrace:
#0 0x00007f446004f270 in SmurfBuilder::FrameFromSamples(std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>, std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>) ()
from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
[Current thread is 1 (Thread 0x7f443f4a5700 (LWP 107))]
(gdb) bt
#0 0x00007f446004f270 in SmurfBuilder::FrameFromSamples(std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>, std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>) ()
from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#1 0x00007f4460050973 in SmurfBuilder::FlushStash() () from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#2 0x00007f4460050e1d in SmurfBuilder::ProcessStashThread(SmurfBuilder*) () from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#3 0x00007f44849926ef in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f4487cfc6db in start_thread (arg=0x7f443f4a5700) at pthread_create.c:463
#5 0x00007f448803588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
I will look more into it, but if you find a way to semi-reliably reproduce it, that would be very helpful.
It definitely seems like an issue with the streamer, though, possibly a race condition or something similar.
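For what it's worth, here is a minimal Python sketch of the kind of failure mode I have in mind. The real builder is C++, where this sort of unsynchronized access corrupts memory or segfaults rather than raising an exception, so this analogy just makes the bad interleaving visible; none of the names below come from the actual streamer code:

import threading
from collections import deque

stash = deque()
stop = threading.Event()

def writer():
    # Plays the role of the frame callback, appending samples to the stash.
    while not stop.is_set():
        stash.append(object())

def flusher():
    # Plays the role of FlushStash(), walking the stash with no lock held.
    try:
        while not stop.is_set():
            for sample in stash:  # iterating while writer() mutates
                pass
    except RuntimeError as err:  # "deque mutated during iteration"
        print('race detected:', err)
        stop.set()

threads = [threading.Thread(target=writer), threading.Thread(target=flusher)]
for t in threads:
    t.start()
for t in threads:
    t.join()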
Sent this to Daniel in Slack, but if you run this command it will enable more debug logging in the smurf-streamer dockers:
S._caput('smurf_server_s2:AMCc:SmurfProcessor:SOStream:DebugBuilder', 1)
which could provide some good info related to what's going wrong
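If you're not in a pysmurf session, the same PV can also be set directly with pyepics (this assumes the smurf_server_s2 prefix from the logs above; adjust for your slot):

from epics import caput
caput('smurf_server_s2:AMCc:SmurfProcessor:SOStream:DebugBuilder', 1)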
Hi Daniel, I was able to debug this a bit after seeing it in some SAT1 tests. I added some queue limits in this PR of the smurf-streamer, which seems to have fixed a lot of the issues we were seeing when operating multiple slots on the SAT. If you want to upgrade, you can use the docker tag simonsobs/smurf-streamer:v0.4.1-2-g142bddf.
Since this has fixed things on SAT1 I'm going to close this issue, but feel free to re-open if you upgrade and still see crashes.
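For anyone following along, the general idea behind the queue limits is roughly the following. This is a Python sketch of the concept only, not the actual C++ change, and the name push_sample is made up for illustration:

import queue

# Bounded stash: if the consumer (the flush thread) falls behind,
# drop samples instead of letting the queue grow without bound.
stash = queue.Queue(maxsize=10000)

def push_sample(sample):
    try:
        stash.put_nowait(sample)
    except queue.Full:
        pass  # drop (or count) the sample rather than exhaust memory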
This happened with smurf-streamer version v0.4.1-3-g728183a . I was only doing things on one slot at the time. I don't see anything out of the ordinary in the smurf-streamer log or in the core dumps. I can't communicate with the board now; I'm just getting "epics failed to respond" errors.
Original crash message:
RuntimeError: epics failed to respond
During handling of the above exception, another exception occurred:
...
epics.ca.ChannelAccessGetFailure: Get failed; status code: 192
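As an aside, here is a minimal sketch of how a one-off get could be guarded against these transient failures. caget_with_retry is a hypothetical helper, not pysmurf API, and the retry counts are arbitrary; in this case the board stopped responding entirely, so retries alone wouldn't have helped:

import time
import epics

def caget_with_retry(pvname, retries=5, wait=1.0):
    # Hypothetical helper: retry a channel-access get a few times,
    # since these failures are sometimes transient.
    for attempt in range(retries):
        try:
            value = epics.caget(pvname, timeout=5)
            if value is not None:
                return value
        except epics.ca.ChannelAccessGetFailure:
            pass
        time.sleep(wait)
    raise RuntimeError(f'epics failed to respond for {pvname}')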
What were you doing when it crashed?
I was running https://github.com/simonsobs/readout-script-dev/blob/master/ddutcher/ufm_biasstep_sodetlib.py ; it should have been running bias steps at the time it crashed. The last messages in stdout before the timeout were
[ 2022-11-12 05:29:53 ] Waiting 3 sec after switching to hcm
[ 2022-11-12 05:29:56 ] Input downsample factor is None. Using value already in pyrogue: 1
[ 2022-11-12 05:29:56 ] FLUX RAMP IS DC COUPLED.
[ 2022-11-12 05:30:00 ] caput smurf_server_s2:AMCc:SmurfProcessor:Unwrapper:reset 1
[ 2022-11-12 05:30:00 ] caput smurf_server_s2:AMCc:SmurfProcessor:Filter:reset 1
[ 2022-11-12 05:30:02 ] Writing to file : /data/smurf_data/20221112/crate1slot2/1668227025/outputs/1668231003.dat
[ 2022-11-12 05:30:02 ] /data/smurf_data/20221112/crate1slot2/1668227025/outputs/1668231003_mask.txt
[ 2022-11-12 05:30:02 ] Writing frequency mask.
[ 2022-11-12 05:30:10 ] Command failed: smurf_server_s2:AMCc:FpgaTopLevel:AppTop:AppCore:SysgenCryo:Base[2]:CryoChannels:centerFrequencyArray
[ 2022-11-12 05:30:10 ] Retry attempt 1 of 5
Interesting... this could be the same issue but I don't see a core-dump file on your system.
It seems like your smurf-server, being one of the first ones issued, is also under-spec'ed compared to the ones we're using on the SAT, so it kind of makes sense that you're seeing this most often. We were seeing it more frequently on a system of ours that was having RAM issues. Replacing it with an official one might alleviate this issue...
Apart from replacing your server there are a few things we can probably try that might help:
Got this error when running uxm_setup, during the estimate_phase_delay portion. The full traceback is below, though I know users often encounter this error in various places, so this can be a catch-all thread.
In this particular instance, there was no associated error in the smurf-streamer docker logs, I could still communicate with the board via the pysmurf ipython session, and I could just restart the uxm_setup script with no hammering required.