RuntimeError: epics failed to respond

dpdutcher commented 2 years ago

Got this error when running uxm_setup, during the estimate_phase_delay portion. The full traceback is below, though I know users often enounter this error in various places, so this can be a catch-all thread.

In this particular instances, there was no associated error in the smurf-streamer docker logs, and I could still communicate with the board via the pysmurf-ipython session, and I could just restart the uxm_setup script with no hammering required.

[ 2022-08-15 14:55:17 ]  Running find_freq
Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 234, in 'calling callback function'
  File "/usr/local/lib/python3.6/dist-packages/epics/ca.py", line 730, in _onGetEvent
    result = memcopy(dbr.cast_args(args))
  File "/usr/local/lib/python3.6/dist-packages/epics/dbr.py", line 308, in cast_args
    ntype = native_type(ftype)
  File "/usr/local/lib/python3.6/dist-packages/epics/dbr.py", line 255, in native_type
    if ftype > CTRL_STRING:
TypeError: '>' not supported between instances of '_ctypes.PyCSimpleType' and 'int'
[ 2022-08-15 14:55:22 ]  Command failed: smurf_server_s6:AMCc:FpgaTopLevel:AppTop:AppCore:SysgenCryo:Base[5]:bandCenterMHz
[ 2022-08-15 14:55:22 ]  Retry attempt 1 of 5
[ 2022-08-15 14:55:27 ]  Retry attempt 2 of 5
[ 2022-08-15 14:55:32 ]  Retry attempt 3 of 5
[ 2022-08-15 14:55:37 ]  Retry attempt 4 of 5
[ 2022-08-15 14:55:42 ]  Retry attempt 5 of 5
Traceback (most recent call last):
  File "/devel/scripts/uxm_setup.py", line 18, in <module>
    uxm_setup.uxm_setup(S, cfg, bands=args.bands)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/pub.py", line 50, in wrapper
    rv = func(S, *args, **kwargs)
  File "/sodetlib/sodetlib/operations/uxm_setup.py", line 472, in uxm_setup
    S, cfg, bands, update_cfg=update_cfg)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/pub.py", line 50, in wrapper
    rv = func(S, *args, **kwargs)
  File "/sodetlib/sodetlib/operations/uxm_setup.py", line 209, in setup_phase_delay
    band_delay_us, _ = S.estimate_phase_delay(b, make_plot=True, show_plot=False)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/pub.py", line 50, in wrapper
    rv = func(S, *args, **kwargs)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/smurf_util.py", line 310, in estimate_phase_delay
    freq_dsp,resp_dsp=self.find_freq(band,subband=dsp_subbands)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/pub.py", line 50, in wrapper
    rv = func(S, *args, **kwargs)
  File "/usr/local/src/pysmurf/python/pysmurf/client/tune/smurf_tune.py", line 3459, in find_freq
    band_center = self.get_band_center_mhz(band)
  File "/usr/local/src/pysmurf/python/pysmurf/client/command/smurf_command.py", line 2345, in get_band_center_mhz
    **kwargs)
  File "/usr/local/src/pysmurf/python/pysmurf/client/command/smurf_command.py", line 186, in _caget
    raise RuntimeError("epics failed to respond")
RuntimeError: epics failed to respond

dpdutcher commented 2 years ago

Currently, this is seeming very similar to https://github.com/slaclab/pysmurf/issues/713 , which was "solved" by an update to the smurf-streamer but perhaps I marked as closed prematurely. Anecdotally, I've once again seen epics crashing when operating three slots for an overnight dataset, but running two slots works fine.

jlashner commented 2 years ago

just for reference, seems like this generated a core dump (core_1660699519_python3_11091_11979_1001_1000) with the following backtrace:

#0  0x00007f446004f270 in SmurfBuilder::FrameFromSamples(std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>, std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>) ()
   from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
[Current thread is 1 (Thread 0x7f443f4a5700 (LWP 107))]
(gdb) bt
#0  0x00007f446004f270 in SmurfBuilder::FrameFromSamples(std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>, std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>) ()
   from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#1  0x00007f4460050973 in SmurfBuilder::FlushStash() () from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#2  0x00007f4460050e1d in SmurfBuilder::ProcessStashThread(SmurfBuilder*) () from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#3  0x00007f44849926ef in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f4487cfc6db in start_thread (arg=0x7f443f4a5700) at pthread_create.c:463
#5  0x00007f448803588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I will look more into it but if you find a way to semi-reliably reproduce it that would be very helpful

jlashner commented 2 years ago

But definitely seems like an issue with the streamer, possibly a race condition or something

jlashner commented 2 years ago

Sent this to Daniel in Slack, but if you run this command it will enable more debugging lots in the smurf-streamer dockers...

S._caput('smurf_server_s2:AMCc:SmurfProcessor:SOStream:DebugBuilder', 1)

which could provide some good info related to what's going wrong

jlashner commented 1 year ago

Hi Daniel, I was able to debug this a bit after seeing it in some SAT1 tests. I added some queue limits in this PR of the smurf-streamer, which seems to have fixed a lot of the issues we were seeing when operating multiple slots on the SAT. If you want to upgrade you can use the docker tag simonsobs/smurf-streamer:v0.4.1-2-g142bddf.

Since this has fixed things on SAT1 I'm going to close this issue, but feel free to re-open if you upgrade and still see crashes.

dpdutcher commented 1 year ago

Happened with smurf-streamer version v0.4.1-3-g728183a . I was only doing things on one slot at the time. I don't see any out of the ordinary in the smurf streamer log or in core dumps. I can't communicate with the board now, just getting "epics failed to respond" errors.

Original crash message:

RuntimeError: epics failed to respond
During handling of the above exception, another exception occurred:
...
epics.ca.ChannelAccessGetFailure: Get failed; status code: 192

jlashner commented 1 year ago

What were you doing when it crashed?

dpdutcher commented 1 year ago

I was running https://github.com/simonsobs/readout-script-dev/blob/master/ddutcher/ufm_biasstep_sodetlib.py , it should have been running bias steps at the time it crashed. The last messges in stdout before the timeout were

[ 2022-11-12 05:29:53 ]  Waiting 3 sec after switching to hcm
[ 2022-11-12 05:29:56 ]  Input downsample factor is None. Using value already in pyrogue: 1
[ 2022-11-12 05:29:56 ]  FLUX RAMP IS DC COUPLED.
[ 2022-11-12 05:30:00 ]  caput smurf_server_s2:AMCc:SmurfProcessor:Unwrapper:reset 1
[ 2022-11-12 05:30:00 ]  caput smurf_server_s2:AMCc:SmurfProcessor:Filter:reset 1
[ 2022-11-12 05:30:02 ]  Writing to file : /data/smurf_data/20221112/crate1slot2/1668227025/outputs/1668231003.dat
[ 2022-11-12 05:30:02 ]  /data/smurf_data/20221112/crate1slot2/1668227025/outputs/1668231003_mask.txt
[ 2022-11-12 05:30:02 ]  Writing frequency mask.
[ 2022-11-12 05:30:10 ]  Command failed: smurf_server_s2:AMCc:FpgaTopLevel:AppTop:AppCore:SysgenCryo:Base[2]:CryoChannels:centerFrequencyArray
[ 2022-11-12 05:30:10 ]  Retry attempt 1 of 5

jlashner commented 1 year ago

Interesting... this could be the same issue but I don't see a core-dump file on your system.

It seems like your smurf-server, being one of the first ones issued, is also under-spec'ed compared to the ones we're using on the SAT, so it kind of makes sense you're seeing this the most often. We were seeing it more frequently on our system that was having RAM issues. Replacing it with an official one might alleviate this issue...

Apart from replacing your server there are a few things we can probably try that might help:

Lower the max queue-size for your system. I would have to change the smurf-streamer to make this a configurable parameter, but that kind of makes sense to me.
Have the bias-step function optionally take downsampled data. Right now bias steps always disables downsampling, but that is not really necessary unless you care about calculating tau-eff. I think the combination of high data rates / a lot of epics calls is what's causing this issue in the first place, so lowering the sampling rate when possible would help.

simonsobs / sodetlib

RuntimeError: epics failed to respond #273