simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License

Smurf recorder quits on jackhammer soft_reset #78

Closed: jlashner closed this issue 3 years ago

jlashner commented 4 years ago

Looks like the smurf-recorder at UCSD is shutting down whenever I run jackhammer soft_reset. Strangely, it seems to be handling the disconnection/reconnection well, so I'm not sure exactly why it's crashing. Here's my full output:

adaq1@adaq1:~/so/ocs-site-configs/ucsd/adaq1$ docker-compose up ocs-smurf-recorder-s2
adaq1_sisock-crossbar_1 is up-to-date
Starting adaq1_ocs-smurf-recorder-s2_1 ... done
Attaching to adaq1_ocs-smurf-recorder-s2_1
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 site_config is setting values of "auto_start" to "True".
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 site_config is setting values of "time_per_file" to "600".
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 site_config is setting values of "data_dir" to "/data/timestreams".
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 site_config is setting values of "port" to "4532".
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 site_config is setting values of "address" to "192.168.0.134".
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 Using OCS version 0.6.0+1.g569b428
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ observatory.smurf-recorder-s2
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 log_file is apparently None
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 transport connected
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 session joined:
ocs-smurf-recorder-s2_1   | SessionDetails(realm=<test_realm>,
ocs-smurf-recorder-s2_1   |                session=3069769554717731,
ocs-smurf-recorder-s2_1   |                authid=<KPSS-CYPK-WJN6-YWGN-Y33C-M6EG>,
ocs-smurf-recorder-s2_1   |                authrole=<server>,
ocs-smurf-recorder-s2_1   |                authmethod=anonymous,
ocs-smurf-recorder-s2_1   |                authprovider=static,
ocs-smurf-recorder-s2_1   |                authextra={'x_cb_node_id': None, 'x_cb_peer': 'tcp4:172.18.0.1:59280', 'x_cb_pid': 11},
ocs-smurf-recorder-s2_1   |                serializer=<cbor.batched>,
ocs-smurf-recorder-s2_1   |                resumed=None,
ocs-smurf-recorder-s2_1   |                resumable=None,
ocs-smurf-recorder-s2_1   |                resume_token=None)
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 startup-op: launching record
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 start called for record
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 record:0 Status is now "starting".
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 Data directory set to /data/timestreams
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 New file every 600 seconds
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 Listening to tcp://192.168.0.134:4532
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T01:59:59+0000 Writing to file /data/timestreams/15906/1590631199_000.g3
ocs-smurf-recorder-s2_1   | 2020-05-28T02:00:24+0000 Writing to file /data/timestreams/15906/1590631224_000.g3
ocs-smurf-recorder-s2_1   | FATAL (Unknown): Could not connect to tcp://192.168.0.134:4532 (Connection refused) (dataio.cxx:120 in void g3_istream_from_path(boost::iostreams::filtering_istream&, const string&, float))
ocs-smurf-recorder-s2_1   | 2020-05-28T02:02:40+0000 G3Reader could not connect.
ocs-smurf-recorder-s2_1   | FATAL (Unknown): Could not connect to tcp://192.168.0.134:4532 (Connection refused) (dataio.cxx:120 in void g3_istream_from_path(boost::iostreams::filtering_istream&, const string&, float))
ocs-smurf-recorder-s2_1   | 2020-05-28T02:02:40+0000 G3Reader could not connect.
ocs-smurf-recorder-s2_1   | FATAL (Unknown): Could not connect to tcp://192.168.0.134:4532 (Connection refused) (dataio.cxx:120 in void g3_istream_from_path(boost::iostreams::filtering_istream&, const string&, float))
ocs-smurf-recorder-s2_1   | 2020-05-28T02:02:41+0000 G3Reader could not connect.
ocs-smurf-recorder-s2_1   | 2020-05-28T02:02:42+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:02:48+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:02:53+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:02:58+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:03+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:09+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:14+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:19+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:24+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:29+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:34+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:55+0000 G3Reader connection established
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:55+0000 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 3069769554717731 before')
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:55+0000 transport disconnected
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:55+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f779f7453c8> in 1.7350322145009924 seconds.
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:55+0000 Writing to file /data/timestreams/15906/1590631435_000.g3
ocs-smurf-recorder-s2_1   | 2020-05-28T02:03:55+0000 Main loop terminated.
adaq1_ocs-smurf-recorder-s2_1 exited with code 0
BrianJKoopman commented 4 years ago

Interesting, did this also occur before the recent change to the crossbar config?

jlashner commented 4 years ago

That's a good question; maybe it's an autoping timeout. I'm not sure, I'd have to switch back and reboot crossbar to test.

Seems kind of weird that it's exiting exactly when it's able to reconnect to the G3NetworkSender. Btw, the "G3Reader connection established" messages occur when the smurf-streamer is shut down, and the "Writing to file..." message at 2:03:55 is when it starts back up.
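
For reference, a (re)connect loop of roughly this shape would reproduce the pattern in the log above. This is a minimal sketch, not the recorder's actual code; the 5-second timeout, the reopen-on-empty-read behavior, and the exact log wording are assumptions based on the message spacing in the log.

```python
# Illustrative sketch only, not the recorder's actual code. A loop of this
# shape would produce both the "G3Reader could not connect" lines and the
# repeated "G3Reader connection established" lines seen above.
import time
from spt3g import core

def establish_reader(address, timeout=5, log=print):
    """Keep trying to open a G3Reader on `address` until it succeeds."""
    while True:
        try:
            # The float timeout corresponds to the third argument of
            # g3_istream_from_path in the FATAL messages above (assumed
            # to be 5 seconds here).
            return core.G3Reader(address, timeout=timeout)
        except RuntimeError:
            # Raised when the TCP connection is refused.
            log("G3Reader could not connect.")
            time.sleep(1)

def read_loop(address, timeout=5, log=print):
    """Read frames, reopening the reader whenever it stops delivering data."""
    reader = establish_reader(address, timeout, log)
    log("G3Reader connection established")
    while True:
        frames = reader.Process(None)  # blocks for up to `timeout` seconds
        if not frames:
            # No data within the timeout: drop the reader and reconnect. If
            # connects keep succeeding but no frames arrive, this would log
            # "connection established" roughly every `timeout` seconds.
            reader = establish_reader(address, timeout, log)
            log("G3Reader connection established")
            continue
        # ... hand frames off to the file writer ...
```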

BrianJKoopman commented 4 years ago

It might be a valuable test. My thought was that something in the reconnection might be blocking the main event loop when it shouldn't be, so the ping doesn't go out before the timeout expires. Is the timeout set to 5 seconds?

It's also a bit strange there are 12 "G3Reader connection established" messages.
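
To illustrate the suspected failure mode (this is a minimal sketch, not the agent's code; the 5-second sleep and function names are placeholders): if a blocking connect/read runs on the Twisted reactor thread, WebSocket ping/pong handling stalls and a short autoping timeout on the router side will drop the WAMP transport; pushing the blocking call into a worker thread with deferToThread keeps the reactor responsive.

```python
# Minimal sketch of the suspected failure mode, not the agent's actual code.
import time
from twisted.internet import threads

def blocking_reconnect():
    time.sleep(5)          # placeholder for a blocking connect/read
    return "connected"

def reconnect_on_reactor_thread():
    # Bad: stalls the reactor for 5 s; if the router's autoping timeout is
    # shorter, the session is dropped (wamp.close.transport_lost).
    return blocking_reconnect()

def reconnect_in_worker_thread():
    # Better: deferToThread runs the blocking call in the reactor's thread
    # pool and returns a Deferred, so ping/pong handling continues.
    d = threads.deferToThread(blocking_reconnect)
    d.addCallback(lambda result: print("reconnected:", result))
    return d
```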

BrianJKoopman commented 3 years ago

Just to follow up on this: after talking on the phone, we realized this system is running OCS version 0.6.0+1.g569b428. The system should be updated to the latest version, which includes crossbar reconnection, and then monitored for this event.

jlashner commented 3 years ago

This continues to happen when the streamer is restarted even with the newest version of socs, so I'm reopening.

jlashner commented 3 years ago

I was also able to test this further at UCSD. It looks like now, if the recorder is connected to an existing smurf-streamer and the smurf-streamer restarts, the smurf-recorder will briefly lose its connection to crossbar after the streamer comes back online. This doesn't affect the main processing thread, so I think it's harmless. I have also verified that the agent is now able to successfully reconnect to crossbar after a few seconds without interrupting the data-processing operation, thanks to https://github.com/simonsobs/ocs/pull/180, so I'm going to close this issue.
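
Rough illustration of why a brief crossbar drop does not interrupt recording (made-up names, not the ocs implementation): the frame-handling loop runs in its own worker thread, while the WAMP session on the reactor thread is free to drop and reconnect on its own.

```python
# Illustrative only: a worker thread keeps recording while the WAMP session
# on the main/reactor thread drops and reconnects independently.
import threading
import time

stop_event = threading.Event()

def record_loop():
    # Placeholder for the recorder's main loop: keeps reading frames and
    # writing .g3 files regardless of the crossbar connection state.
    while not stop_event.is_set():
        time.sleep(1)

worker = threading.Thread(target=record_loop, daemon=True)
worker.start()
# Meanwhile, the reactor-thread WAMP session can lose its transport and
# reconnect a few seconds later without touching `worker`.
```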