simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License
12 stars 12 forks source link

ACU: Crash in sun monitor calculation #661

Closed BrianJKoopman closed 2 months ago

BrianJKoopman commented 2 months ago

I'm seeing the below error in the ACU agent from this line of the agent (note this links to the code at the commit running in the image below): https://github.com/simonsobs/socs/blob/f6c0fa735f25d73f90cf3cf2acbae6a55f280e2f/socs/agents/acu/agent.py#L2300

This occurred while updating to ocs v0.11.0 and the new Host Manager Agent. I believe things were fine before the update based on the Agent being visible, and not marked as down/cycling in ocs-web. The problem persists when I revert back though, so I think it's unrelated.

socs version: v0.4.5-26-gf6c0fa7-dev

$ sudo docker logs -f daq-satp2-ocs-acu-1
Args: ['--instance-id', 'acu', '--site-hub', 'ws://127.0.0.1:8004/ws', '--site-http', 'http://127.0.0.1:8004/call']
^[[AInstalled OCS Plugins: ['socs', 'ocs']

Renaming this process to: "ocs-agent:acu"
2024-04-29T21:22:58+0000 Using OCS version 0.10.4
2024-04-29T21:22:58+0000 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ satp2.acu
2024-04-29T21:22:58+0000 log_file is apparently None
2024-04-29T21:22:58+0000 transport connected
2024-04-29T21:22:58+0000 session joined: {'authextra': {'x_cb_node': 'e17019f77488-7',
               'x_cb_peer': 'tcp4:172.32.2.1:55502',
               'x_cb_pid': 14,
               'x_cb_worker': 'worker001'},
 'authid': 'XFVR-A733-US6X-QKP3-VPK6-S9YX',
 'authmethod': 'anonymous',
 'authprovider': 'static',
 'authrole': 'iocs_agent',
 'realm': 'test_realm',
 'resumable': False,
 'resume_token': None,
 'resumed': False,
 'serializer': 'cbor.batched',
 'session': 3320602893879999,
 'transport': {'channel_framing': 'websocket',
               'channel_id': {},
               'channel_serializer': None,
               'channel_type': 'tcp',
               'http_cbtid': None,
               'http_headers_received': None,
               'http_headers_sent': None,
               'is_secure': False,
               'is_server': False,
               'own': None,
               'own_fd': -1,
               'own_pid': 7,
               'own_tid': 7,
               'peer': 'tcp4:127.0.0.1:8004',
               'peer_cert': None,
               'websocket_extensions_in_use': None,
               'websocket_protocol': None}}
2024-04-29T21:22:58+0000 startup-op: launching monitor
2024-04-29T21:22:58+0000 start called for monitor
2024-04-29T21:22:58+0000 monitor:0 Status is now "starting".
2024-04-29T21:22:58+0000 startup-op: launching broadcast
2024-04-29T21:22:58+0000 start called for broadcast
2024-04-29T21:22:58+0000 broadcast:1 Status is now "starting".
2024-04-29T21:22:58+0000 startup-op: launching monitor_sun
2024-04-29T21:22:58+0000 start called for monitor_sun
2024-04-29T21:22:58+0000 monitor_sun:2 Status is now "starting".
2024-04-29T21:22:58+0000 monitor:0 Status is now "running".
2024-04-29T21:22:58+0000 broadcast:1 Status is now "running".
2024-04-29T21:22:58+0000 MonitorUDP starting on 10002
2024-04-29T21:22:58+0000 Starting protocol <socs.agents.acu.agent.ACUAgent.broadcast.<locals>.MonitorUDP object at 0x7fc9d3971b50>
2024-04-29T21:22:58+0000 monitor_sun:2 Status is now "running".
2024-04-29T21:22:58+0000 (Re-)computed Sun Safety Map (took 0.2s)
2024-04-29T21:22:59+0000 monitor_sun:2 CRASH: [Failure instance: Traceback: <class 'UnboundLocalError'>: local variable 't' referenced before assignment
/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py:874:callback
/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py:981:_startRunCallbacks
/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py:1075:_runCallbacks
/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py:1946:_gotResultInlineCallbacks
--- <exception caught here> ---
/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py:2000:_inlineCallbacks
/usr/local/lib/python3.8/dist-packages/socs/agents/acu/agent.py:2300:monitor_sun
]
2024-04-29T21:22:59+0000 monitor_sun:2 Status is now "done".
2024-04-29T21:23:01+0000 No UDP packets are being received.
2024-04-29T21:23:01+0000 Requesting UDP stream enable.
2024-04-29T21:25:10+0000 Exception while trying to enable stream: HTTPConnectionPool(host='192.168.12.5', port=8080): Max retries exceeded with url: /?Module=Services.PositionBroadcast&Chapter=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc9d3985a30>: Failed to establish a new connection: [Errno 110] Connection timed out'))
2024-04-29T21:25:10+0000 TCP connection timed out: 110: Connection timed out.
2024-04-29T21:25:10+0000 monitor process failed to query version! Will keep trying.
2024-04-29T21:25:10+0000 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 3320602893879999 before')
2024-04-29T21:25:10+0000 transport disconnected
2024-04-29T21:25:10+0000 waiting for reconnection
2024-04-29T21:25:10+0000 waiting at least 9.999998092651367 more seconds before giving up
2024-04-29T21:25:10+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fc9d3a3bc70> in 2.005381966360726 seconds.
2024-04-29T21:25:11+0000 Requesting UDP stream enable.
2024-04-29T21:25:21+0000 Exception while trying to enable stream: HTTPConnectionPool(host='192.168.12.5', port=8080): Max retries exceeded with url: /?Module=Services.PositionBroadcast&Chapter=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc9d3971250>: Failed to establish a new connection: [Errno 113] No route to host'))
2024-04-29T21:25:21+0000 Stopping all running sessions
2024-04-29T21:25:21+0000 Stopping session monitor
2024-04-29T21:25:21+0000 stop called for monitor
2024-04-29T21:25:21+0000 Unable to publish status. TransportLost. crossbar server likely unreachable.
2024-04-29T21:25:21+0000 monitor:0 Status is now "stopping".
2024-04-29T21:25:21+0000 Stopper for "monitor" terminated with ok=True and message (None,)
2024-04-29T21:25:21+0000 Stopping session broadcast
2024-04-29T21:25:21+0000 stop called for broadcast
2024-04-29T21:25:21+0000 Unable to publish status. TransportLost. crossbar server likely unreachable.
2024-04-29T21:25:21+0000 broadcast:1 Status is now "stopping".
2024-04-29T21:25:21+0000 Stopper for "broadcast" terminated with ok=True and message (None,)
2024-04-29T21:25:21+0000 Stopping session monitor_sun
2024-04-29T21:25:21+0000 transport connected
2024-04-29T21:25:21+0000 broadcast:1 Acquisition exited cleanly.
2024-04-29T21:25:21+0000 broadcast:1 Status is now "done".
2024-04-29T21:25:21+0000 (UDP Port 10002 Closed)
2024-04-29T21:25:21+0000 Stopping protocol <socs.agents.acu.agent.ACUAgent.broadcast.<locals>.MonitorUDP object at 0x7fc9d3971b50>
2024-04-29T21:25:21+0000 session joined: {'authextra': {'x_cb_node': 'e17019f77488-7',
               'x_cb_peer': 'tcp4:172.32.2.1:59246',
               'x_cb_pid': 14,
               'x_cb_worker': 'worker001'},
 'authid': 'W7TX-C5MC-STGH-TXHH-SMX3-373Y',
 'authmethod': 'anonymous',
 'authprovider': 'static',
 'authrole': 'iocs_agent',
 'realm': 'test_realm',
 'resumable': False,
 'resume_token': None,
 'resumed': False,
 'serializer': 'cbor.batched',
 'session': 2482334816892880,
 'transport': {'channel_framing': 'websocket',
               'channel_id': {},
               'channel_serializer': None,
               'channel_type': 'tcp',
               'http_cbtid': None,
               'http_headers_received': None,
               'http_headers_sent': None,
               'is_secure': False,
               'is_server': False,
               'own': None,
               'own_fd': -1,
               'own_pid': 7,
               'own_tid': 7,
               'peer': 'tcp4:127.0.0.1:8004',
               'peer_cert': None,
               'websocket_extensions_in_use': None,
               'websocket_protocol': None}}
2024-04-29T21:25:24+0000 stopping reactor
2024-04-29T21:25:24+0000 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 2482334816892880 before')
2024-04-29T21:25:24+0000 transport disconnected
2024-04-29T21:25:24+0000 waiting for reconnection
2024-04-29T21:25:24+0000 waiting at least 9.999998569488525 more seconds before giving up
2024-04-29T21:25:24+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fc9d3a3bc70> in 2.095843830224841 seconds.
2024-04-29T21:25:24+0000 Main loop terminated.
mhasself commented 2 months ago

Thanks. This bug is unlikely to activate in cases where the ACU is actually on ... so not as big a concern for the active platforms. But fix is coming.