simonsobs / socs

Simons Observatory specific OCS agents.

HWPSupervisor crashes on startup when using synaccess for `driver-iboot-id` #691

Closed: BrianJKoopman closed this issue 2 weeks ago

BrianJKoopman commented 3 weeks ago

On satp3, after maintenance, they're having trouble starting up the HWP Supervisor agent. It crashes immediately with:

Args: ['--instance-id', 'hwp-supervisor', '--site-hub', 'ws://127.0.0.1:8005/ws', '--site-http', 'http://127.0.0.1:8005/call']
Installed OCS Plugins: ['socs', 'ocs']
Renaming this process to: "ocs-agent:hwp-supervisor"
2024-06-11T20:37:41+0000 Using OCS version 0.11.0
2024-06-11T20:37:41+0000 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ satp3.hwp-supervisor
2024-06-11T20:37:41+0000 log_file is apparently None
2024-06-11T20:37:41+0000 Setting state: ControlState.Idle()
2024-06-11T20:37:41+0000 transport connected
2024-06-11T20:37:41+0000 session joined: {'authextra': {'x_cb_node': '4e2f32e9b379-6',
               'x_cb_peer': 'tcp4:172.19.0.1:53870',
               'x_cb_pid': 13,
               'x_cb_worker': 'worker001'},
 'authid': 'XGSA-KE73-GR6Y-QNXJ-JNFL-WL4V',
 'authmethod': 'anonymous',
 'authprovider': 'static',
 'authrole': 'iocs_agent',
 'realm': 'test_realm',
 'resumable': False,
 'resume_token': None,
 'resumed': False,
 'serializer': 'cbor.batched',
 'session': 3582372078905339,
 'transport': {'channel_framing': 'websocket',
               'channel_id': {},
               'channel_serializer': None,
               'channel_type': 'tcp',
               'http_cbtid': None,
               'http_headers_received': None,
               'http_headers_sent': None,
               'is_secure': False,
               'is_server': False,
               'own': None,
               'own_fd': -1,
               'own_pid': 8,
               'own_tid': 8,
               'peer': 'tcp4:127.0.0.1:8005',
               'peer_cert': None,
               'websocket_extensions_in_use': None,
               'websocket_protocol': None}}
2024-06-11T20:37:41+0000 startup-op: launching monitor
2024-06-11T20:37:41+0000 start called for monitor
2024-06-11T20:37:41+0000 monitor:0 Status is now "starting".
2024-06-11T20:37:41+0000 startup-op: launching spin_control
2024-06-11T20:37:41+0000 start called for spin_control
2024-06-11T20:37:41+0000 spin_control:1 Status is now "starting".
2024-06-11T20:37:41+0000 monitor:0 Status is now "running".
2024-06-11T20:37:41+0000 spin_control:1 Status is now "running".
2024-06-11T20:37:44+0000 monitor:0 CRASH: [Failure instance: Traceback: <class 'KeyError'>: 'outletStatus_4'
/usr/lib/python3.8/threading.py:932:_bootstrap_inner
/usr/lib/python3.8/threading.py:870:run
/usr/local/lib/python3.8/dist-packages/twisted/_threads/_threadworker.py:49:work
/usr/local/lib/python3.8/dist-packages/twisted/_threads/_team.py:192:doWork
--- <exception caught here> ---
/usr/local/lib/python3.8/dist-packages/twisted/python/threadpool.py:269:inContext
/usr/local/lib/python3.8/dist-packages/twisted/python/threadpool.py:285:<lambda>
/usr/local/lib/python3.8/dist-packages/twisted/python/context.py:117:callWithContext
/usr/local/lib/python3.8/dist-packages/twisted/python/context.py:82:callWithContext
/usr/local/lib/python3.8/dist-packages/ocs/ocs_agent.py:984:_running_wrapper
/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py:1107:monitor
/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py:135:update
/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py:136:<dictcomp>
]
2024-06-11T20:37:44+0000 monitor:0 Status is now "done".

I haven't dug into this too deeply, but it seems like the label it's looking for, 'outletStatus_4', is the syntax used when the driver power agent is an IBootBar agent, not a synaccess agent.

I know we added the ability to select remote PDU type in https://github.com/simonsobs/socs/pull/653, but maybe we're still hitting some edge case that misses support for this?
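
To make the suspected mismatch concrete, here is a minimal sketch of how a hard-coded IBootBar-style label would raise this `KeyError` against synaccess-style data. It is not the actual `hwp_supervisor` code: `outlet_labels`, `read_outlet_states`, the `agent_type` values, and in particular the synaccess label format (a bare outlet number) are assumptions made for illustration. Only the `outletStatus_4` label comes from the traceback above.

```python
from typing import Dict, List, Optional


def outlet_labels(agent_type: str, outlets: List[int]) -> Dict[int, str]:
    """Map outlet numbers to the field labels expected for a PDU agent type."""
    if agent_type == "iboot":
        # Label style seen in the traceback above, e.g. 'outletStatus_4'.
        return {o: f"outletStatus_{o}" for o in outlets}
    if agent_type == "synaccess":
        # ASSUMPTION: placeholder format; the real synaccess labels may differ.
        return {o: str(o) for o in outlets}
    raise ValueError(f"Unknown PDU agent type: {agent_type!r}")


def read_outlet_states(
    fields: Dict[str, dict], labels: Dict[int, str]
) -> Dict[int, Optional[int]]:
    """Look up each outlet's status, returning None when a label is missing."""
    states: Dict[int, Optional[int]] = {}
    for outlet, label in labels.items():
        entry = fields.get(label)  # tolerate a missing label instead of raising
        states[outlet] = None if entry is None else entry.get("status")
    return states


# Hypothetical session data as a synaccess-style agent might report it.
synaccess_fields = {"4": {"status": 1}}

# A strict lookup with IBootBar labels reproduces the KeyError from the log.
iboot_labels = outlet_labels("iboot", [4])
try:
    {o: synaccess_fields[label]["status"] for o, label in iboot_labels.items()}
    missing = None
except KeyError as err:
    missing = err.args[0]

print(missing)
print(read_outlet_states(synaccess_fields, outlet_labels("synaccess", [4])))
```

If something like this is what is happening, choosing the labels from the configured PDU type and tolerating missing keys would let the monitor process keep running and report an unknown outlet state instead of crashing.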

ykyohei commented 3 weeks ago

OK, satp3-supervisor was working before the maintenance because the hwp-supervisor agent was brought up while the synaccess agent was down,