simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License

Unavailable UPS agent causes monitor crash in HWP Supervisor #770

Open BrianJKoopman opened 1 day ago

BrianJKoopman commented 1 day ago

I was helping satp2 try to recover their HWP system this morning and found the supervisor agent in this state:

2024-09-26T17:09:50+0000 startup-op: launching monitor
2024-09-26T17:09:50+0000 start called for monitor
2024-09-26T17:09:50+0000 monitor:0 Status is now "starting".
2024-09-26T17:09:50+0000 startup-op: launching spin_control
2024-09-26T17:09:50+0000 start called for spin_control
2024-09-26T17:09:50+0000 spin_control:1 Status is now "starting".
2024-09-26T17:09:50+0000 monitor:0 Status is now "running".
2024-09-26T17:09:50+0000 spin_control:1 Status is now "running".
2024-09-26T17:09:55+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-ups-az.ops>'], {}]
2024-09-26T17:09:55+0000 Could not connect to client: power-ups-az
2024-09-26T17:09:55+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-iboot-hwp-2.ops>'], {}]
2024-09-26T17:09:56+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-iboot-hwp-2.ops>'], {}]
2024-09-26T17:09:56+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.acu.ops>'], {}]
2024-09-26T17:09:56+0000 Could not connect to client: power-iboot-hwp-2
2024-09-26T17:09:56+0000 monitor:0 CRASH: [Failure instance: Traceback: <class 'ValueError'>: Could not find upsOutputSource OID
/usr/lib/python3.10/threading.py:1016:_bootstrap_inner
/usr/lib/python3.10/threading.py:953:run
/opt/venv/lib/python3.10/site-packages/twisted/_threads/_threadworker.py:49:work
/opt/venv/lib/python3.10/site-packages/twisted/_threads/_team.py:192:doWork
--- <exception caught here> ---
/opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:269:inContext
/opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:285:<lambda>
/opt/venv/lib/python3.10/site-packages/twisted/python/context.py:117:callWithContext
/opt/venv/lib/python3.10/site-packages/twisted/python/context.py:82:callWithContext
/opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py:984:_running_wrapper
/opt/venv/lib/python3.10/site-packages/socs/agents/hwp_supervisor/agent.py:1374:monitor
/opt/venv/lib/python3.10/site-packages/socs/agents/hwp_supervisor/agent.py:442:update_ups_state
]
2024-09-26T17:09:56+0000 monitor:0 Status is now "done".

It seems it wasn't able to connect to any of the clients, so when the monitor goes to grab state info it hits this raise, which it doesn't handle: https://github.com/simonsobs/socs/blob/33b1e1d82d367829a9273222d374a0219b151801/socs/agents/hwp_supervisor/agent.py#L442
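For context, the failing pattern boils down to something like this (a condensed sketch with hypothetical structure, not the actual code at the link above):

```python
def update_ups_state(ups_oids: dict):
    """Condensed sketch of the failing pattern: when the UPS agent is
    unreachable, no OIDs come back, the lookup fails, and nothing in
    the monitor process catches the exception."""
    for oid, value in ups_oids.items():
        if 'upsOutputSource' in oid:
            return value
    # With the UPS agent down this always triggers; the ValueError
    # propagates up to ocs_agent's _running_wrapper (see traceback),
    # which marks the monitor process as CRASHed / "done".
    raise ValueError("Could not find upsOutputSource OID")
```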

EDIT: This was on socs image: v0.5.1-22-g7d2f158-dev

jlashner commented 23 hours ago

Thanks for this. The correct behavior is probably to catch this in the monitor_state process and mark the session as degraded... and also raise a flag to make sure none of the spin-up commands can run.
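Something along these lines, sketched as a method with hypothetical names (`ups_faulted` is made up here; `session.degraded` is the OCS mechanism for flagging a running process as degraded):

```python
import time

def monitor(self, session, params):
    """Sketch of the monitor process with the UPS query guarded, so a
    missing UPS degrades the session instead of crashing the process."""
    while session.status in ['starting', 'running']:
        try:
            self.hwp_state.update_ups_state()
            session.degraded = False
            self.ups_faulted = False
        except ValueError as e:
            # UPS agent unreachable: keep monitoring, but flag it.
            self.log.warn("UPS state unavailable: {e}", e=e)
            session.degraded = True
            self.ups_faulted = True  # checked before any spin-up command
        time.sleep(5)
    return True, "Monitor process exited"
```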

I think it might make sense to move the safety-check logic from the control-update function into properties of the HWPState object, such as spin_up_safe and grip_safe, that check internal state variables like this and return a bool. (I don't think UPS state is currently checked anywhere beforehand.)
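A rough sketch of the property idea; the field names are guesses and the real HWPState carries much more state than this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HWPState:
    """Illustrative fields only; not the real HWPState internals."""
    ups_output_source: Optional[str] = None  # None => UPS unknown/unreachable
    gripper_state: Optional[str] = None

    @property
    def spin_up_safe(self) -> bool:
        """Only allow spin-up when the UPS reports normal line power."""
        return self.ups_output_source == 'normal'

    @property
    def grip_safe(self) -> bool:
        """Only allow gripping when the gripper state is actually known."""
        return self.gripper_state is not None
```

Spin-control operations could then do a simple `if not self.hwp_state.spin_up_safe: return False, "UPS not healthy"` check before issuing any commands, and the monitor crash above would instead surface as `spin_up_safe == False`.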