simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License

Unavailable UPS agent causes monitor crash in HWP Supervisor #770

Open BrianJKoopman opened 1 day ago

BrianJKoopman commented 1 day ago

I was helping satp2 try to recover their HWP system this morning and found the supervisor agent in this state:

2024-09-26T17:09:50+0000 startup-op: launching monitor
2024-09-26T17:09:50+0000 start called for monitor
2024-09-26T17:09:50+0000 monitor:0 Status is now "starting".
2024-09-26T17:09:50+0000 startup-op: launching spin_control
2024-09-26T17:09:50+0000 start called for spin_control
2024-09-26T17:09:50+0000 spin_control:1 Status is now "starting".
2024-09-26T17:09:50+0000 monitor:0 Status is now "running".
2024-09-26T17:09:50+0000 spin_control:1 Status is now "running".
2024-09-26T17:09:55+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-ups-az.ops>'], {}]
2024-09-26T17:09:55+0000 Could not connect to client: power-ups-az
2024-09-26T17:09:55+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-iboot-hwp-2.ops>'], {}]
2024-09-26T17:09:56+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-iboot-hwp-2.ops>'], {}]
2024-09-26T17:09:56+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.acu.ops>'], {}]
2024-09-26T17:09:56+0000 Could not connect to client: power-iboot-hwp-2
2024-09-26T17:09:56+0000 monitor:0 CRASH: [Failure instance: Traceback: <class 'ValueError'>: Could not find upsOutputSource OID
/usr/lib/python3.10/threading.py:1016:_bootstrap_inner
/usr/lib/python3.10/threading.py:953:run
/opt/venv/lib/python3.10/site-packages/twisted/_threads/_threadworker.py:49:work
/opt/venv/lib/python3.10/site-packages/twisted/_threads/_team.py:192:doWork
--- <exception caught here> ---
/opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:269:inContext
/opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:285:<lambda>
/opt/venv/lib/python3.10/site-packages/twisted/python/context.py:117:callWithContext
/opt/venv/lib/python3.10/site-packages/twisted/python/context.py:82:callWithContext
/opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py:984:_running_wrapper
/opt/venv/lib/python3.10/site-packages/socs/agents/hwp_supervisor/agent.py:1374:monitor
/opt/venv/lib/python3.10/site-packages/socs/agents/hwp_supervisor/agent.py:442:update_ups_state
]
2024-09-26T17:09:56+0000 monitor:0 Status is now "done".

It seems it wasn't able to connect to any of the clients, so when the monitor goes to grab state info it hits this raise, which it doesn't handle: https://github.com/simonsobs/socs/blob/33b1e1d82d367829a9273222d374a0219b151801/socs/agents/hwp_supervisor/agent.py#L442
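For context, the failing pattern boils down to something like this (a condensed sketch with hypothetical structure, not the actual code at the link above):

```python
def update_ups_state(ups_oids: dict):
    """Condensed sketch of the failing pattern: when the UPS agent is
    unreachable, no OIDs come back, the lookup fails, and nothing in
    the monitor process catches the exception."""
    for oid, value in ups_oids.items():
        if 'upsOutputSource' in oid:
            return value
    # With the UPS agent down this always triggers; the ValueError
    # propagates up to ocs_agent's _running_wrapper (see traceback),
    # which marks the monitor process as CRASHed / "done".
    raise ValueError("Could not find upsOutputSource OID")
```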

EDIT: This was on socs image: v0.5.1-22-g7d2f158-dev

jlashner commented 23 hours ago

Thanks for this. The correct behavior is probably to catch this in the monitor_state process and mark the session as degraded... and also raise a flag to make sure none of the spin-up commands can run.
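Something along these lines, sketched as a method with hypothetical names (`ups_faulted` is made up here; `session.degraded` is the OCS mechanism for flagging a running process as degraded):

```python
import time

def monitor(self, session, params):
    """Sketch of the monitor process with the UPS query guarded, so a
    missing UPS degrades the session instead of crashing the process."""
    while session.status in ['starting', 'running']:
        try:
            self.hwp_state.update_ups_state()
            session.degraded = False
            self.ups_faulted = False
        except ValueError as e:
            # UPS agent unreachable: keep monitoring, but flag it.
            self.log.warn("UPS state unavailable: {e}", e=e)
            session.degraded = True
            self.ups_faulted = True  # checked before any spin-up command
        time.sleep(5)
    return True, "Monitor process exited"
```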

I think it might make sense to move the safety-check logic from the control-update function into properties of the HWPState object, such as spin_up_safe and grip_safe, that check internal state variables like this and return a bool. (I don't think UPS state is currently checked anywhere beforehand.)
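A rough sketch of the property idea; the field names are guesses and the real HWPState carries much more state than this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HWPState:
    """Illustrative fields only; not the real HWPState internals."""
    ups_output_source: Optional[str] = None  # None => UPS unknown/unreachable
    gripper_state: Optional[str] = None

    @property
    def spin_up_safe(self) -> bool:
        """Only allow spin-up when the UPS reports normal line power."""
        return self.ups_output_source == 'normal'

    @property
    def grip_safe(self) -> bool:
        """Only allow gripping when the gripper state is actually known."""
        return self.gripper_state is not None
```

Spin-control operations could then do a simple `if not self.hwp_state.spin_up_safe: return False, "UPS not healthy"` check before issuing any commands, and the monitor crash above would instead surface as `spin_up_safe == False`.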