simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License
12 stars 12 forks source link

`run_and_validate()` in HWP Supervisor doesn't check command ran successfully #631

Open BrianJKoopman opened 7 months ago

BrianJKoopman commented 7 months ago

I noticed that run_and_validate() doesn't actually "validate" that the command ran successfully, just that the initial OCS call goes through.

As written, it looks like Tasks could fail and the action would just move on to the next task call. session['success'] would need to be checked to "validate" each task is complete.

_Originally posted by @BrianJKoopman in https://github.com/simonsobs/socs/pull/628#discussion_r1488300484_

ykyohei commented 5 months ago

I think this is related to this issue. I pasted the two examples error log in satp3 below.

Sometimes supervisor functions fail, and failure of supervisor task do not stop the schedule. It is difficult to notice the error right now. I think it's better to stop the schedule when the supervisor task fails.

2024-04-13T11:03:31+0000 pmx_off:3 Status is now "done".
2024-04-13T11:03:31+0000 pmx_off:3 Completed with state: ControlState.Error(traceback='Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 755, in update\n    self.run_and_validate(clients.pid.declare_freq,\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate\n    raise ControlClientError("op-wait timed out")\nocs.client_http.ControlClientError: op-wait timed out\n', start_time=1713006210.634159)

2024-04-13T11:03:30+0000 Setting state: ControlState.Error(traceback='Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 755, in update\n    self.run_and_validate(clients.pid.declare_freq,\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate\n    raise ControlClientError("op-wait timed out")\nocs.client_http.ControlClientError: op-wait timed out\n', start_time=1713006210.634159)
ocs.client_http.ControlClientError: op-wait timed out
    raise ControlClientError("op-wait timed out")
  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate
    self.run_and_validate(clients.pid.declare_freq,
  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 755, in update
Traceback (most recent call last):
2024-04-13T11:03:30+0000 Error updating state:
2024-04-13T11:03:00+0000 Starting op: name=declare_freq, kwargs={'freq': 0}
2024-04-13T11:03:00+0000 Completed op: name=set_off, success=True, kwargs={}
2024-04-13T11:02:59+0000 Starting op: name=set_off, kwargs={}
2024-04-13T11:02:59+0000 Setting state: ControlState.PmxOff(success=True)
2024-04-16T15:28:38+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:28:36+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:28:34+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:28:32+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:28:30+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp3.hwp-pid.ops>'], {}]
2024-04-16T15:21:25+0000 pid_to_freq:4 Status is now "done".

2024-04-16T15:21:25+0000 pid_to_freq:4 Completed with state: ControlState.Error(traceback='Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 653, in update\n    self.run_and_validate(clients.pid.set_direction,\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate\n    raise ControlClientError("op-wait timed out")\nocs.client_http.ControlClientError: op-wait timed out\n', start_time=1713280884.8706326)
2024-04-16T15:21:24+0000 Setting state: ControlState.Error(traceback='Traceback (most recent call last):\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 653, in update\n    self.run_and_validate(clients.pid.set_direction,\n  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate\n    raise ControlClientError("op-wait timed out")\nocs.client_http.ControlClientError: op-wait timed out\n', start_time=1713280884.8706326)
ocs.client_http.ControlClientError: op-wait timed out
    raise ControlClientError("op-wait timed out")
  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 633, in run_and_validate
    self.run_and_validate(clients.pid.set_direction,
  File "/usr/local/lib/python3.8/dist-packages/socs/agents/hwp_supervisor/agent.py", line 653, in update
Traceback (most recent call last):
2024-04-16T15:21:24+0000 Error updating state:
2024-04-16T15:20:54+0000 Starting op: name=set_direction, kwargs={'direction': '0'}
2024-04-16T15:20:54+0000 Setting state: ControlState.PIDToFreq(target_freq=2.0, direction='0', freq_tol=0.05, freq_tol_duration=10.0)
2024-04-16T15:20:54+0000 pid_to_freq:4 Status is now "starting".
2024-04-16T15:20:54+0000 start called for pid_to_freq
ykyohei commented 5 months ago

Once we fix this issue, I think I can test this in satp3 by intentionally turn off one of the hwp-agent and run the supervisor task.

jlashner commented 5 months ago

Hi Kyohei, do you mind posting the code with how you are running supervisor tasks?

If you are running it through a hwp-supervisor client as opposed to through sorunlib, you need to manually check if the session finished successfully, since a failed agent operation will not otherwise raise an error in the client script. For instance, everything in sorunlib uses this check_response function to make sure the task finished successfully.

If you are already using sorunlib directly, than we need to figure out why this is not failing, since I'm pretty sure when an operation enters an Error state like in your examples above, session['success'] should already be false.

ykyohei commented 5 months ago

Hi Jack, I was running through hwp-supervisor client as follows.

sup = OCSClient('hwp-supervisor')

sup.pid_to_freq(target_freq = 2.0)
sup.pmx_off()

I actually didn't know that it is possible to run these from sorunlib. I will try again through sorunlib. https://github.com/simonsobs/sorunlib/blob/7170fb9b62e8c29adb34912dbbd86f9521e672eb/src/sorunlib/hwp.py If I run these commands through sorunlib, and if one of the agent operations fails, should it raise an error and stop the schedule? And if I run through hwp-supervisor client, I should manually do something like as follows?

_, _, session = sup.pid_to_freq(target_freq = 2.0)
assert session['success']

I'm confused about the point of this issue now. We should run the supervisor task through sorunlib, and as long as we do so, this issue that Brian points out is not a real issue?

Also, do you think we should implement gripper.grip, gripper.ungrip, supervisor.enable_driver_board, and supervisor.disable_driver_board in sorunlib for the same reason?

jlashner commented 5 months ago

yep, asserting session['success'] should work!

Brian is correctly pointing out that this needs to be asserted wherever client operations are called, but the hwp-supervisor is not checking when it calls out to other agents.

This needs to be done both in the supervisor agent, and also any client script running in nextline.

jlashner commented 5 months ago

I think implementing these functions in sorunlib is a good idea to minimize the possibility of people making mistakes when calling them, but it isn't strictly necessary to get it to run.