simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License

HWP Network Monitor agent #756

Open jlashner opened 2 days ago

jlashner commented 2 days ago

One part of the HWP shutdown process that we discussed is having an agent in charge of managing the HWP iBootBars during extended network outages. I will start implementing this, but I have a couple of related questions...

@ykyohei, @bbixler500, do you happen to know which iBootBar outlets are used for the PMX and the LED driver board on each telescope, or where that's recorded? Also, do you have a sense of roughly how long a network outage can last before each of these needs to be turned off?

@BrianJKoopman is there a summary anywhere with info on the work you did about agent zombie processes, and trying to get agent processes to keep running after a crossbar disconnect?

Thanks!

bbixler500 commented 2 days ago

The outlet information is on the general HWP Confluence page here. The driver board has two Acopian power supplies, which are +5V and -10V. As for a standard time to wait during network outages, I don't really have a number in mind. We have had minor outages while running scans in the past, and the HWP kept rotating through them, so I wouldn't want the threshold to be too short.
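To make the threshold idea concrete, here is a minimal sketch of the timing logic only, assuming the agent polls connectivity periodically. `OutageTracker`, `threshold_s`, and the injected `clock` are hypothetical names, not part of socs or ocs, and the threshold value is a placeholder because no number has been agreed on in this thread.

```python
import time


class OutageTracker:
    """Track how long the network has been continuously unreachable.

    Hypothetical helper for illustration; not part of socs or ocs.
    """

    def __init__(self, threshold_s, clock=time.monotonic):
        self.threshold_s = threshold_s  # placeholder; no value agreed yet
        self._clock = clock             # injectable so the logic can be tested
        self._outage_start = None       # None while the network is up

    def update(self, network_ok):
        """Record one connectivity check.

        Returns True once the current outage has lasted at least
        ``threshold_s`` seconds, and False otherwise.
        """
        now = self._clock()
        if network_ok:
            # A successful check resets the timer, so short dropouts
            # never accumulate toward the threshold.
            self._outage_start = None
            return False
        if self._outage_start is None:
            self._outage_start = now
        return (now - self._outage_start) >= self.threshold_s
```

An agent loop could call `update()` after each connectivity check and only switch off the relevant outlets when it returns True. How connectivity is checked and how the outlets are switched is left out here on purpose.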

BrianJKoopman commented 1 day ago

> @BrianJKoopman is there a summary anywhere with info on the work you did about agent zombie processes, and trying to get agent processes to keep running after a crossbar disconnect?

Yup! Documentation for the connection timeout is in the ocs site config docs.

This can also be passed as the environment variable CROSSBAR_TIMEOUT, which is useful for the Docker containers. That's a bit hidden, but it is in the ocs-agent-cli docs.
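For anyone who wants to check which value a container will see, here is a small sketch that reads the variable from the environment. It is not how ocs parses it internally. `read_crossbar_timeout` is a hypothetical helper, and treating the value as a number of seconds is an assumption that should be checked against the ocs docs.

```python
import os


def read_crossbar_timeout(environ=None, default=None):
    """Return CROSSBAR_TIMEOUT as a float, or ``default`` if it is unset.

    Hypothetical helper for illustration; not part of ocs. The unit
    (seconds) is an assumption.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CROSSBAR_TIMEOUT")
    if raw is None or raw.strip() == "":
        return default
    return float(raw)


if __name__ == "__main__":
    # Prints None when the variable is not set in this environment.
    print(read_crossbar_timeout())
```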