simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License

SMuRF Crate Monitor: Slow Read Rate + Crashes when slots deactivated #391

Open msilvafe opened 1 year ago

msilvafe commented 1 year ago

I would like to document three issues/desired upgrades to the smurf_crate_monitor:

  1. Slow Read Rate
  2. Agent crashes when slots are deactivated
  3. No boolean sensors (i.e. internal crate alarms) are returned

1. Slow Read Rate

Currently the crate monitor reads out once every 30 sec; see these lines: https://github.com/simonsobs/socs/blob/702af44481d1cdbca34e3c217b73e51ff59fdb8f/socs/agents/smurf_crate_monitor/agent.py#L247-L250

However, this is pretty slow when actively debugging thermal or fan issues with the crate. There is a limit to how fast we can poll through this ssh connection. In an earlier version of the code I estimated that limit by rerunning the polling command a few times in a row, timing it, and using the result as the fastest allowed poll rate. I think we probably want to reinstate something like this: the user inputs a sample rate, and if it exceeds the fastest poll rate, the agent returns a warning and sets it to the maximum poll rate.
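A minimal sketch of what that clamping could look like, not the agent's actual code. The function names `measure_min_period` and `resolve_sample_period`, and the `poll` callable standing in for the ssh polling command, are hypothetical.

```python
import logging
import time

log = logging.getLogger("crate_monitor_sketch")  # hypothetical logger name


def measure_min_period(poll, n_trials=3):
    """Run `poll` a few times and return the slowest observed duration in seconds.

    `poll` is a stand-in for whatever callable issues the ssh polling command.
    """
    durations = []
    for _ in range(n_trials):
        start = time.monotonic()
        poll()
        durations.append(time.monotonic() - start)
    # Use the slowest trial so the limit is conservative.
    return max(durations)


def resolve_sample_period(requested_rate_hz, min_period_s):
    """Return the sampling period in seconds, clamped to the fastest feasible rate."""
    if requested_rate_hz <= 0:
        raise ValueError("requested_rate_hz must be positive")
    requested_period = 1.0 / requested_rate_hz
    if requested_period < min_period_s:
        log.warning(
            "Requested rate %.3f Hz exceeds the maximum poll rate %.3f Hz; "
            "using the maximum instead.",
            requested_rate_hz,
            1.0 / min_period_s,
        )
        return min_period_s
    return requested_period
```

With a measured minimum period of 0.5 s, a request for 10 Hz would log a warning and fall back to a 0.5 s period, while a request for 0.5 Hz would be honored as a 2 s period.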

2. Agent crashes when slots are deactivated

During initialization of the agent, the shelf manager is polled for all available sensors and the field names are populated from the result. Sensors are only returned/available for slots that are activated (powered on). If a slot in the crate is deactivated after initialization, the feeds are no longer published (which is ok). But if a slot is activated after the agent is initialized, the agent crashes because it tries to publish field values that weren't set up when the agent was started. I think we should modify the behavior to check whether the field name list includes a new set of fields and, if so, add those.

3. No boolean sensors

Currently we just search through the sensor list returned by the shelf manager for sensors with numerical values (voltages, currents, temperatures, fan speeds) and only register those sensors to publish to the feed. However, there are many additional sensors that are just binary and reserved for alarms. We didn't originally include these because there are many of them, the names returned from the shelf manager were not very clear as to what they referred to, and there's no reference for them in any manuals. However, we should have the option to record these so that we can keep track of all alarm states from the crates and pass this history on to Comtel (the crate manufacturer) in the case of any hardware or software failures that require their attention for repair.
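One way the opt-in could look, as a rough sketch only. It assumes the shelf manager output has already been parsed into a dict of sensor name to value, with alarm sensors represented as Python `bool`; the function name `select_fields`, the `include_boolean` flag, and the sensor names in the example are hypothetical.

```python
def select_fields(readings, include_boolean=False):
    """Pick which parsed sensor readings to publish.

    readings: dict mapping sensor name -> value. Numeric sensors hold
    int/float, alarm sensors hold bool (an assumption of this sketch).
    """
    fields = {}
    for name, value in readings.items():
        # bool is a subclass of int in Python, so test for it first.
        if isinstance(value, bool):
            if include_boolean:
                # Store alarms as 0/1 so they sit alongside numeric fields.
                fields[name] = int(value)
        elif isinstance(value, (int, float)):
            fields[name] = float(value)
        # Anything else (e.g. unparsed strings) is skipped.
    return fields


readings = {"inlet_temp": 31.5, "fan_tray_alarm": True, "status": "ok"}
numeric_only = select_fields(readings)
with_alarms = select_fields(readings, include_boolean=True)
```

Keeping the flag off by default would preserve the current behavior for anyone who does not want the extra alarm fields.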

BrianJKoopman commented 1 year ago

> 2. Agent crashes when slots are deactivated
>
> During initialization of the agent, the shelf manager is polled for all available sensors and the field names are populated from the result. Sensors are only returned/available for slots that are activated (powered on). If a slot in the crate is deactivated after initialization, the feeds are no longer published (which is ok). But if a slot is activated after the agent is initialized, the agent crashes because it tries to publish field values that weren't set up when the agent was started. I think we should modify the behavior to check whether the field name list includes a new set of fields and, if so, add those.

It sounds like the block structure is changing within the data dict that's being published to the feed. The solution here is to split the slots into different blocks and publish them separately (still to the same feed) so they can be added or dropped dynamically.
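A rough sketch of the per-slot split, assuming readings arrive keyed by `(slot, sensor)` and that each message carries `block_name`, `timestamp`, and `data` keys as OCS feed messages do, to my understanding. The function name `build_slot_blocks` and the `slot{N}_{sensor}` field naming are hypothetical, not taken from the agent.

```python
import time


def build_slot_blocks(readings, timestamp=None):
    """Group {(slot, sensor): value} readings into one message per slot.

    Each slot gets its own block, so a slot that appears or disappears
    only adds or removes a block rather than changing an existing one.
    """
    if timestamp is None:
        timestamp = time.time()
    per_slot = {}
    for (slot, sensor), value in readings.items():
        # Hypothetical field naming: prefix each sensor with its slot.
        per_slot.setdefault(slot, {})[f"slot{slot}_{sensor}"] = value
    return [
        {"block_name": f"slot{slot}", "timestamp": timestamp, "data": data}
        for slot, data in sorted(per_slot.items())
    ]


blocks = build_slot_blocks(
    {(2, "temp"): 30.0, (4, "temp"): 31.0, (2, "fan"): 1200},
    timestamp=0.0,
)
```

Each returned message would then be published on its own to the same feed, one publish call per block, so only the slots present in a given poll are sent.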