pcdshub / lightpath

LCLS Lightpath Module
https://pcdshub.github.io/lightpath
Other
4 stars 9 forks source link

BUG: Handle unreachable PV's / devices better #152

Closed tangkong closed 1 year ago

tangkong commented 1 year ago

Expected Behavior

We should be able to configure the EpicsSignal default timeout at startup, or rely on an environment variable

Current Behavior

Currently we set the default EpicsSignal timeout to 10s, which worked around various TimeoutErrors and pyepics failed to send connection and access rights. This works fine when all the devices are online, but devices are actually disconnected, this can cause large slowdowns when starting up lightpath (or when switching beamlines).

Furthermore, switching to a beamline with disconnected devices seems to freeze the ui, as all disconnected devices time out at one of these function (find_device_state) calls in series before any device is shown in lightpath.

https://github.com/pcdshub/lightpath/blob/05a9394e09acf256649738f6233d54a7ba2b1b4d/lightpath/path.py#L83

Possible Solution

Look for the timeout setting in an env variable, or add a cli argument to specify it.

Also handle disconnected devices more gracefully.

Context

This was first encountered in #142, and brought up again in #150 while testing during a mini-shutdown against a switch that was powered off, disabling a large number of devices.

Your Environment

During testing of #150

tangkong commented 1 year ago

_Also see https://github.com/pcdshub/lightpath/pull/150#discussion_r954382394_

ZLLentz commented 1 year ago

The most important thing here is actually that devices that don't connect prevent the UI from updating, seemingly forever.

For example, switching to the QRIX line freezes the GUI. The default line right now (XPP) doesn't open while the TXI switch work is ongoing.

tangkong commented 1 year ago

It was observed that a single disconnected device could result in a huge number of disconnected device errors, thrown from find_device_state.

At every point where we ask for a LightRow widget to update its color, we consider its position in relation to the path as a whole. Because of this, if a disconnected device is the impediment, lightpath will call find_device_state on it for each device, causing the unexpectedly large number of connection failures.

While this may not be a new problem, it's certainly exacerbated by the switch to considering the whole path for each widget.

Thoughts on possible solutions:

ZLLentz commented 1 year ago

It should be pretty quick to check that a device is disconnected, right? (~device~ lightpath_signal.connected attribute). Can we mark these with an Unknown/Disconnected state and move on? I'm not sure whether unknown devices are supposed to count as blocking or not. Hypothetically these could be Unknown/Disconnected and be marked as... not blocking? (Until they connect and we can get a real update?)

tangkong commented 1 year ago

You're right in that checking the connection status explicitly should be fast. We just catch an exception, rather than actually checking the connection status. This leads to a delay that is made worse by the timeout setting I mentioned earlier, if I remove that we get a the errors come in a lot faster.

Maybe we just add a connection check to find_device_state before trying to get the value. That does sound easiest.

We currently mark unknown devices as blocking, and I think I'm happy with that behavior. I suppose the disconnected status could be obvious enough with the new coloring.