project8 / dragonfly

a dripline-based slow-control implementation

EthernetProvider failure mode #54

Closed: wcpettus closed this issue 7 years ago

wcpettus commented 8 years ago

EthernetProvider failures repeatedly spamming slack are irritating and potentially dangerous if they hide another error message. At some point, the service should probably just crash with warnings sent to slack, and the prologix boxes need to be smart enough to pass the crash on to the appropriate repeater and not crash the longmorn/wolfburn/lagavulin provider.

laroque commented 8 years ago

A more general solution to the slack irritation could be integrated with #56.

Maybe this is obvious already, but it seems reasonable that reconnect could cache the timestamps of the most recent N reconnect attempts. If reconnect() is called more than N times in the last M minutes, then alert and crash.
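
To make that concrete, here is a minimal, untested sketch of the bookkeeping I mean (class name, method names, and thresholds are all just illustrative, not an existing dragonfly API):

```python
import time


class ReconnectTracker(object):
    """Illustrative only: cache recent reconnect attempts and give up if they come too fast."""

    def __init__(self, max_attempts=5, window_minutes=10):
        self.max_attempts = max_attempts
        self.window_seconds = window_minutes * 60.0
        self._attempts = []

    def record_attempt(self):
        """Call at the top of reconnect(); raise if the rate limit is exceeded."""
        now = time.time()
        # keep only the attempts inside the window
        self._attempts = [t for t in self._attempts if now - t < self.window_seconds]
        self._attempts.append(now)
        if len(self._attempts) > self.max_attempts:
            # alert (logger.critical -> slack) and let the service crash
            raise RuntimeError('{} reconnect attempts in the last {:.0f} s; giving up'.format(
                len(self._attempts), self.window_seconds))
```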

The dependent-services question is harder; it is a design feature that services aren't able to track the state of other services (including when they go up or come down). I believe that the correct behavior is that when one service fails while trying to communicate with another service, the failing service determines how it responds. Hardware services should probably crash when their prologix box is down; the entire set can then be restarted once the box is back (once supervisord is deployed that becomes a trivial step). There may be other services which implement logic on multiple targets and which should not crash, so that they can continue with their other duties.

wcpettus commented 8 years ago

In the current version of Ethernet/Repeater providers, this is the handling of a disconnected instrument:

lockin (or other GPIB):

ardbeg (or any provider that directly uses EthernetProvider):

Attempting to start a service to a disconnected instrument:

lockin (or other GPIB):

ardbeg (or any provider that directly uses EthernetProvider):

A weird failure mode for ardbeg:

wcpettus commented 8 years ago

A couple proposals:

If we are making slack smarter about handling the error messages, there is no real need for the services to actually crash when they go unresponsive. But that can also be considered.

laroque commented 8 years ago

I definitely like the first one (I think I discussed this when the upgrades to the ethernet provider were started, but maybe I misremember or maybe it was forgotten), and the second at least for hard dependencies like RepeaterProvider (it's less sensible for something like the daq interface, which may have lots of nested dependencies and whatnot).

An unhandled DriplineException should already result in a logger.critical; most, however, are handled and may or may not do so. I agree that the Repeater could/should be looking at the return code. In some cases the error should be passed back to the user (for example, if the instrument reports an error, there's nothing wrong with the repeater but the user will want to know). Others, like the repeater being unable to communicate with its target, should result in local errors for the repeater.

I like crashing over going "inactive" because it seems more concrete. What would an "inactive" service do? Does it stop its internal logging; does it reject incoming requests; are there some things it still responds to; is there some way to make it "active" again other than restarting it? Do we have any particular cases where a restart is slow and/or does not get us back to the proper state? (if so then that's probably a cause for concern, since we do want to be sure things are in a consistent and easily reproducible state)

wcpettus commented 8 years ago

The current behavior of EthernetProvider is that every new command will trigger a reconnect attempt, but the service will never crash. For most instruments we are logging multiple endpoints every 30 sec, resulting in a never-ending stream of cannot-connect warnings (but the stream will be slow because the reconnect timeout is long?). For a few instruments like ardbeg without automatic logging, we will only fail as many times as a user (or a service like ESR) pushes commands. And for instruments communicating through GPIB, things happen faster because the prologix reconnect is faster, so we can get more errors.

The more fundamental question is, what do we gain by killing the service vs allowing it to continue running and continue trying to reconnect (hopefully with suppressed error messaging)? Is there ever a case where an instrument would lose connectivity for 5/10/60 minutes and then a reconnect would be successful? If so, allowing the service to continue running is helpful. If not, it's just repeatedly hitting the same wall.

wcpettus commented 8 years ago

I checked the repeated failure modes today:

If EthernetProvider has an "is_repeater" flag for the prologix services, we can trigger different warning behavior for the prologix and send the warning back to the GPIB instruments where noise suppression can be done.

laroque commented 8 years ago

If a "direct ethernet provider" instrument is later able to reconnect, does it everything start working again from the user's perspective? If things start to work, but then fail again, will it issue the warning again? If so the first two bullet points seem like the perfect behavior, and like what we want to duplicate for GPIB devices that require a repeater.

Why can't the prologix provider, when it has trouble talking to the prologix box, send a ReplyMessage with an appropriate returncode (203, "Hardware Connection Error", seems right)? Then the Repeater can re-raise that exception when it gets the reply, and the endpoint is free to handle that exception or crash.

It seems like the returncode can serve an equivalent purpose to the "is_repeater" flag, without breaking the layered nature of the system. If it isn't doing so already, the direct connection should probably use similar behavior, catching the socket error and returning a code 203 to the user.
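
As a rough sketch of what I mean (placeholder names, and a plain dict standing in for the real dripline ReplyMessage machinery, so nothing here is the actual API):

```python
import socket

# placeholder constant; 203 is the "Hardware Connection Error" returncode discussed above
RETCODE_HARDWARE_CONNECTION_ERROR = 203


def prologix_send(sock, command):
    """Send a command to the prologix box; map socket failures onto retcode 203."""
    try:
        sock.sendall((command + '\n').encode())
        return {'retcode': 0, 'payload': sock.recv(1024).decode()}
    except socket.error as err:
        # don't alert from here; hand the failure back up to the caller
        return {'retcode': RETCODE_HARDWARE_CONNECTION_ERROR,
                'payload': 'prologix connection error: {}'.format(err)}


def repeater_send(sock, command):
    """Repeater side: re-raise the connection error so its endpoint can handle it or crash."""
    reply = prologix_send(sock, command)
    if reply['retcode'] == RETCODE_HARDWARE_CONNECTION_ERROR:
        raise RuntimeError(reply['payload'])
    return reply['payload']
```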

wcpettus commented 8 years ago

If a "direct ethernet provider" instrument is later able to reconnect, does it everything start working again from the user's perspective? Yes. The current EthernetProvider only issues the logger.critical message from the get method, it can only reach the get method if there is a valid socket connected, otherwise the send_command method will raise a socket.error when it tries to socket.send. Caveat - In the upcoming fix of EthernetProvider, I want to tweak the error message so that it tells the user if the reconnect is successful or not. So I need to not break the above behavior.

I'm not sure about the proper use of the "is_repeater" flag; I've rehashed this three times and gotten three different answers. Needing a reconnect is bad behavior for any EthernetProvider service, so we want to know if we ever trigger that method. But how do we get the right alert behavior between

My current best solution is an "is_repeater" flag, so that if a prologix goes to the reconnect block and successfully reconnects, it returns an error code to the RepeaterProvider to attempt its own "reconnect". If the higher-level check_connection is successful, it will resend the command and all should be well; it was just a glitch. If the higher-level check_connection fails, it's probably a GPIB instrument issue, which needs a "dead_connection" flag raised in the repeater service and gets a different Slack alert. The three cases will look like:

RepeaterProvider will need a check_connection method which is executed at init and is triggered whenever the dead_connection flag gets raised. The dead_connection flag should block excessive Slack warnings from the GPIB instrument service.
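
Roughly how I picture the repeater side (untested sketch; the method names and dict-style reply are hypothetical and only meant to show the flag logic, not the actual implementation):

```python
import logging

logger = logging.getLogger(__name__)


class RepeaterProvider(object):
    """Untested sketch: dead_connection / check_connection logic only."""

    def __init__(self, prologix):
        self.prologix = prologix        # assumed to expose send() and instrument_responds()
        self.dead_connection = False
        self.check_connection()         # run once at init

    def check_connection(self):
        """Ask the prologix box to talk to the GPIB instrument and update the flag."""
        self.dead_connection = not self.prologix.instrument_responds()
        return not self.dead_connection

    def send(self, command):
        if self.dead_connection:
            # connection already known to be dead: fail locally, no fresh slack alert
            raise RuntimeError('instrument connection marked dead')
        reply = self.prologix.send(command)
        if reply.get('retcode') == 203:  # the prologix had to go through its reconnect block
            if self.check_connection():
                return self.prologix.send(command)  # just a glitch, resend
            # the GPIB instrument itself is unresponsive: one distinct alert, then stay quiet
            logger.critical('GPIB instrument behind the prologix is unresponsive')
            raise RuntimeError('dead GPIB connection')
        return reply
```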

It feels a little convoluted, but I'm happy that it catches all undesired behavior, even the ones I don't anticipate ever happening but want to know if they do.

wcpettus commented 8 years ago

There is an alternative, which feels clunkier and generally produces more Slack duplicates. Without the "is_repeater" flag, EthernetProvider will send its own Slack alert for both of the latter two cases, but on the second get failure an error gets passed back to RepeaterProvider. The challenge is that every check_connection from the repeater now has to send with a special flag to EthernetProvider so that EthernetProvider doesn't send a new Slack alert every time (a la Balvenie on Tuesday; 10 a minute gets tiresome). The behavior where these instruments keep trying to reconnect every time is good, and I don't want to suppress that for the RepeaterProvider/GPIBs. And it can't be an internal variable of EthernetProvider, because we don't want one dead GPIB to suppress warnings from the others, so it has to be a special argument to the send method.
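
For clarity, the clunky part would look roughly like this (untested sketch with a hypothetical signature, just to show where the flag has to live):

```python
import socket
import logging

logger = logging.getLogger(__name__)


class EthernetProvider(object):
    """Untested sketch of the 'special argument to send' alternative."""

    def __init__(self, sock):
        self.sock = sock

    def reconnect(self):
        # placeholder: the real code would close and re-open the socket
        pass

    def send(self, command, suppress_alert=False):
        try:
            self.sock.sendall((command + '\r\n').encode())
            return self.sock.recv(1024).decode()
        except socket.error:
            self.reconnect()
            if not suppress_alert:
                # normal callers still get the slack-visible warning...
                logger.critical('lost ethernet connection; reconnect attempted')
            # ...but the repeater's check_connection has to pass suppress_alert=True
            # on every call, which is the part that feels clunky
            raise
```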

I think I got that right, and I think the option of the previous post is cleaner.

laroque commented 8 years ago

This still violates some principles of layering that are important to the way things are structured. Let me take your three cases above in reverse order and walk through what I'd expect:

I'm back to thinking that it could make sense to let services die. Maybe not by default, but in the case where the lockin is dead, it makes sense that its service would crash. After all, if something bad happened and a person needs to do something like power cycle the device, it isn't so hard to also restart a service. The alternative is for providers to internally cache the health of their connection.

wcpettus commented 7 years ago

I've implemented a minimal fix to this and done some corresponding cleanup to EthernetProvider and MuxerProvider in the process (mostly moving special muxer treatment out of Ethernet and back into its dedicated class).

The current behavior (in the feature branch):

Some possible improvements:

guiguem commented 7 years ago

In general, the logic seems better and the code cleaner, so I am not sure I have any recommendations.

From a Slack perspective, the spam is still tolerated: a service can send 30 messages within 5 minutes before being "muted", so we should be able to catch the 3 (the number of attempts at starting a service via supervisor) times 1-2 slack messages you are talking about. We might decide to decrease that limit; the number of reconnects per minute would then be a good estimate of the tolerance we need...

wcpettus commented 7 years ago

Merged into develop and slated for release in v1.5.2