EthernetProvider failure mode

project8 / dragonfly

a dripline-based slow-control implementation

Other

0 stars 0 forks source link

EthernetProvider failure mode #54

Closed wcpettus closed 7 years ago

wcpettus commented 8 years ago

EthernetProvider failures repeatedly spamming slack are irritating and potentially dangerous if they hide another error message. At some point, the service should probably just crash with warnings sent to slack, and the prologix boxes need to be smart enough to pass the crash on to the appropriate repeater and not crash the longmorn/wolfburn/lagavulin provider.

laroque commented 8 years ago

A more general solution to the slack irritation could be integrated with #56.

Maybe this is obvious already, but it seems reasonable that reconnect could cache the timestamps of the most recent N reconnect attempts. If reconnect() is called more than N times in the last M minutes, then alert and crash.

The dependent services is harder, it is a design feature that services aren't able to track the state of other services (including when they go up or come down). I believe that the correct behavior is that when one service fails when trying to communicate with another service, it should determine how it responds. Hardware services should probably crash when their prologix box is down, then the entire set can be restarted once it is resolved (once supervisord is deployed that becomes a trivial step). There may be other services which implement logic on multiple targets which should not crash so that they can continue on their other duties.

wcpettus commented 8 years ago

In the current version of Ethernet/Repeater providers, this is the handling of a disconnected instrument:

lockin (or other GPIB):

EthernetProvider (for prologix) will get command list (including the ++addr meant for itself), and will throw an error when it tries to communicate with the instrument and gets no response within timeout (cannot connect slack alert 1)
EthernetProvider attempts to reconnect, succesfully reconnects
EthernetProvider attempts to resend the command list fails again (cannot connect slack alert 2)
there is no except loop to catch this error, dripline exception raised, dripline exception passed back up the chain
user receives an error code response

ardbeg (or any provider that directly uses EthernetProvider):

EthernetProvider gets the command, throws an error when socket.recv times out (cannot connect slack alert)
EthernetProvider attempts to reconnect, fails at socket.connect
reconnect takes ~60 sec before failing, so dripline_agent times out and user gets a timeout, not an error code
if using wolfburn, the lockin service will show a nasty traceback

Attempting to start a service to a disconnected instrument:

lockin (or other GPIB):

service starts without issue, all commands will fail

ardbeg (or any provider that directly uses EthernetProvider):

service crashes at socket.connect (in ~3 sec this time), endpoints not bound

A weird failure mode for ardbeg:

when disconnecting and reconnecting the ethernet cable, ardbeg can detect that there was a network reconnect and goes unresponsive
still allows socket.connect, but becomes incapable of sending replies
requires a hard reconnect

wcpettus commented 8 years ago

A couple proposals:

[ ] EthernetProvider services should require at least one cmd_at_reconnect to ensure the instrument is communicative. The reply should be checked for non-zero length. (there can even be a default like *OPC? for all the SCPI providers so we only need to add a few special cases)
[ ] RepeaterProvider should have a similar cmd_at_init check, since they don't have a reconnect method it's not a direct correspondence.
[ ] RepeaterProvider should be able to check the error codes. Since we are raising a specific code, that could trigger a logger.critical response which slack picks up.

If we are making slack smarter in handling the error messages, there is no necessity for the services to actually crash when they go unresponsive. But that can also be considered.

laroque commented 8 years ago

I definitely like the first one (I think I discussed this when upgrades to the ethernet provider were started but maybe I mis-remember or maybe it was forgotten) and the second at least for hard dependencies like RepeaterProvider (less sensible for something like the daq interface which may have lots of dependencies which are nested and what not).

In the case of an unhandled DriplineException, it should already result in a logger.critical. Most, however, are handled and may or may not do so. I agree that the Repeater could/should be looking at the return code. In some cases the error should be passed back to the user (for example, if the instrument reports an error, there's nothing wrong with the repeater but the user will want to know). Others, like the repeater being unable to communicate with its target, should result in local errors for the repeater.

I like crashing over going "inactive" because it seems more concrete. What would an "inactive" service do? Does it stop its internal logging; does it reject incoming requests; are there some things it still responds to; is there some way to make it "active" again other than restarting it? Do we have any particular cases where a restart is slow and/or does not get us back to the proper state? (if so then that's probably a cause for concern, since we do want to be sure things are in a consistent and easily reproducible state)

wcpettus commented 8 years ago

The current behavior of EthernetProvider is that every new command will try to reconnect, but the service will never crash. For most instruments we are logging multiple endpoints every 30 sec, resulting in a never-ending stream of cannot connect warnings (but the stream will be slow because the reconnect timeout is long?). For a few instruments like ardbeg without automatic logging, we will only fail as many times as a user (or a service like ESR) is pushing commands. And for instruments communicating through GPIB, things happen faster because the prologix reconnect is faster, so we can get more errors.

The more fundamental question is, what do we gain by killing the service vs allowing it to continue running and continue trying to reconnect (hopefully with suppressed error messaging)? Is there ever a case where an instrument would lose connectivity for 5/10/60 minutes and then a reconnect would be successful? If so, allowing the service to continue running is helpful. If not it's just repeatedly hitting the same wall.

wcpettus commented 8 years ago

I checked the repeated failure modes today:

For direct ethernet provider instruments, when the reconnect method fails (socket.connect doesn't work), all future commands will fail silently. The socket.send() will generate a socket.error since the socket is unconnected, which triggers the reconnect block, which fails. This all is silent (after the first "cannot connect" warning), so the service will continue running and continue failing.
if the operator is sending commands, he/she will get: "2016-08-26T16:58:27[Level 25] dragonfly.subcommands.dripline_agent(85) -> return message: [Errno 113] No route to host"
For GPIB instruments, because the socket is good, we keep throwing these errors.

If EthernetProvider has an "is_repeater" flag for the prologix services, we can trigger different warning behavior for the prologix and send the warning back to the GPIB instruments where noise suppression can be done.

laroque commented 8 years ago

If a "direct ethernet provider" instrument is later able to reconnect, does it everything start working again from the user's perspective? If things start to work, but then fail again, will it issue the warning again? If so the first two bullet points seem like the perfect behavior, and like what we want to duplicate for GPIB devices that require a repeater.

Why can't the prologix provider, when it has trouble talking to the prologix box send a ReplyMessage with an appropriate returncode (203 "Hardware Connection Error" seems right). Then the Repeater can re-raise that exception when it gets the reply and the endpoint is free to handle that exception or crash.

It seems like the returncode can serve an equivalent purpose to the "is_repeater" flag, without breaking the layered nature of the system. If it isn't, the direct connection should potentially be using a similar behavior, catching the socket error and returning a code 203 to the user.

wcpettus commented 8 years ago

If a "direct ethernet provider" instrument is later able to reconnect, does it everything start working again from the user's perspective? Yes. The current EthernetProvider only issues the logger.critical message from the get method, it can only reach the get method if there is a valid socket connected, otherwise the send_command method will raise a socket.error when it tries to socket.send. Caveat - In the upcoming fix of EthernetProvider, I want to tweak the error message so that it tells the user if the reconnect is successful or not. So I need to not break the above behavior.

I'm not sure on the proper use of the "is_repeater" flag, I've rehashed this three times and gotten three different answers. Needing a reconnect is bad behavior for any EthernetProvider service, so we want to know if we ever trigger that method. But how do we get the right alert behavior between

a connection glitch which then triggers a successful reconnect, followed by successfully send/get sequence (want to send one Slack alert but also successfully return the endpoint value?)
a GPIB instrument failure, triggering a successful reconnect, followed by a failed send/get (want a Slack alert telling of doom)

My current best solution is an "is_repeater" flag so that if a prologix goes to the reconnect block and successfully reconnect, it returns an error code to the RepeaterProvider to attempt its own "reconnect". If the higher level check_connection is successful, it will resend the command and all should be well, it was just a glitch. If the higher level check_connection fails, it's probably a GPIB instrument issue, needs a "dead_connection" flag raised in the repeater service, and gets a different Slack alert. The three cases will look like:

wolfburn dies: ethernet reconnect fails, EthernetProvider will issue its own Slack warning; error code passed back to RepeaterProvider, which fails its connection, and issues a Slack warning; lockin in dead state, wolfburn in dead state; two Slack alerts
lockin dies: ethernet reconnect succeeds, but is_repeater flag triggers special raise behavior to pass warning back up a level; RepeaterProvider attempts a reconnect and fails, issues Slack warning; lockin in dead state, wolfburn in connected state; one Slack alert
connection glitch: ethernet reconnect succeeds, as above; RepeaterProvider reconnects successfully, sends Slack warning; Repeater resends command to Ethernet, which should succeed; both in live state; one Slack alert

RepeaterProvider will need a check_connection method which is executed at init and is triggered whenever the dead_connection flag gets raised. dead_connection flag should block excessive Slack warnings from the GPIB instrument service.

It feels a little convoluted, but I'm happy that it catches all undesired behavior, even the ones I don't anticipate ever happening but want to know if they do.

wcpettus commented 8 years ago

There is an alternative, which feels clunkier, and generally has more Slack duplicates - Without the "is_repeater" flag, EthernetProvider will send its own Slack alert for both the second two cases, but on the second get failure an error gets passed back to RepeaterProvider. The challenge is now that every check_connection of the repeater has to send with a special flag to EthernetProvider so that EthernetProvider doesn't send a new Slack alert every time (a la Balvenie on Tuesday, 10 a minute gets tiresome). The behavior of these instruments that they keep trying to reconnect every time is good, I don't want to suppress that for the RepeaterProvider/GPIB's. And it can't be an internal variable of the EthernetProvider because we don't want one dead GPIB to suppress warnings from others, so it has to be a special argument to the send method.

I think I got that right, and I think the option of the previous post is cleaner.

laroque commented 8 years ago

This still violates some principles of layering that are important to the way things are structured. Let me take your three cases above in reverse order and walk through what I'd expect:

connection glitch: These happen in any networks, as long as they aren't frequent and successfully reconnect they aren't a problem but I agree it is useful to know. Your procedure seems fine.
lockin dies: the command returns no data so the EthernetProvider tries to reconnect. We add a "ensure_connected" method that uses some simple command to test communication. For the prologix boxes we could use ++addr to query the current address, or there may be some IDN? or OPC? equivalent. We could also create PrologixProvider(EthernetProvider) which is identical except that if it gets no data it does that same test before trying to reconnect. If there is no reply to ++addr then normal reconnect logic applies, if the prologix box replies but the instrument returned no data we have error code 203, return that to the RepeaterProvider. (note that thus far, if the problem is the lockin, there have been no slack alerts, the service is responsible for the prologix box which is working). The repeater provider sees code 203 and issues a slack alert that the lockin died.
wolfburn dies: state of the lockin is unknown. If unable to reconnect then EthernetProvider should issue a warning to slack. The RepeaterProvider must still get some timeout which seems likely to produce a 2nd warning in slack which isn't great... but it is the case that this service is unresponsive so that's correct.

I'm back to thinking that it could make sense to let services die. Maybe not by default, but in the case where the lockin is dead, it makes sense that its service would crash. After all, if something bad happened and a person needs to do something like power cycle the device, it isn't so hard to also restart a service. The alternative is for providers to internally cache the health of their connection.

wcpettus commented 7 years ago

I've implemented a minimal fix to this and done some corresponding cleanup to EthernetProvider and MuxerProvider in the process (mostly moving special muxer treatment out of Ethernet and back into its dedicated class).

The current behavior (in the feature branch):

all ethernet services require the final reconnect command to return a '1', if this check fails a logger.critical is sent to slack and the socket is broken
- config files should be thoroughly tested, as this is an easy way to send a lot of slack messages and break your service
failure of any ethernet instrument: single logger.critical at line 183 of EthernetProvider, this will then trigger a reconnect attempt
- if successful, the command will be resent and will presumably work (or if a syntax error exists, we'll get another logger.critical)
- if unsuccessful, the socket will not be established and no logger.critical will be sent
- only way to spam error messages to slack is by repeatedly sending a bad command, which should be caught when editing the config file
failure of any repeater service: double logger.critical at line 183 of EthernetProvider since reconnect will be successful and second send will fail again
- this will raise a 202 code exception (DriplineHardwareResponselessError) which RepeaterProvider will catch, send a logger.critical, and crash with sys.exit()
- if the ethernet instrument we are repeating from (e.g. wolfburn) crashes, the exception code is 201 (DriplineHardwareConnectionError) which passes the error back but doesn't crash the repeater service as socket errors are suppressed

Some possible improvements:

when RepeaterProvider crashes with sys.exit(), it doesn't pass along a DriplineException, so any operator has to wait for the timeout and notice the Slack error; I'm not sure how to get around this while still crashing the service
- alternatively the service could go into "safe mode", but I haven't had a good vision for how we break out of such a safe mode without spamming more error messages
add a comparable connection check for repeated (e.g., GPIB) instruments, this will have to allow the repeat_target time to launch
better error messages - currently an error to Slack won't tell the user what happened when the reconnect was attempted
- if _reconnect took a broadcast=(bool) argument, it would be possible to control when logger.critical messages got sent from _reconnect:
- __init__ would call _reconnect with broadcast=True
- send would call _reconnect with broadcast=True for except DriplineHardwareResponselessError
- send would call _reconnect with broadcast=False for except socket.error (since an error was already broadcast whenever the socket broke)
- this would probably require more logger.critical's to be sent, because otherwise on a successful reconnect, the second attempted send couldn't send an error to Slack if it failed
- Ok, this is a bit of a mess

guiguem commented 7 years ago

In general, the logic seems better and the code cleaner, so I am not sure to have recommendations.

From a Slack perspective, the spam is still tolerated: a service can send 30 messages within 5 minutes before being "muted", so we should be able to catch the 3 (number of attempts of starting a service via supervisor) times 1-2 slack messages you are talking about. We might decide to decrease it, the number of reconnect per minute might then be a good estimation of the tolerance we need...

I guess the typical timeout are 10 sec, so it should be ok, unless it is a run.
Not sure what you mean...
I would be in favor of a message after a successful reconnect if a alert has been sent before due to an impossible first connection, aka the complex logic... Else the operator might panic, when things are going all right. This is not a general rule, but when a critical appears, it means that the operator or someone else needs to have a look to something. In this case (reconnection successful), we should say that at the end everything went fine...

wcpettus commented 7 years ago

Merged into develop and slated for release in v1.5.2