twisted / twisted

Event-driven networking engine written in Python.
https://twisted.org

ClientService option to fail on the first failure, instead of reconnecting #9116

Closed: twisted-trac closed this issue 7 years ago

twisted-trac commented 7 years ago
@warner reported
Trac ID trac#9116
Type enhancement
Created 2017-04-13 21:20:29Z

In Magic-Wormhole, when you run wormhole send, the first thing it does is to spin up a ClientService to manage the connection to our "rendezvous server". Once the connection succeeds, the automatic re-connection logic is perfect: the client can now tolerate the server getting bounced in the middle of a long-lived file-send operation. I'm super happy about this being available, as it replaced a lot of custom code in my app.

However, the fact that it hides an initial failure is a bit of a drag. Most of the time, if the first connection fails, a second connection is unlikely to succeed either. We should really report this error to the user instead of silently trying again and again forever: errors like a DNS failure or a complete lack of internet connectivity aren't going to fix themselves on retry.

(I'm reminded of the Hitchhiker's Guide story about the cruise passengers held in suspended animation for centuries, waiting for civilization to be rebuilt so that delivery of the refreshing lemon-soaked paper napkins can be completed.)

What I'm thinking is an extra argument to ClientService(), something like failOnFirstError=True. If present, at startup the state machine is diverted to a separate connecting_for_first_time state, which behaves a lot like the normal connecting state except in how it reacts to a failure:

- The Service remains "running", so callers can call .stopService() as usual, rather than having an error quietly pretend to be the application and stop the service behind its back.
- The error is "sticky", so a .whenConnected() called after the error occurs can still see the error (instead of accidentally waiting for a new connection that will never come). This is especially helpful for errors that happen quickly (even synchronously) because the network is completely unavailable.
- Restarting the service clears the sticky error.
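For illustration, here is roughly what the proposed API might look like in use. This is a hypothetical sketch only: the failOnFirstError argument described above was never added to Twisted (the ticket was ultimately resolved with a failAfterFailures= argument to whenConnected() instead), and the endpoint and factory are placeholders.

```python
# Hypothetical sketch: failOnFirstError is the argument proposed in this
# comment, not a real ClientService parameter (the ticket was later resolved
# with whenConnected(failAfterFailures=) instead).
from twisted.application.internet import ClientService
from twisted.internet import reactor
from twisted.internet.endpoints import clientFromString
from twisted.internet.protocol import Factory, Protocol

endpoint = clientFromString(reactor, "tcp:rendezvous.example:4000")  # placeholder
service = ClientService(endpoint, Factory.forProtocol(Protocol),
                        failOnFirstError=True)  # proposed argument, not real
service.startService()

d = service.whenConnected()
# Under the proposal, a failure of the very first attempt would leave the
# service "running" but make this Deferred fail with a sticky error, instead
# of the service silently retrying forever.
d.addErrback(lambda reason: service.stopService())
```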

If we make it far enough to get a connection established, even once, then any failure past that point will trigger a reconnect as before. We might reconsider this if/when we implement the extra initialization function from #8375: maybe a failure of that function should inhibit reconnection too. E.g. if your IMAP login credentials are wrong, then retrying the connection isn't going to help.

I'm working on a branch for this, since otherwise I'll need to copy all of ClientService into magic-wormhole to make a private modified copy there. I'll add an update here when I've got some code to look at.

twisted-trac commented 7 years ago
@warner set owner to @warner
@warner set status to closed

In changeset 1f4a41d466e6f8645688aa7ec4fefc0eafe02822

ClientService.whenConnected: add failAfterFailures=

If set to an integer N, the Nth connection failure will cause the
whenConnected Deferred to fail (with the same Failure as the endpoint.connect
reported). So setting it to 1 means the very first failure will be delivered,
setting it to 2 means the first failure is ignored but a second consecutive
failure will be reported, etc.

This allows applications to react differently to initial connection problems,
such as a DNS error, or a complete lack of internet connectivity. If the
first attempt fails, additional attempts are unlikely to do any better. For
some applications, silently re-attempting the doomed connection forever may
cause more confusion than simply emitting an error and terminating
immediately.

The default value (None) retains the current behavior of waiting forever (or
at least until the service is stopped).

closes ticket:9116
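For illustration (not part of the changeset), a minimal sketch of using the argument described above; the endpoint string and factory are placeholders.

```python
from twisted.application.internet import ClientService
from twisted.internet import reactor
from twisted.internet.endpoints import clientFromString
from twisted.internet.protocol import Factory, Protocol

service = ClientService(
    clientFromString(reactor, "tcp:example.com:4000"),  # placeholder endpoint
    Factory.forProtocol(Protocol),                       # placeholder factory
)
service.startService()

# failAfterFailures=1: the very first connection failure errbacks the
# Deferred; 2 would ignore one failure and report the second; the default
# (None) keeps the old wait-forever behavior.
d = service.whenConnected(failAfterFailures=1)
d.addCallback(lambda proto: print("connected:", proto))
d.addErrback(lambda reason: print("giving up:", reason.getErrorMessage()))
```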
twisted-trac commented 7 years ago
@warner set status to reopened

sigh, that wasn't supposed to get closed until the commit containing that comment landed on trunk.

twisted-trac commented 7 years ago
@warner set status to closed

In changeset adbacd620d6fb58f8819ceb2a79e1d67f584c496

Merge pull request #772 from twisted/9116-fail-after-failures

Author: warner

Reviewer: glyph, exarkun

Fixes: ticket:9116

Add failAfterFailures= to t.a.i.ClientService.whenConnected()
twisted-trac commented 7 years ago
@warner commented

PR filed in https://github.com/twisted/twisted/pull/767

twisted-trac commented 7 years ago
@warner commented

The implementation is in fact simpler. However, actually using it reveals an interleaving hazard, which we need to decide how to resolve.

The whenConnected Deferred will fire as part of the (input=)_connectionFailed transition away from (oldstate=)_connecting, which is supposed to take it to (newstate=)_waiting. However, Automat currently runs the output functions in the middle of the transition, so at the moment the Deferred fires, the state machine will still be in _connecting. If our callback immediately calls stopService(), it will catch the state machine in the middle of the transition, with calamitous results. In this case, it will cause a reentrant transition from oldstate=_connecting via input=stop to newstate=_disconnecting, and it will start waiting for a connection attempt to fail (which will never happen, because it already failed once).

The proper fix for this (and all interleaving hazards) is of course an eventual-send. If this were Tahoe-LAFS or Foolscap, I'd throw in a call to eventually(), using the routine Glyph wrote for me over 10 years ago. The downsides to using eventual-send are:

The spectrum of solutions I can think of is:

Options 4 and 5 require Automat to be aware of a reactor. Option 1 doesn't require any eventual-sends. Both 1 and 2 are likely to cause bugs down the line, as people won't notice the interaction hazards.

I mention fixing this in Automat because Glyph and I have talked about this in the past, in the context of HostnameEndpoints. He was considering similar workarounds (under the name "feedback", I think?), but I don't know how far that idea has progressed.

If I had my druthers I'd go with fix 5, since that will address the problem most generally, but the dependency and abstraction changes it would need are probably too much.

So I guess I'd recommend fix number 3. It's not a huge change, since ClientService already holds a reactor reference, and uses it for nearly the same purpose (setting the retry timer) as an eventual-send queue would.

There's a question of what ordering guarantees we get out of a reactor.callLater(0), which then dictates whether two whenConnected(1) Deferreds will fire in the order in which they were created, or in a random order. If callLater(0) doesn't guarantee ordering, we can just expose that to applications (tell them not to rely upon that order) without much loss.
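For reference, a minimal sketch of the eventual-send idea behind the recommended fix; the helper name eventually() is borrowed from the Foolscap routine mentioned above and is not a Twisted or Automat API.

```python
from twisted.internet import reactor

def eventually(f, *args, **kw):
    # Schedule f to run on a later reactor turn, so the current call stack
    # (for example, an in-progress Automat state transition) fully unwinds
    # before f executes.
    reactor.callLater(0, f, *args, **kw)

# Shape of the fix: rather than errbacking the whenConnected Deferred
# directly from the output function, ClientService would do something like
# eventually(d.errback, failure), using the reactor/clock reference it
# already holds for the retry timer.
```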

twisted-trac commented 7 years ago
@glyph commented

This wasn't in review, but I have a design issue with it, so I will comment even though the tests are failing ;).

This should definitely be an argument, but it should not be an argument to __init__, as implemented. Instead, it should be an argument to whenConnected. As long as the service is running, it should keep connecting. However, in your example, you should say clientService.whenConnected(failAfterFailures=1), and when it fails, stop the service (as well as, presumably, exit the program!).
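A sketch of the pattern described here, with a placeholder endpoint and factory and an illustrative shutdown step:

```python
from twisted.application.internet import ClientService
from twisted.internet import reactor
from twisted.internet.endpoints import clientFromString
from twisted.internet.protocol import Factory, Protocol

clientService = ClientService(
    clientFromString(reactor, "tcp:rendezvous.example:4000"),  # placeholder
    Factory.forProtocol(Protocol),                              # placeholder
)
clientService.startService()

def firstAttemptFailed(reason):
    # Give up after the first failure: stop the service and exit the program.
    clientService.stopService()
    reactor.stop()

clientService.whenConnected(failAfterFailures=1).addErrback(firstAttemptFailed)
reactor.run()
```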

In fact, I'd go so far as to say that whenConnected should have a finite default for failAfterFailures; no request timeout should be unbounded.

What do you think?

twisted-trac commented 7 years ago
@warner commented

Interesting idea... let me experiment with that API in magic-wormhole and see how it feels.

(tests ought to be passing now, trunk was broken. totally not my fault :)

twisted-trac commented 7 years ago
@warner commented

I think I like it. I'm uncertain about the non-infinite default, since I've kind of learned to avoid imposing policy on applications (default timeouts in particular, and this is sort of a timeout). OTOH, it's the application that chooses to call whenConnected, so it's not like the whole ClientService's lifetime is being influenced by that default. "We don't impose a policy, but if you ask for status, we'll give you one in finite time" seems like a plausible policy.

Let me see if I can find a clean way to implement that. I suspect it will be a smaller patch than what I've got now. And I'll see if I can figure out how to fix the remaining apidocs tests that I missed before.

twisted-trac commented 7 years ago
@warner commented

Oh, apparently Automat already does solution 1, which may be good enough to land this.

The remaining danger is that transition 1 calls outputs A+B+C, and output A causes transition 2, which calls outputs D+E+F. The order of execution will be ADEFBC, and outputs B+C will be executed while the state machine is in the new-state of transition 2, which may be surprising.

New PR is in https://github.com/twisted/twisted/pull/770

twisted-trac commented 7 years ago
@warner commented

PR passes tests, ready for review.

(something is funky with the coverage: everything I added is covered, as expected, but things completely unrelated to my patch appear to have lost coverage. what's going on?)

One question for review: is the syntax I used in the docstring legal? I've been unable to run the docs build locally, to see whether it can render that asterisk-delimited list properly.

twisted-trac commented 7 years ago
@glyph commented

> (something is funky with the coverage: everything I added is covered, as expected, but things completely unrelated to my patch appear to have lost coverage. what's going on?)

Not all the builders are running, because you pushed to a PR in your own fork rather than to the twisted repository proper. Of course you probably needed to do that since you didn't have write access.

Since you previously held the rights & responsibilities of a Twisted committer and just didn't get migrated in the GitHub migration, I've re-added you. Accept the email invites &c &c and put the branch into twisted/twisted, and you'll see the coverage numbers go back up. (You'll also see the relevant OS X build happen and actually turn your review status green.)

twisted-trac commented 7 years ago
@glyph commented

Reviewed in https://github.com/twisted/twisted/pull/770#pullrequestreview-34250829; with some caveats, I think it's good to go.