twisted-trac closed this ticket 7 years ago.
@warner set owner to @warner and set status to closed.
In changeset 1f4a41d466e6f8645688aa7ec4fefc0eafe02822
ClientService.whenConnected: add failAfterFailures=
If set to an integer N, the Nth connection failure will cause the
whenConnected Deferred to fail (with the same Failure that endpoint.connect
reported). So setting it to 1 means the very first failure will be delivered,
setting it to 2 means the first failure is ignored but a second consecutive
failure will be reported, etc.
This allows applications to react differently to initial connection problems,
such as a DNS error, or a complete lack of internet connectivity. If the
first attempt fails, additional attempts are unlikely to do any better. For
some applications, silently re-attempting the doomed connection forever may
cause more confusion than simply emitting an error and terminating
immediately.
The default value (None) retains the current behavior of waiting forever (or
at least until the service is stopped).
closes ticket:9116
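As a usage illustration (the endpoint string and bare `Protocol` here are placeholders, not from the ticket), this is roughly how an application would opt into the fail-fast behavior the commit message describes:

```python
from twisted.application.internet import ClientService
from twisted.internet import reactor
from twisted.internet.endpoints import clientFromString
from twisted.internet.protocol import Factory, Protocol

# Placeholder endpoint and protocol; substitute your own.
endpoint = clientFromString(reactor, "tcp:host=example.com:port=4321")
service = ClientService(endpoint, Factory.forProtocol(Protocol))
service.startService()

# failAfterFailures=1: deliver the very first connection failure
# instead of silently retrying forever.
d = service.whenConnected(failAfterFailures=1)

def connected(protocol):
    pass  # start talking to the server

def failed(reason):
    print("initial connection failed:", reason.getErrorMessage())
    service.stopService()

d.addCallbacks(connected, failed)
```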
@warner set status to closed.
In changeset adbacd620d6fb58f8819ceb2a79e1d67f584c496
Merge pull request #772 from twisted/9116-fail-after-failures
Author: warner
Reviewer: glyph, exarkun
Fixes: ticket:9116
Add failAfterFailures= to t.a.i.ClientService.whenConnected()
@warner commented:
PR filed in https://github.com/twisted/twisted/pull/767
@warner commented:
The implementation is in fact simpler. However, actually using it reveals an interleaving hazard, which we need to decide how to resolve.
The `whenConnected` Deferred will fire as part of the (input=) `_connectionFailed` transition away from (oldstate=) `_connecting`, which is supposed to take it to (newstate=) `_waiting`. However, Automat currently runs the output functions in the middle of the transition, so at the moment the Deferred fires, the state machine will still be in `_connecting`. If our callback immediately calls `stopService()`, then it will catch the state machine in the middle of a transition, with calamitous results. In this case, it will cause a reentrant transition from oldstate=`_connecting` via input=`stop` to newstate=`_disconnecting`, and it will start waiting for a connection attempt to fail (which will never happen, because it happened once already).
The proper fix for this (and all interleaving hazards) is of course an eventual-send. If this were Tahoe-LAFS or Foolscap, I'd throw in a call to `eventually()`, using the routine Glyph wrote for me over 10 years ago. The downsides to using eventual-send are:
* failures are delivered through `callLater`, so stack traces in exceptions are less useful
* unit tests must use `yield d` or `yield self.assertFailure()` instead of a synchronous `self.successResultOf(d)` or `self.failureResultOf(d)`, making them slightly slower and causing them to need a reactor

The spectrum of solutions I can think of are:

1. change Automat to finish the transition (enter the new state) before running the output functions
2. document the hazard, and tell applications not to trigger reentrant transitions from their callbacks
3. have ClientService fire the `whenConnected` Deferred through an eventual-send (`reactor.callLater(0)`)
4. have Automat deliver outputs through an eventual-send
5. have Automat run everything through an eventual-send

4 and 5 require Automat to be aware of a reactor. 1 doesn't require any eventual-sends. Both 1 and 2 are likely to cause bugs down the line as people don't notice the interaction hazards.
I mention fixing this in Automat because Glyph and I have talked about this in the past, in the context of HostnameEndpoints. He was considering similar workarounds (under the name "feedback", I think?), but I don't know how far that idea has progressed.
If I had my druthers I'd go with fix 5, since that will address the problem most generally, but the dependency and abstraction changes it would need are probably too much.
So I guess I'd recommend fix number 3. It's not a huge change, since ClientService already has a reactor reference, and for nearly the same purpose (to set the retry timer) as an eventual-send queue.
There's a question of what ordering guarantees we get out of a `reactor.callLater(0)`, which then dictates whether two `whenConnected(1)` Deferreds will fire in the order in which they were created, or in a random order. If `callLater(0)` doesn't guarantee ordering, we can just expose that to applications (tell them not to rely upon that order) without much loss.
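For illustration, a rough sketch of what fix 3 could look like (class and attribute names here are hypothetical, not ClientService's actual internals): waiting Deferreds are fired through `reactor.callLater(0)`, so application callbacks only run after the state machine has finished its transition.

```python
from twisted.internet import defer

class EventualSendSketch:
    """Hypothetical: deliver whenConnected results via an eventual-send."""

    def __init__(self, reactor):
        self._reactor = reactor
        self._awaitingConnected = []

    def whenConnected(self):
        d = defer.Deferred()
        self._awaitingConnected.append(d)
        return d

    def _connectionFailed(self, failure):
        # Deliver the failure on a later reactor turn: an errback that
        # immediately calls stopService() can no longer catch the state
        # machine in the middle of a transition.
        waiting, self._awaitingConnected = self._awaitingConnected, []
        for d in waiting:
            self._reactor.callLater(0, d.errback, failure)
```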
@glyph commented:
This wasn't in review, but I have a design issue with it, so I will comment even though the tests are failing ;).
This should definitely be an argument, but it should not be an argument to `__init__`, as implemented. Instead, it should be an argument to `whenConnected`. As long as the service is running, it should keep connecting. However, in your example, you should say `clientService.whenConnected(failAfterFailures=1)`, and when it fails, stop the service (as well as, presumably, exit the program!).

In fact, I'd go so far as to say that `whenConnected` should have a finite default for `failAfterFailures`; no request timeout should be unbounded.
What do you think?
@warner commented:
I think I like it. I'm uncertain about the non-infinite default, since I've kind of learned to avoid imposing policy on applications (default timeouts in particular, and this is sort of a timeout). OTOH, it's the application that chooses to call `whenConnected`, so it's not like the whole `ClientService`'s lifetime is being influenced by that default. "We don't impose a policy, but if you ask for status, we'll give you one in finite time" seems like a plausible policy.
Let me see if I can find a clean way to implement that. I suspect it will be a smaller patch than what I've got now. And I'll see if I can figure out how to fix the remaining apidocs tests that I missed before.
@warner commented:
Oh, apparently Automat already does solution 1, which may be good enough to land this.
The remaining danger is that transition 1 calls outputs A+B+C, and output A causes transition 2, which calls outputs D+E+F. The order of execution will be A, D, E, F, B, C, and outputs B+C will be executed while the state machine is in the new state of transition 2, which may be surprising.
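To make that ADEFBC ordering concrete, here is a plain-Python sketch (deliberately not real Automat code) of a machine that completes the state change before running outputs, where the first output of transition 1 triggers transition 2:

```python
class OrderingSketch:
    """Toy model of 'finish the state change, then run outputs'."""

    def __init__(self):
        self.trace = []

    def transition(self, outputs):
        # The state change itself would happen here, before any outputs run.
        for output in outputs:
            output()

    def A(self):
        self.trace.append("A")
        # Reentrant input: output A triggers transition 2 (outputs D+E+F)
        # before B and C of transition 1 have had a chance to run.
        self.transition([self.D, self.E, self.F])

    def B(self): self.trace.append("B")
    def C(self): self.trace.append("C")
    def D(self): self.trace.append("D")
    def E(self): self.trace.append("E")
    def F(self): self.trace.append("F")

sketch = OrderingSketch()
sketch.transition([sketch.A, sketch.B, sketch.C])
print("".join(sketch.trace))  # ADEFBC
```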
New PR is in https://github.com/twisted/twisted/pull/770
@warner commented:
PR passes tests, ready for review.
(something is funky with the coverage: everything I added is covered, as expected, but things completely unrelated to my patch appear to have lost coverage. what's going on?)
One question for review: is the syntax I used in the docstring legal? I've been unable to run the docs build locally, to see whether it can render that asterisk-delimited list properly.
@glyph commented:
> (something is funky with the coverage: everything I added is covered, as expected, but things completely unrelated to my patch appear to have lost coverage. what's going on?)
Not all the builders are running, because you pushed to a PR in your own fork rather than to the twisted repository proper. Of course you probably needed to do that since you didn't have write access.
As, historically, you've previously held the rights & responsibilities of a Twisted committer and just didn't get migrated in the github migration, I've re-added you. Accept the email invites &c &c and put the branch into twisted/twisted, and you'll see the coverage numbers go back up. (You'll also see the relevant OS X build happen and actually turn your review status green.)
@glyph commented:
Reviewed in https://github.com/twisted/twisted/pull/770#pullrequestreview-34250829 - with some caveats I think it's good to go.
Original ticket description (from @warner):

In Magic-Wormhole, when you run `wormhole send`, the first thing it does is to spin up a `ClientService` to manage the connection to our "rendezvous server". Once the connection succeeds, the automatic re-connection logic is perfect: the client can now tolerate the server getting bounced in the middle of a long-lived file-send operation. I'm super happy about this being available, as it replaced a lot of custom code in my app.

However, the fact that it hides an initial failure is a bit of a drag. Most of the time, if the first connection fails, then it's unlikely that a second connection is going to succeed. We should really report this error to the user instead of silently trying again and again forever. Things like a DNS error or a complete lack of internet connectivity mean that retrying is almost certainly doomed.
(I'm reminded of the Hitchhiker's Guide story about the cruise passengers held in suspended animation for centuries, awaiting civilization to be rebuilt and complete delivery of the refreshing lemon-soaked paper napkins)
What I'm thinking is an extra argument to `ClientService()`, something like `failOnFirstError=True`. If present, at startup the state machine is diverted to a separate `connecting_for_first_time` state, which behaves a lot like the normal `connecting` state except in how it reacts to a failure:

* `cs.whenConnected()` Deferreds get an errback with the actual connection failure (a `ConnectionFailed` of some sort)
* the Service remains "running", so that callers can call `.stopService()` as usual, rather than the error handler quietly pretending to be the application and stopping the service behind its back
* the error is "sticky", so that a `.whenConnected()` called after the error occurs can still see the error (instead of accidentally waiting for a new connection that will never come); this is especially helpful for errors that happen quickly (even synchronously) because the network is completely unavailable. Restarting the service will clear the sticky error (see the sketch after this list).
* if we make it far enough to get a connection established, even once, then any failure past that point will trigger a reconnect as before. We might reconsider this if/when we implement the extra initialization function from #8375: maybe a failure of that function should inhibit reconnection too. E.g. if your IMAP login credentials are wrong, then retrying the connection isn't going to help.
I'm working on a branch for this, since otherwise I'll need to copy all of ClientService into magic-wormhole to make a private modified copy there. I'll add an update here when I've got some code to look at.