Open btasker opened 1 year ago
I think the simplest way to address this would be to have that select statement check whether the current stream is the only one in the ClientConn and, if so, set doNotReuse.
I don't know whether it's the best method though - HTTP/2's multiplexing complicates things.
At the net/http2 level there doesn't seem to be a way to tell whether a timeout is a connection problem or just that the other end is being slow (or the caller setting an overly low timeout).
As a result, receiving a timeout doesn't necessarily mean that the connection itself is dead: tearing it down might therefore impact working requests transiting the same connection via other streams.
Setting doNotReuse on the connection feels like an acceptable compromise: it allows other streams to carry on unimpeded (or hit their own timeouts), whilst ensuring that a potentially problematic connection is not reused.
It'd probably be tidier to extend ClientConn to include a hasErrored property and achieve the same end, but I'll link a PR in a minute using doNotReuse - I'm happy to go the other way, but it didn't seem worth the effort until it's confirmed that this is the right path.
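As a rough illustration of the doNotReuse approach, here is a toy model in Go. The clientConn type, its fields and both methods are stand-ins invented for this sketch, not the real net/http2 code; it only shows the intended effect, that a timed-out stream flags the connection while other streams are left alone.

```go
package main

import "fmt"

// clientConn is a toy stand-in for http2.ClientConn; the fields are
// illustrative and do not mirror the real struct.
type clientConn struct {
	doNotReuse bool
	streams    map[uint32]bool // IDs of streams currently in flight
}

// onStreamTimeout models the proposed behaviour: the timed-out stream is
// dropped and the connection is flagged, but the connection itself stays
// open so that other streams can finish (or hit their own timeouts).
func (cc *clientConn) onStreamTimeout(id uint32) {
	delete(cc.streams, id)
	cc.doNotReuse = true
}

// canTakeNewRequest reports whether a pool may hand this connection to a
// new request.
func (cc *clientConn) canTakeNewRequest() bool {
	return !cc.doNotReuse
}

func main() {
	cc := &clientConn{streams: map[uint32]bool{1: true, 3: true}}
	cc.onStreamTimeout(1)
	fmt.Println(cc.canTakeNewRequest(), len(cc.streams))
}
```

Running it prints `false 1`: the connection is no longer offered to new requests, and the one remaining stream is untouched.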
CC @neild @bradfitz
Refusing to reuse a connection that's not successfully handling requests seems reasonable to me.
False positives (a request timing out because of a slow server handler, not a bad connection, say) will result in additional connections being unnecessarily created, which seems less bad than continuing to send requests to an unresponsive connection.
Change https://go.dev/cl/486156 mentions this issue: http2: don't reuse connections that are experiencing errors
Change https://go.dev/cl/490335 mentions this issue: net/http2: don't re-use connections that are experiencing errors
@gopherbot Please open backport to 1.19.
Rationale quoting #60301 (which is for the other open minor release):
When encountered, this is a serious issue: requests are blocked until the kernel gives up on the socket (with default settings on a Linux host, that's around 17 minutes). If the operator restarts the application as a result, data loss may occur.
Although the fix is in x/net v0.10.0, it will not be available to most developers, who likely import net/http (and so rely on the h2 bundle rolled into Go's releases).
The only viable workaround is to disable the H2 client entirely via environment variable.
Backport issue(s) opened: #60301 (for 1.20), #60662 (for 1.19).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Change https://go.dev/cl/507395 mentions this issue: http2: revert Transport change from CL 486156
@btasker is
    http2Trans, err := http2.ConfigureTransports(transport)
    if err == nil {
        http2Trans.ReadIdleTimeout = time.Duration(ReadIdleTimeout)
    }
still a valid workaround for go 1.20.x? (I assume yes since I see the fix planned for 1.20.7 Milestone now) What about 1.21?
Is the ReadIdleTimeout of the http2 transport redundant with https://pkg.go.dev/net#Dialer.KeepAlive?
Hey @serbrech
still a valid workaround for go 1.20.x? (I assume yes since I see the fix planned for 1.20.7 Milestone now) What about 1.21?
Yep, still valid for both.
Is the ReadIdleTimeout of the http2 transport redundant with https://pkg.go.dev/net#Dialer.KeepAlive?
I've not checked the interaction, but I suspect it's probably not redundant, no (as the dialer keepalive should be operating at a lower level).
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
What did you do?
A connection is established to a remote server (using HTTP/2 where available) and requests are sent periodically over this connection - it's essentially a write pipeline working through a queue sequentially.
This being the internet, sometimes bad things happen to individual connections and packets go awry.
If a timeout occurs, the connection will (well, should) be closed:
(that CloseIdleConnections() call is sourced from https://github.com/golang/go/issues/36026)
What did you expect to see?
On the next request (generally a retry), a new connection should be established (if possible) and writes should resume.
What did you see instead?
The underlying TCP connection does not get closed - the kernel sits retransmitting the most recent packets (until net.ipv4.tcp_retries2 is reached about 17 minutes later).
Subsequent requests all attempt to use the dead connection and so time out.
Repro
The following can be used to help repro the behaviour
If you build and run that, you should see 403 printed on a new line every second or so (it being 403 isn't pertinent here, it's just easy/convenient to generate).
Check which source port is being used
and then block it in iptables to break the return path
The script should stop printing 403s and after around 5s will print Timed out.
Although the code calls CloseIdleConnections(), it'll silently fail to do anything to the connection (more below).
Subsequent attempts will also print Timed out, and re-running netstat will show that same connection still in an ESTABLISHED state. If you take a packet capture it'll show re-transmission of whatever packet we interrupted by adding the rule.
If you remove the iptables rule once the remote server starts sending resets, the kernel will tear down the connection and requests will go back to behaving as they should.
Hacking it out
It is actually possible to work around this at the caller's level, although it's not exactly pleasant to look at.
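One way such a workaround could look, as a sketch rather than the exact code used: the caller remembers whether the previous attempt timed out and, if so, flags the next request. newRequest and prevTimedOut are illustrative names.

```go
package main

import (
	"fmt"
	"net/http"
)

// newRequest builds a GET request. When the previous attempt timed out it
// adds a "Connection: close" header, so that the connection carrying this
// request is not offered to later requests.
func newRequest(url string, prevTimedOut bool) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if prevTimedOut {
		req.Header.Set("Connection", "close")
	}
	return req, nil
}

func main() {
	req, err := newRequest("https://example.com/", true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Header.Get("Connection"))
}
```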
This works because net/http2 specifically checks for a Connection: close header before placing a request, and sets doNotReuse on the ClientConn, so when the next request comes around, the connection won't be used even if it is otherwise considered still active.
Underlying Issue
The underlying issue seems to be that there's some raciness between normal stream closure and closeIdleConnections().
The crucial check is here
If the map cc.streams contains any entries, the connection isn't considered idle and therefore can't be closed.
Dropping some prints in as a quick test shows it's this conditional we're failing at
Gives
net/http2 does actually try to tidy up the streams: doRequest calls cleanupWriteRequest, which ultimately drops the broken stream.
Adding some more prints, though, shows that this only tends to happen after the caller has had opportunity to call closeIdleConnections():
The string Forgetting ID was added here.
Raciness
Timings definitely seem to be important here - if the time.Sleep call in the repro is adjusted to wait 5 seconds, you'll sometimes see a string of timeouts followed by writes recovering.
That's not the only raciness though.
Stream closures/timeouts etc are handled by a select statement here
Sometimes this is triggered by cs.abort and sometimes by ctx.Done(). This appears to be as a result of an attempt to abort the stream when the context hits deadline - so there's an element of potluck in terms of which gets through (I've not dug into this in too much depth though).
TL;DR
CloseIdleConnections() relies on closeIfIdle(), which will close a connection only if there are no active streams on that connection.
However, when a timeout occurs, there may be a delay before the stream is closed (leaving it visible to closeIfIdle()).
By the time the stream closure occurs, the timeout has already been bubbled up to the caller, who may then choose to call CloseIdleConnections() knowing that the connection has failed.
Because the dead connection does not get closed by this call, it remains available for use by the next request.
The result is that requests will fail until the kernel itself gives up on the connection, or a RST or FIN makes it through from the remote end.
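The sequence can be replayed with a small, single-threaded toy model. The conn type and forgetStream are invented for illustration (only the closeIfIdle name comes from net/http2), and the real code is concurrent; the sketch simply fixes the ordering described above.

```go
package main

import "fmt"

// conn is a toy model of a pooled HTTP/2 connection; it is not the real
// http2.ClientConn.
type conn struct {
	streams map[uint32]bool // streams still registered on the connection
	closed  bool
}

// closeIfIdle closes the connection only when no streams are registered,
// mirroring the idle check described in the issue.
func (c *conn) closeIfIdle() {
	if len(c.streams) > 0 {
		return // not idle: silently do nothing
	}
	c.closed = true
}

// forgetStream models the late cleanup of a timed-out stream.
func (c *conn) forgetStream(id uint32) {
	delete(c.streams, id)
}

func main() {
	c := &conn{streams: map[uint32]bool{1: true}}

	// 1. The request on stream 1 times out and the error reaches the caller.
	// 2. The caller reacts by asking for idle connections to be closed,
	//    but the stream is still registered, so nothing happens.
	c.closeIfIdle()
	fmt.Println("closed after caller's call:", c.closed)

	// 3. Only now is the stream removed; nothing revisits the connection.
	c.forgetStream(1)
	fmt.Println("streams:", len(c.streams), "closed:", c.closed)
}
```

Both lines report `closed` as false: by the time the connection finally becomes idle, the caller's request to close idle connections has already come and gone.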