Open aathan opened 1 year ago
What is transportDialFunc
? Can you provide the full code to reproduce the issue?
Unfortunately no, it is non-trivial for me to provide transportDialFunc, which is a custom dialer that orchestrates multiple tcp connections between various nodes. I theorized that this may be due to the dialer taking longer than the TLS handshake timeout, therefore it may be possible to create a failing test case by doing a simple dial after a time.Sleep() longer than the TLS handshake timeout.
Irrespective of the contents of transportDialFunc, it seems prudent to add the nil
check to crypto/tls/conn.go where I indicated, no?
The relevant part of http code is below. TLS handshaking only starts after dialing has successfully completed. So if the conn is invalid, it would be a problem with your dialer code returning invalid values.
To add to what @seankhliao said, the only way to construct a tls.Conn
is via tls.Client
or tls.Server
, both of which require you to pass the already established TCP connection. Thus the only way for it to panic where it did is if you misuse the API, so adding the nil
check is superfluous. (The handshake would panic shortly thereafter upon attempting to write to the TCP connection anyway.)
Hence this seems like a bug in your transportDialFunc
implementation.
TL;DR
I may indeed have a race condition in my code (see 5 below), but there also seems to be a missing safety check in Transport.dial(), irrespective of whether additionally checking c.conn!=nil
in tls/conn.go is considered superfluous.
Observations:
(1) The only reference I can find to TLSHandshakeTimeout
is within (*persistConn) addTLS()
which in turn is only used by (*Transport) dialConn()
(2) (*Transport) dial()
does less error checking when using a DialContext. A return value of (nil,nil)
allows a nil net.Conn to be passed back with no indicated error. Is the raw pass-through when using a DialContext intentional?
I might argue it would be safer to check for c==nil && err==nil
irrespective of the dialer, particularly if a non-nil net.Conn is required by the rest of the code; but maybe you want the raw behavior. I don't know enough about the rest of framework to have a strong opinion.
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
if t.DialContext != nil {
return t.DialContext(ctx, network, addr)
}
if t.Dial != nil {
c, err := t.Dial(network, addr)
if c == nil && err == nil {
err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
}
return c, err
}
return zeroDialer.DialContext(ctx, network, addr)
}
(3) That dial() implementation is only ever called from (* Transport) dialConn()
in one place. In that place, a hypothetical return of (nil,nil)
from dial, as just discussed, would result in pconn.conn==nil
during the call to pconn.addTLS().
In addTLS() the pconn.conn is used as the conn member in tlsConn. Subsequently, that's the proximal cause of the nil dereference in (*Conn) handshakeContext()
via tlsConn.HandshakeContext()
, invoked from inside a goroutine
Since I can't find anywhere that explicitly sets tlsConn.conn after the initial construction, the nil (or non-nil bad) value must have come from dial().
(4) Note that this panic would require that the handshakeCtx is Done() possibly due to the incoming parent context being Done().
Note: The comments in (*Conn) handshakeContext()
seem a little misleading, in that the goroutine there waits on <-done
and <-handleCtx.Done()
, and no guarantee exists as to which will be invoked by select if both are active.
Anyway, the parent context to handshakeCtx is the "top level" Request.ctx
that started the entire chain. This is my code for that. d.TimeoutSec=10 and TLSHandshakeTimeout=5 seconds.
ctx, cancel := context.WithTimeout(context.Background(), time.Second*time.Duration(d.TimeoutSec))
defer cancel()
req, err := http.NewRequestWithContext(ctx, "GET", urlURL.String(), nil)
Given these values either the handshakeCtx or parent context were canceled during the handshake in less than 5 seconds, or the system was so busy that it scheduled the handshakeCtx's goroutine prior to the time.AfterFunc()
goroutine sometime after both expired. Dunno if that's possible with go's scheduler, but I thought it's worth mentioning here since it may evidence that things can happen in a rather weird ordering under load.
(5) Anyway, my dialer function looks like this. I guess there is a race condition in this code, right? Specifically, if the wg.Wait() wakes up transportDialFunc() first, and err is set to nil in the goroutine second, err might be returned as nil irrespective of ret's value.
func transportDialFunc(ctx context.Context, network, addr string, myStuff *MyStuff) (ret net.Conn, err error) {
now := time.Now()
if network != "tcp" {
return nil, fmt.Errorf("unsupported network=%s", network)
}
// bunch of setup code...
foo := something
foo.streamHandler = func() {
foo.wg.Done()
}
go func() {
err = SetupStream(&foo)
}()
foo.wg.Wait()
ret = foo.StreamConn
if ret == nil && err == nil {
err = fmt.Errorf("ret==nil")
} else if ret != nil && err != nil {
err = nil
}
if err != nil {
//log
} else {
//log
}
return
}
Where exactly is wg.Done()
getting called? I see you are doing it in streamHandler
, which is some sort of callback. Is that invoked at the end of SetupStream
? Similarly, where exactly is wg.Add(1)
getting called? Does SetupStream
run forever, or do you expect it to finish before transportDialFunc
returns?
The crux of my question is, why are you calling SetupStream
in a goroutine, only to (I assume) wait for it to finish via the wait group immediately thereafter?
I'll answer you below, but I think the more relevant item for this issue is whether dial() should check the (nil,nil) case for a custom DialContext()? I would say "yes"
The code in 5 above is simplified, hiding implementation details of the app that are not immediately relevant here.
SetupStream() runs in a goroutine for all the usual reasons such as: Needing to run in a context where no locks are held, needing to perform tasks that could execute in parallel with those that run after readiness signalled by wg.Done(), etc. etc.
As you gleaned, that callback in fact does happen within SetupStream ... but really it could happen any time. The point is we don't have any guarantees as to when err=
in the goroutine will execute relative to the code after the wg.Wait() ... including immediately after ret==nil && err==nil
is checked ... although less likely, this execution order could happen even if wg.Done() executes after SetupStream() completes ... right?
PS: I recognize the logical issue here, of relying both on the synchronous return value of the function, and waiting on the waitgroup. Both are necessary, so I'll be modiying the code accordingly.
My intent was to identity the root cause of the issue. The lack of a nil
conn check is more of a secondary issue, as the code would still fail, just with a fairly generic error message rather than a panic.
With regards to SetupStream
, I do not understand your code. Essentially you are saying it could continue to do work after signalling via wg.Done()
. However, if this is the case, that means that there is no way for transportDialFunc
to properly use its return value, as it has no idea when it will return. You should refactor your code to block on SetupStream
finishing, at which point the goroutine is useless and you should just call it normally.
if err := SetupStream(&foo); err != nil {
return nil, err
}
if foo.StreamConn == nil {
return nil, errors.New(...)
}
return foo.StreamConn, nil
With regards to SetupStream,
I agree with you. I said what you said, differently.
the code would still fail
I think you're making the wrong assessment about the nil check because we are no longer talking about the same nil check.
The nil check I'm talking about now is not the same one I suggested when I opened the issue. Instead it's the one in 2 above. It would prevent this panic, and instead (correctly) behave as if the dialer failed to provide a net.Conn.
Explicitly, I'm talking about doing:
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
if t.DialContext != nil {
c,err := t.DialContext(ctx, network, addr)
if c == nil && err == nil {
err = errors.New("net/http: Transport.DialContext hook returned (nil, nil)")
}
return c, err
}
if t.Dial != nil {
c, err := t.Dial(network, addr)
if c == nil && err == nil {
err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
}
return c, err
}
return zeroDialer.DialContext(ctx, network, addr)
}
instead of:
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
if t.DialContext != nil {
return t.DialContext(ctx, network, addr)
}
if t.Dial != nil {
c, err := t.Dial(network, addr)
if c == nil && err == nil {
err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
}
return c, err
}
return zeroDialer.DialContext(ctx, network, addr)
}
The nil check I'm talking about now is not the same one I suggested when I opened the issue. Instead it's the one in 2 above. It would prevent this panic, and instead (correctly) behave as if the dialer failed to provide a net.Conn.
Right, that's what I was referring to. You'd get an error back that says net/http: Transport.DialContext hook returned (nil, nil)
instead of a panic. I agree that's better than a panic, and would be consistent with how it handles Transport.Dial
today. My point was just that even if that change is made, you'd still have to fix transportDialFunc
.
transportDialFunc is not in fact "totally" broken, because the nil net.Conn is the significant return value -- that said, as we've discussed, it may return (nil,nil) instead of (nil,!nil), but if it returns a nil net.Conn, it's a failure to dial, and the (nil,nil) should (as I think we agree) not cause a panic. I'll send in a PR. Just for completeness, another user seems to concur about this bug/improvement: https://groups.google.com/g/golang-nuts/c/qCn3BDUq74U
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
yes, 1.20.0
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
What did you expect to see?
No panics from inside golang stdlib.
What did you see instead?
Proposed fix:
In crypto/tls/conn.go around line 1441:
I believe the problem may be that this should read:
I have not deeply understood the net.Conn or tls dialer internals, but, I believe the issue here stems from the possibility that the tls handshake timer may expire before the dialer is actually finished dialing. Whether this implies that there should also be a deeper/upstream fix is beyond me.
I've prepared all of my credentials so that I can submit a related PR to googlesource.com & gerrit etc. once I get some feedback.
I would appreciate any tips on how to work around this issue in the meantime. I'm not sure how to replicate or override a core golang package to patch in this fix??