golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.38k stars 17.7k forks source link

net/http: tls handshake panic with custom dialer #58469

Open aathan opened 1 year ago

aathan commented 1 year ago

What version of Go are you using (go version)?

go version go1.20 linux/amd64

Does this issue reproduce with the latest release?

yes, 1.20.0

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/aathan/.cache/go-build"
GOENV="/home/aathan/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/aathan/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/aathan/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4181227200=/tmp/go-build -gno-record-gcc-switches"

What did you do?

var internalTransport *http.Transport = &http.Transport{
  DialContext: func(ctx context.Context, network, addr string) (ret net.Conn, err error) {
   return transportDialFunc(ctx, network, addr, myStuff)
  },
  TLSHandshakeTimeout: TLS_HANDSHAKE_TIMEOUT,
}

What did you expect to see?

No panics from inside golang stdlib.

What did you see instead?

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x5d073c]
goroutine 4133611 [running]:
crypto/tls.(*Conn).handshakeContext.func2()
  /usr/local/go/src/crypto/tls/conn.go:1441 +0xbc
created by crypto/tls.(*Conn).handshakeContext
  /usr/local/go/src/crypto/tls/conn.go:1437 +0x205
Main process exited, code=exited, status=2/INVALIDARGUMENT
Failed with result 'exit-code'.

Proposed fix:

In crypto/tls/conn.go around line 1441:

go func() {
select {
case <-handshakeCtx.Done():
// Close the connection, discarding the error
_ = c.conn.Close()
interruptRes <- handshakeCtx.Err()
case <-done:
interruptRes <- nil
}
}()

I believe the problem may be that this should read:

conn := c.conn
if conn!= nil {
  conn.Close()
}

I have not deeply understood the net.Conn or tls dialer internals, but, I believe the issue here stems from the possibility that the tls handshake timer may expire before the dialer is actually finished dialing. Whether this implies that there should also be a deeper/upstream fix is beyond me.

I've prepared all of my credentials so that I can submit a related PR to googlesource.com & gerrit etc. once I get some feedback.

I would appreciate any tips on how to work around this issue in the meantime. I'm not sure how to replicate or override a core golang package to patch in this fix??

rittneje commented 1 year ago

What is transportDialFunc? Can you provide the full code to reproduce the issue?

aathan commented 1 year ago

Unfortunately no, it is non-trivial for me to provide transportDialFunc, which is a custom dialer that orchestrates multiple tcp connections between various nodes. I theorized that this may be due to the dialer taking longer than the TLS handshake timeout, therefore it may be possible to create a failing test case by doing a simple dial after a time.Sleep() longer than the TLS handshake timeout.

Irrespective of the contents of transportDialFunc, it seems prudent to add the nil check to crypto/tls/conn.go where I indicated, no?

seankhliao commented 1 year ago

The relevant part of http code is below. TLS handshaking only starts after dialing has successfully completed. So if the conn is invalid, it would be a problem with your dialer code returning invalid values.

https://github.com/golang/go/blob/e03ee85ef434f307500a71927dfb3e876161847a/src/net/http/transport.go#L1621-L1634

rittneje commented 1 year ago

To add to what @seankhliao said, the only way to construct a tls.Conn is via tls.Client or tls.Server, both of which require you to pass the already established TCP connection. Thus the only way for it to panic where it did is if you misuse the API, so adding the nil check is superfluous. (The handshake would panic shortly thereafter upon attempting to write to the TCP connection anyway.)

Hence this seems like a bug in your transportDialFunc implementation.

aathan commented 1 year ago

TL;DR I may indeed have a race condition in my code (see 5 below), but there also seems to be a missing safety check in Transport.dial(), irrespective of whether additionally checking c.conn!=nil in tls/conn.go is considered superfluous.

Observations:

(1) The only reference I can find to TLSHandshakeTimeout is within (*persistConn) addTLS() which in turn is only used by (*Transport) dialConn()

(2) (*Transport) dial() does less error checking when using a DialContext. A return value of (nil,nil) allows a nil net.Conn to be passed back with no indicated error. Is the raw pass-through when using a DialContext intentional?

I might argue it would be safer to check for c==nil && err==nil irrespective of the dialer, particularly if a non-nil net.Conn is required by the rest of the code; but maybe you want the raw behavior. I don't know enough about the rest of framework to have a strong opinion.

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
    if t.DialContext != nil {
        return t.DialContext(ctx, network, addr)
    }
    if t.Dial != nil {
        c, err := t.Dial(network, addr)
        if c == nil && err == nil {
            err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
        }
        return c, err
    }
    return zeroDialer.DialContext(ctx, network, addr)
}

(3) That dial() implementation is only ever called from (* Transport) dialConn() in one place. In that place, a hypothetical return of (nil,nil) from dial, as just discussed, would result in pconn.conn==nil during the call to pconn.addTLS().

In addTLS() the pconn.conn is used as the conn member in tlsConn. Subsequently, that's the proximal cause of the nil dereference in (*Conn) handshakeContext() via tlsConn.HandshakeContext(), invoked from inside a goroutine

Since I can't find anywhere that explicitly sets tlsConn.conn after the initial construction, the nil (or non-nil bad) value must have come from dial().

(4) Note that this panic would require that the handshakeCtx is Done() possibly due to the incoming parent context being Done().

Note: The comments in (*Conn) handshakeContext() seem a little misleading, in that the goroutine there waits on <-done and <-handleCtx.Done(), and no guarantee exists as to which will be invoked by select if both are active.

Anyway, the parent context to handshakeCtx is the "top level" Request.ctx that started the entire chain. This is my code for that. d.TimeoutSec=10 and TLSHandshakeTimeout=5 seconds.

    ctx, cancel := context.WithTimeout(context.Background(), time.Second*time.Duration(d.TimeoutSec))
    defer cancel()
    req, err := http.NewRequestWithContext(ctx, "GET", urlURL.String(), nil)

Given these values either the handshakeCtx or parent context were canceled during the handshake in less than 5 seconds, or the system was so busy that it scheduled the handshakeCtx's goroutine prior to the time.AfterFunc() goroutine sometime after both expired. Dunno if that's possible with go's scheduler, but I thought it's worth mentioning here since it may evidence that things can happen in a rather weird ordering under load.

(5) Anyway, my dialer function looks like this. I guess there is a race condition in this code, right? Specifically, if the wg.Wait() wakes up transportDialFunc() first, and err is set to nil in the goroutine second, err might be returned as nil irrespective of ret's value.

func transportDialFunc(ctx context.Context, network, addr string, myStuff *MyStuff) (ret net.Conn, err error) {
    now := time.Now()
    if network != "tcp" {
        return nil, fmt.Errorf("unsupported network=%s", network)
    }

    // bunch of setup code...
    foo := something
    foo.streamHandler = func() {
        foo.wg.Done()
    }

    go func() {
        err = SetupStream(&foo)
    }()

    foo.wg.Wait()
    ret = foo.StreamConn

    if ret == nil && err == nil {
        err = fmt.Errorf("ret==nil")
    } else if ret != nil && err != nil {
        err = nil
    }
    if err != nil {
        //log
    } else {
        //log
    }
    return
}
rittneje commented 1 year ago

Where exactly is wg.Done() getting called? I see you are doing it in streamHandler, which is some sort of callback. Is that invoked at the end of SetupStream? Similarly, where exactly is wg.Add(1) getting called? Does SetupStream run forever, or do you expect it to finish before transportDialFunc returns?

The crux of my question is, why are you calling SetupStream in a goroutine, only to (I assume) wait for it to finish via the wait group immediately thereafter?

aathan commented 1 year ago

I'll answer you below, but I think the more relevant item for this issue is whether dial() should check the (nil,nil) case for a custom DialContext()? I would say "yes"


The code in 5 above is simplified, hiding implementation details of the app that are not immediately relevant here.

SetupStream() runs in a goroutine for all the usual reasons such as: Needing to run in a context where no locks are held, needing to perform tasks that could execute in parallel with those that run after readiness signalled by wg.Done(), etc. etc.

As you gleaned, that callback in fact does happen within SetupStream ... but really it could happen any time. The point is we don't have any guarantees as to when err= in the goroutine will execute relative to the code after the wg.Wait() ... including immediately after ret==nil && err==nil is checked ... although less likely, this execution order could happen even if wg.Done() executes after SetupStream() completes ... right?

PS: I recognize the logical issue here, of relying both on the synchronous return value of the function, and waiting on the waitgroup. Both are necessary, so I'll be modiying the code accordingly.

rittneje commented 1 year ago

My intent was to identity the root cause of the issue. The lack of a nil conn check is more of a secondary issue, as the code would still fail, just with a fairly generic error message rather than a panic.

With regards to SetupStream, I do not understand your code. Essentially you are saying it could continue to do work after signalling via wg.Done(). However, if this is the case, that means that there is no way for transportDialFunc to properly use its return value, as it has no idea when it will return. You should refactor your code to block on SetupStream finishing, at which point the goroutine is useless and you should just call it normally.

if err := SetupStream(&foo); err != nil {
    return nil, err
}

if foo.StreamConn == nil {
    return nil, errors.New(...)
}

return foo.StreamConn, nil
aathan commented 1 year ago

With regards to SetupStream,

I agree with you. I said what you said, differently.

the code would still fail

I think you're making the wrong assessment about the nil check because we are no longer talking about the same nil check.

The nil check I'm talking about now is not the same one I suggested when I opened the issue. Instead it's the one in 2 above. It would prevent this panic, and instead (correctly) behave as if the dialer failed to provide a net.Conn.

Explicitly, I'm talking about doing:

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
    if t.DialContext != nil {
        c,err := t.DialContext(ctx, network, addr)
        if c == nil && err == nil {
            err = errors.New("net/http: Transport.DialContext hook returned (nil, nil)")
        }
        return c, err
    }
    if t.Dial != nil {
        c, err := t.Dial(network, addr)
        if c == nil && err == nil {
            err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
        }
        return c, err
    }
    return zeroDialer.DialContext(ctx, network, addr)
}

instead of:

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
    if t.DialContext != nil {
        return t.DialContext(ctx, network, addr)
    }
    if t.Dial != nil {
        c, err := t.Dial(network, addr)
        if c == nil && err == nil {
            err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
        }
        return c, err
    }
    return zeroDialer.DialContext(ctx, network, addr)
}
rittneje commented 1 year ago

The nil check I'm talking about now is not the same one I suggested when I opened the issue. Instead it's the one in 2 above. It would prevent this panic, and instead (correctly) behave as if the dialer failed to provide a net.Conn.

Right, that's what I was referring to. You'd get an error back that says net/http: Transport.DialContext hook returned (nil, nil) instead of a panic. I agree that's better than a panic, and would be consistent with how it handles Transport.Dial today. My point was just that even if that change is made, you'd still have to fix transportDialFunc.

aathan commented 1 year ago

transportDialFunc is not in fact "totally" broken, because the nil net.Conn is the significant return value -- that said, as we've discussed, it may return (nil,nil) instead of (nil,!nil), but if it returns a nil net.Conn, it's a failure to dial, and the (nil,nil) should (as I think we agree) not cause a panic. I'll send in a PR. Just for completeness, another user seems to concur about this bug/improvement: https://groups.google.com/g/golang-nuts/c/qCn3BDUq74U