golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

net/http: Transport wastes TLS handshake progress #50984

Open rhysh opened 2 years ago

rhysh commented 2 years ago

What version of Go are you using (go version)?

$ go1.16 version
go version go1.16.13 darwin/arm64

$ go1.17 version
go version go1.17.6 darwin/arm64

$ go1.18 version
go version devel go1.18-41f485b9a7 Mon Jan 31 13:43:52 2022 +0000 darwin/arm64
# aka go1.18beta2

Does this issue reproduce with the latest release?

Yes, this is present in the latest stable release, go1.17.6, and the latest tagged beta. It is not present in go1.16.13. It was introduced by the work for #32406.

What operating system and processor architecture are you using (go env)?

go env Output

```
$ go1.17 env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/rhys/Library/Caches/go-build"
GOENV="/Users/rhys/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/rhys/go/pkg/mod"
GONOPROXY="*"
GONOSUMDB="*"
GOOS="darwin"
GOPATH="/Users/rhys/go"
GOPRIVATE="*"
GOPROXY="direct"
GOROOT="/Users/rhys/go/version/go1.17"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/rhys/go/version/go1.17/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/lf/n4ry0qv5639_3z0nhyhy3z7c0000gs/T/go-build3534659518=/tmp/go-build -gno-record-gcc-switches -fno-common"
```

What did you do?

An application I work with uses net/http to make outbound requests over HTTP/1.1+TLS. It uses Context values to control the lifecycle of those requests and to set timeouts. It cleans up the Context values (to avoid leaking timers, as recommended by go vet) when the outbound HTTPS call is complete.

Here's an overview of the most relevant code:

```
func callWithTimeout(ctx context.Context) {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()
    doRequest(ctx) // uses net/http package for HTTPS call several layers down
}
```

I've included a full reproducer at the bottom, also available at https://go.dev/play/p/_vg00gqqbpG, although the Playground's network stack messes slightly with the results.

What did you expect to see?

I expected that code pattern (use Context to set a timeout, clean up immediately after the call) to be a best practice. I expected it to lead to efficient use of compute resources. I expected it to provide good performance (not much unnecessary latency).

What did you see instead?

After upgrading the application from Go 1.16 to Go 1.17, the load balancer in front of the HTTPS service this application calls saw about a 4x increase in the volume of inbound TLS handshake requests, while reporting no change in the typical number of concurrent connections. A 4x increase in TLS handshake volume for this service is enough to throw off capacity planning.

The reproducer shows that canceling the Context value that was used for a now-complete request harms the connection pool. It shows that when a request finds an empty pool and so triggers creation of a new connection, the link between that request and the new connection outlives the request's need for that specific connection.

When another request to the same backend completes and makes its connection available in the Transport's pool, the request that triggered a new dial may use the now-idle connection instead. That allows the request to complete even before the connection it triggered is ready.
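
(As an aside, for anyone trying to confirm whether their own client is affected: the net/http/httptrace hooks below are real standard-library API, but the counting harness around them is only an illustrative sketch, not part of the reproducer. Pooled-connection reuse shows up as Reused: true in GotConn; a handshake count that grows much faster than the concurrent-connection count is the symptom described here.)

```
package httpstats

import (
	"crypto/tls"
	"net/http"
	"net/http/httptrace"
	"sync/atomic"
)

var reusedConns, tlsHandshakes int64

// doTraced issues req with httptrace hooks that count pooled-connection
// reuse and completed TLS handshakes, making the churn described above
// visible from application code.
func doTraced(client *http.Client, req *http.Request) (*http.Response, error) {
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			if info.Reused {
				atomic.AddInt64(&reusedConns, 1)
			}
		},
		TLSHandshakeDone: func(cs tls.ConnectionState, err error) {
			if err == nil {
				atomic.AddInt64(&tlsHandshakes, 1)
			}
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	return client.Do(req)
}
```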

All recent Go versions maintain that link while executing the TCP handshake, leading to some TCP connections being closed as soon as they become ready, wasting the work of the DNS lookup and TCP handshake. New in Go 1.17 is an additional link during the TLS handshake, allowing that more-expensive work to be discarded as well. From the reproducer:

    // New in Go 1.17 via https://golang.org/issue/32406, Go's HTTP client
    // connection pool now also discards useful progress on outbound TLS
    // handshakes.
    //
    // At t=0ms, the first request triggers a new connection
    // At t=0ms, the TCP handshake for the first connection is complete and the TLS handshake begins
    // At t=200ms, the TLS handshake completes and the first HTTP request begins
    // At t=250ms, the first HTTP request completes and the first connection enters the idle pool
    // At t=400ms, the second request removes the first connection from the idle pool
    // At t=420ms, the third request finds an empty pool and dials a second connection
    // At t=420ms, the second connection TLS handshake begins
    // At t=450ms, the second HTTP request completes and hands its connection to the pool
    // At t=450ms, the third request intercepts the first connection before it enters the pool
    // At t=500ms, the third HTTP request completes and returns to the application code
    // At t=500ms, the application code has the full HTTP response, so cancels its Context
    // At t=500ms, Go 1.17's call to tls.Conn.HandshakeContext aborts
    // At t=620ms, Go 1.16's call to tls.Conn.Handshake completes, and goes into the idle pool
    // At t=800ms, the fourth request removes the first connection from the idle pool
    // At t=800ms, the fifth request uses an idle connection (Go 1.16) or dials fresh (Go 1.17+)
    t.Run("minimal pool with slow TLS", varyCancel(
        2,                    // ideal pool size
        0,                    // delay in TCP
        200*time.Millisecond, // delay in TLS
        50*time.Millisecond,  // delay in HTTP
        []time.Duration{
            0,                      // t=0 ms     create connection 1
            400 * time.Millisecond, // t=400 ms   use warm connection 1 until t=450
            20 * time.Millisecond,  // t=420 ms   trigger new connection 2 (BUG: work may be discarded!)
            380 * time.Millisecond, // t=800 ms   observe pool size (use 1)
            0,                      // t=800 ms   observe pool size (use 2, or dial 3)
        }))

With Go 1.17+, the idle pool remains smaller than what the application needs in the steady state. More requests find an empty pool and need to wait for a connection to become available (a connection freed by another request generally wins the race against the freshly-triggered dial). The reproducer shows an increase in the average request duration, though I haven't measured that in a production setting. (TLS handshakes are expensive, not slow.)
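
For completeness, one mitigation that application code could try (a sketch only, not part of the reproducer: it assumes Go 1.21+ for context.WithoutCancel, and the 30-second cap is an arbitrary choice) is a DialTLSContext that detaches the handshake from per-request cancellation, so canceling a finished request can no longer throw away handshake progress:

```
package httpsetup

import (
	"context"
	"crypto/tls"
	"net"
	"net/http"
	"time"
)

// newTransport returns a Transport whose TLS handshakes are detached from
// per-request cancellation: canceling a request that already has a response
// no longer aborts (and discards) a handshake that is still in flight.
func newTransport() *http.Transport {
	return &http.Transport{
		DialTLSContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			raw, err := (&net.Dialer{}).DialContext(ctx, network, addr)
			if err != nil {
				return nil, err
			}
			host, _, err := net.SplitHostPort(addr)
			if err != nil {
				raw.Close()
				return nil, err
			}
			conn := tls.Client(raw, &tls.Config{ServerName: host})
			// Detach from the request's Context, but keep a hard cap so a
			// dead peer cannot hang the handshake forever (30s is arbitrary).
			hsCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 30*time.Second)
			defer cancel()
			if err := conn.HandshakeContext(hsCtx); err != nil {
				raw.Close()
				return nil, err
			}
			return conn, nil
		},
	}
}
```

Note that a non-nil DialTLSContext disables the Transport's automatic HTTP/2 setup unless ForceAttemptHTTP2 is also set, so this trade-off would need measurement before adopting it.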

Reproducer results

```
$ go1.16 test -v
=== RUN   TestHTTPDialCancel
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP
    dial_test.go:148: NOTE: expect failure here for all recent Go versions
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel
    dial_test.go:132: average duration 102.287083ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=1 http=5
    dial_test.go:132: average duration 113.971666ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 101.454508ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:132: average duration 101.3856ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 75.285054ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:132: average duration 76.206017ms
--- FAIL: TestHTTPDialCancel (7.69s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP (1.78s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel (0.92s)
    --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS (1.73s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel (0.87s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel (0.86s)
    --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS (4.17s)
        --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel (2.07s)
        --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel (2.10s)
FAIL
exit status 1
FAIL	code.justin.tv/rhys/nursery/issue/golang/go/50xxx	7.844s
```

```
$ go1.17 test -v
=== RUN   TestHTTPDialCancel
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP
    dial_test.go:148: NOTE: expect failure here for all recent Go versions
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel
    dial_test.go:132: average duration 103.162191ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=1 http=5
    dial_test.go:132: average duration 114.508925ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 102.52155ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:127: TLS handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=3 http=5
    dial_test.go:132: average duration 113.051833ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 77.059902ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 24 > 8
    dial_test.go:127: TLS handshake count; 24 > 8
    dial_test.go:130: timeout=20m0s tcp=24 tls=24 http=44
    dial_test.go:132: average duration 87.613746ms
--- FAIL: TestHTTPDialCancel (8.04s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP (1.81s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel (0.93s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TLS (1.89s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel (1.02s)
    --- FAIL: TestHTTPDialCancel/large_pool_with_slow_TLS (4.34s)
        --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel (2.09s)
        --- FAIL: TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel (2.25s)
FAIL
exit status 1
FAIL	code.justin.tv/rhys/nursery/issue/golang/go/50xxx	8.780s
```

```
$ go1.18 test -v
=== RUN   TestHTTPDialCancel
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP
    dial_test.go:148: NOTE: expect failure here for all recent Go versions
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel
    dial_test.go:132: average duration 105.759366ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=1 http=5
    dial_test.go:132: average duration 113.322008ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 101.212549ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:127: TLS handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=3 http=5
    dial_test.go:132: average duration 116.365683ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 76.463704ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 24 > 8
    dial_test.go:127: TLS handshake count; 24 > 8
    dial_test.go:130: timeout=20m0s tcp=24 tls=24 http=44
    dial_test.go:132: average duration 90.389791ms
--- FAIL: TestHTTPDialCancel (8.01s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP (1.79s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel (0.92s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TLS (1.88s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel (1.01s)
    --- FAIL: TestHTTPDialCancel/large_pool_with_slow_TLS (4.33s)
        --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel (2.08s)
        --- FAIL: TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel (2.25s)
FAIL
exit status 1
FAIL	code.justin.tv/rhys/nursery/issue/golang/go/50xxx	8.825s
```
Reproducer code, dial_test.go

```
// Copyright 2022 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func TestHTTPDialCancel(t *testing.T) {
	withServer := func(tcpDelay, tlsDelay, httpDelay time.Duration, fn func(srv *httptest.Server)) (int, int, int) {
		var (
			tcpDialCount      int32
			tlsHandshakeCount int32
			httpHandleCount   int32
		)

		srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			atomic.AddInt32(&httpHandleCount, 1)
			time.Sleep(httpDelay)
		}))
		srv.EnableHTTP2 = false
		srv.Config.ErrorLog = log.New(ioutil.Discard, "", 0)

		// Pause for a moment during the handshake so we can see what happens when
		// we cancel the Context of a completed HTTP Request.
		srv.TLS = &tls.Config{}
		srv.TLS.GetConfigForClient = func(chi *tls.ClientHelloInfo) (*tls.Config, error) {
			atomic.AddInt32(&tlsHandshakeCount, 1)
			time.Sleep(tlsDelay)
			return nil, nil
		}
		srv.StartTLS()
		defer srv.Close()

		// Before making any requests, add a delay to the TCP Dialer so we can see
		// what happens when we cancel the Context of a completed HTTP Request.
		srv.Client().Transport.(*http.Transport).DialContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
			atomic.AddInt32(&tcpDialCount, 1)
			time.Sleep(tcpDelay)
			return (&net.Dialer{}).DialContext(ctx, network, addr)
		}

		// Allow a large connection pool
		srv.Client().Transport.(*http.Transport).MaxIdleConnsPerHost = 100

		fn(srv)

		return int(atomic.LoadInt32(&tcpDialCount)), int(atomic.LoadInt32(&tlsHandshakeCount)), int(atomic.LoadInt32(&httpHandleCount))
	}

	doRequest := func(ctx context.Context, srv *httptest.Server, timeout time.Duration) error {
		if timeout > 0 {
			// BUG: canceling the context associated with an already-complete
			// HTTP request leads to an increase in TLS handshake count.
			sub, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()
			ctx = sub
		}

		req, err := http.NewRequestWithContext(ctx, "GET", srv.URL, nil)
		if err != nil {
			return fmt.Errorf("NewRequestWithContext: %w", err)
		}
		resp, err := srv.Client().Do(req)
		if err != nil {
			return fmt.Errorf("Do Request: %w", err)
		}
		defer resp.Body.Close()
		_, err = io.Copy(ioutil.Discard, resp.Body)
		if err != nil {
			return fmt.Errorf("Discard Body: %w", err)
		}
		return nil
	}

	callWithDelays := func(t *testing.T, tcpDelay, tlsDelay, httpDelay time.Duration, delays []time.Duration, timeout time.Duration) (int, int, int, time.Duration) {
		var total time.Duration
		tcp, tls, http := withServer(tcpDelay, tlsDelay, httpDelay, func(srv *httptest.Server) {
			ctx := context.Background()
			var wg sync.WaitGroup
			for _, delay := range delays {
				time.Sleep(delay)
				wg.Add(1)
				go func() {
					defer wg.Done()
					start := time.Now()
					err := doRequest(ctx, srv, timeout)
					if err != nil {
						t.Errorf("HTTP request failed: %v", err)
					}
					atomic.AddInt64((*int64)(&total), int64(time.Now().Sub(start)))
				}()
			}
			wg.Wait()
		})
		return tcp, tls, http, total / time.Duration(len(delays))
	}

	varyCancel := func(poolSize int, tcpDelay, tlsDelay, httpDelay time.Duration, delays []time.Duration) func(t *testing.T) {
		return func(t *testing.T) {
			fn := func(timeout time.Duration) func(t *testing.T) {
				return func(t *testing.T) {
					tcp, tls, http, avg := callWithDelays(t, tcpDelay, tlsDelay, httpDelay, delays, timeout)
					if tcp > poolSize {
						t.Errorf("TCP handshake count; %d > %d", tcp, poolSize)
					}
					if tls > poolSize {
						t.Errorf("TLS handshake count; %d > %d", tls, poolSize)
					}
					if t.Failed() {
						t.Logf("timeout=%s tcp=%d tls=%d http=%d", timeout, tcp, tls, http)
					}
					t.Logf("average duration %s", avg)
				}
			}
			// No timeout, so no context.WithCancel / WithTimeout call
			t.Run("no timeout/cancel", fn(0))
			// Huge timeout, key change is the presence of "defer cancel()" once
			// the outbound request is complete
			t.Run("with timeout/cancel", fn(20*time.Minute))
		}
	}

	// Go's HTTP client connection pool has discarded useful progress on
	// outbound TCP handshakes for several releases.
	//
	// NOTE: this test doesn't work well on the Playground.
	t.Run("minimal pool with slow TCP", func(t *testing.T) {
		t.Logf("NOTE: expect failure here for all recent Go versions")
		varyCancel(
			2,                    // ideal pool size
			200*time.Millisecond, // delay in TCP
			0,                    // delay in TLS
			50*time.Millisecond,  // delay in HTTP
			[]time.Duration{
				0,                      // t=0 ms     create connection 1
				400 * time.Millisecond, // t=400 ms   use warm connection 1 until t=450
				20 * time.Millisecond,  // t=420 ms   trigger new connection 2 (BUG: work may be discarded!)
				380 * time.Millisecond, // t=800 ms   observe pool size (use 1)
				0,                      // t=800 ms   observe pool size (use 2, or dial 3)
			})(t)
	})

	// New in Go 1.17 via https://golang.org/issue/32406, Go's HTTP client
	// connection pool now also discards useful progress on outbound TLS
	// handshakes.
	//
	// At t=0ms, the first request triggers a new connection
	// At t=0ms, the TCP handshake for the first connection is complete and the TLS handshake begins
	// At t=200ms, the TLS handshake completes and the first HTTP request begins
	// At t=250ms, the first HTTP request completes and the first connection enters the idle pool
	// At t=400ms, the second request removes the first connection from the idle pool
	// At t=420ms, the third request finds an empty pool and dials a second connection
	// At t=420ms, the second connection TLS handshake begins
	// At t=450ms, the second HTTP request completes and hands its connection to the pool
	// At t=450ms, the third request intercepts the first connection before it enters the pool
	// At t=500ms, the third HTTP request completes and returns to the application code
	// At t=500ms, the application code has the full HTTP response, so cancels its Context
	// At t=500ms, Go 1.17's call to tls.Conn.HandshakeContext aborts
	// At t=620ms, Go 1.16's call to tls.Conn.Handshake completes, and goes into the idle pool
	// At t=800ms, the fourth request removes the first connection from the idle pool
	// At t=800ms, the fifth request uses an idle connection (Go 1.16) or dials fresh (Go 1.17+)
	t.Run("minimal pool with slow TLS", varyCancel(
		2,                    // ideal pool size
		0,                    // delay in TCP
		200*time.Millisecond, // delay in TLS
		50*time.Millisecond,  // delay in HTTP
		[]time.Duration{
			0,                      // t=0 ms     create connection 1
			400 * time.Millisecond, // t=400 ms   use warm connection 1 until t=450
			20 * time.Millisecond,  // t=420 ms   trigger new connection 2 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=800 ms   observe pool size (use 1)
			0,                      // t=800 ms   observe pool size (use 2, or dial 3)
		}))

	// The impact of discarding useful progress on TLS handshakes is unbounded:
	// A client running Go 1.17 or newer, which creates a context for each
	// request which it cancels when the request is complete, may steadily churn
	// through new TLS connections. It can do this even when its maximum
	// outbound concurrency is below the MaxIdleConnsPerHost limit.
	t.Run("large pool with slow TLS", varyCancel(
		8,                    // ideal pool size
		0,                    // delay in TCP
		200*time.Millisecond, // delay in TLS
		50*time.Millisecond,  // delay in HTTP
		[]time.Duration{
			0,                      // t=0 ms     create connection 1
			0,                      // t=0 ms     create connection 2
			0,                      // t=0 ms     create connection 3
			0,                      // t=0 ms     create connection 4
			400 * time.Millisecond, // t=400 ms   use warm connection 1 until t=450
			0,                      // t=400 ms   use warm connection 2 until t=450
			0,                      // t=400 ms   use warm connection 3 until t=450
			0,                      // t=400 ms   use warm connection 4 until t=450
			20 * time.Millisecond,  // t=420 ms   trigger new connection 5 (BUG: work may be discarded!)
			0,                      // t=420 ms   trigger new connection 6 (BUG: work may be discarded!)
			0,                      // t=420 ms   trigger new connection 7 (BUG: work may be discarded!)
			0,                      // t=420 ms   trigger new connection 8 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=800 ms   use warm connection 1 until t=850
			0,                      // t=800 ms   use warm connection 2 until t=850
			0,                      // t=800 ms   use warm connection 3 until t=850
			0,                      // t=800 ms   use warm connection 4 until t=850
			20 * time.Millisecond,  // t=820 ms   use warm connection 5, or trigger new connection 9 (BUG: work may be discarded!)
			0,                      // t=820 ms   use warm connection 6, or trigger new connection 10 (BUG: work may be discarded!)
			0,                      // t=820 ms   use warm connection 7, or trigger new connection 11 (BUG: work may be discarded!)
			0,                      // t=820 ms   use warm connection 8, or trigger new connection 12 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=1200 ms  use warm connection 1 until t=1250
			0,                      // t=1200 ms  use warm connection 2 until t=1250
			0,                      // t=1200 ms  use warm connection 3 until t=1250
			0,                      // t=1200 ms  use warm connection 4 until t=1250
			20 * time.Millisecond,  // t=1220 ms  use warm connection 5, or trigger new connection 13 (BUG: work may be discarded!)
			0,                      // t=1220 ms  use warm connection 6, or trigger new connection 14 (BUG: work may be discarded!)
			0,                      // t=1220 ms  use warm connection 7, or trigger new connection 15 (BUG: work may be discarded!)
			0,                      // t=1220 ms  use warm connection 8, or trigger new connection 16 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=1600 ms  use warm connection 1 until t=1650
			0,                      // t=1600 ms  use warm connection 2 until t=1650
			0,                      // t=1600 ms  use warm connection 3 until t=1650
			0,                      // t=1600 ms  use warm connection 4 until t=1650
			20 * time.Millisecond,  // t=1620 ms  use warm connection 5, or trigger new connection 17 (BUG: work may be discarded!)
			0,                      // t=1620 ms  use warm connection 6, or trigger new connection 18 (BUG: work may be discarded!)
			0,                      // t=1620 ms  use warm connection 7, or trigger new connection 19 (BUG: work may be discarded!)
			0,                      // t=1620 ms  use warm connection 8, or trigger new connection 20 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=2020 ms  observe pool size (use 1)
			0,                      // t=2020 ms  observe pool size (use 2)
			0,                      // t=2020 ms  observe pool size (use 3)
			0,                      // t=2020 ms  observe pool size (use 4)
			0,                      // t=2020 ms  observe pool size (use 5, or dial 21)
			0,                      // t=2020 ms  observe pool size (use 6, or dial 22)
			0,                      // t=2020 ms  observe pool size (use 7, or dial 23)
			0,                      // t=2020 ms  observe pool size (use 8, or dial 24)
		}))
}
```
komuw commented 2 years ago

Sounds related to https://github.com/golang/go/issues/50657 somehow.

bcmills commented 2 years ago

CC @neild

bcmills commented 1 year ago

@duanmengkk, see https://go.dev/wiki/NoPlusOne.

krk commented 1 year ago

Would fixing this issue improve the latency of TLS handshakes?

mark4z commented 1 year ago

Go 1.19.3 has this problem.

kassiansun commented 1 year ago

I think we can't use a dedicated context for the TLS handshake (at least not with a simple code modification). If the user doesn't use the idle pool (DisableKeepAlives: true), that would make the TLS handshake (within queueForDial) uncancellable.

Maybe we can check the connection pool for an idle connection: if there are any idle connections, we can skip the dialing process; fasthttp does something similar. The caveat is that if the pool gets occupied after the check, the call has to wait longer for an idle connection (instead of trying to dial the remote in parallel).

Another solution is adding a DialDelay option, so users can choose to wait a bit longer and prioritize reusing an idle connection instead of aggressively dialing the remote host. This might make the implementation a bit more efficient under some load scenarios; see the sketch below.
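
A rough sketch of how that idea can be approximated from user code today (dialDelay is hypothetical, net/http has no such option, and the 10ms figure is arbitrary): postponing the start of each dial gives a connection that is about to be returned to the pool a head start in the Transport's dial-versus-idle race.

```
package httpsetup

import (
	"context"
	"net"
	"net/http"
	"time"
)

// dialDelay is a hypothetical knob: how long to hold off a new dial in the
// hope that an idle connection is returned to the pool first.
const dialDelay = 10 * time.Millisecond

func newDelayedDialTransport() *http.Transport {
	return &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			timer := time.NewTimer(dialDelay)
			defer timer.Stop()
			select {
			case <-timer.C:
				// No idle connection appeared in time; dial for real.
			case <-ctx.Done():
				// The context was canceled before the delay elapsed, so we
				// never paid for a TCP+TLS handshake that might be wasted.
				return nil, ctx.Err()
			}
			return (&net.Dialer{}).DialContext(ctx, network, addr)
		},
	}
}
```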

jaronoff97 commented 4 months ago

Any updates here? It seems like a few things that use this as a core dependency are experiencing intermittent TLS handshake errors (I am one of those people).

sunilkumar-nfer commented 3 months ago

Any update here?

mpoindexter commented 1 month ago

Is this the same thing as #59017, which was fixed in 1.23?

zhangguanzhang commented 1 week ago

Any update?