golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/net/http2: client can't fully utilize network resource #37373

Open ochanism opened 4 years ago

ochanism commented 4 years ago

What version of Go are you using (go version)?

$ go version
go version go1.13.6 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/jayden/Library/Caches/go-build"
GOENV="/Users/jayden/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/jayden/Workspace/GoProject"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/jayden/Workspace/GoProject/src/golang.org/x/net/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/j1/94761d9d49j6gz5p55375sbr0000gn/T/go-build875938892=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I ran an http2 WebDAV server and built a client with golang x/net/http2 for a test. The client uploaded a 1 MB file repeatedly from a few hundred goroutines. MaxConcurrentStreams on the server was 100 and StrictMaxConcurrentStreams on the client was false. I measured the throughput between the server and the client on AWS, where the network bandwidth between them was 150 Gbps at the infrastructure level.

What did you expect to see?

I expected that the throughput would be around 150 Gbps.

What did you see instead?

The measured throughput was 90 Gbps.

Proposal

I figured out what caused this. I modified the http2 connection pool slightly, as shown in this commit: https://github.com/ochanism/net/commit/b2cc93d427fb3b5dc0bc42be8c8943d76f35eaf3. With that change, I could achieve 150 Gbps throughput. Achieving high throughput requires multiple TCP connections, but the current http2 connection pool adds a TCP connection only when no existing connection is available (CurrentMaxStreams == MaxConcurrentStreams for every connection). I reduced the MaxConcurrentStreams value on the server side, but the number of TCP connections didn't grow as much as I expected.

So I added a MinConcurrentConns field to the http2 Transport. MinConcurrentConns is the minimum number of TCP connections, and ClientConnPool tries to maintain at least that many. If it is 0, ClientConnPool creates a TCP connection only when needed, as before.

Please review my proposal.
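For readers following along, the proposed pool behavior can be sketched roughly as below. This is an illustrative model, not the actual x/net/http2 code or the linked patch; names like `pool`, `minConcurrentConns`, and `canTakeNewRequest` are stand-ins.

```go
package main

import "fmt"

// conn models an HTTP/2 client connection with a server-imposed
// stream limit (MaxConcurrentStreams).
type conn struct {
	streams, maxStreams int
}

func (c *conn) canTakeNewRequest() bool { return c.streams < c.maxStreams }

// pool models the proposed ClientConnPool behavior.
type pool struct {
	conns              []*conn
	minConcurrentConns int
}

// getConn dials a new connection until the pool holds at least
// minConcurrentConns, then reuses an existing connection with spare
// stream capacity. Today's pool only does the reuse-then-dial part,
// so it rarely grows past one connection under large transfers.
func (p *pool) getConn(dial func() *conn) *conn {
	if len(p.conns) < p.minConcurrentConns {
		c := dial()
		p.conns = append(p.conns, c)
		return c
	}
	for _, c := range p.conns {
		if c.canTakeNewRequest() {
			return c
		}
	}
	c := dial() // all connections saturated: dial one more
	p.conns = append(p.conns, c)
	return c
}

func main() {
	p := &pool{minConcurrentConns: 4}
	dial := func() *conn { return &conn{maxStreams: 100} }
	for i := 0; i < 4; i++ {
		p.getConn(dial).streams++
	}
	// 4 connections even though 4 streams would fit on one.
	fmt.Println(len(p.conns))
}
```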

fraenkel commented 4 years ago

Curious if you just tried to walk backwards through the connections?

ochanism commented 4 years ago

Curious if you just tried to walk backwards through the connections?

@fraenkel Could you elaborate on your question?

fraenkel commented 4 years ago

@ochanism Your fix was to randomly start somewhere in the list. The current code walks forward. Since connections are added to the end, in most cases, the next connection with availability will be the last one. Obviously it all depends on which streams in which connections are closed.

Ultimately the bigger issue is that one has no way of configuring your new option from net/http. If it is only for use from x/net/http2 directly, it's less of an issue.

It's also related to #34511

ochanism commented 4 years ago

@fraenkel Client in net/http doesn't have the issue I raised, since the number of pooled TCP connections can be adjusted via MaxIdleConns. However, there is no way to control the number of TCP connections in x/net/http2, so I've added MinConcurrentConns to control it. Moreover, walking the connections backward cannot resolve this issue: every request would select the last connection with high probability, and the remaining connections in the pool would expire due to IdleTimeout. So I've made the client select a connection at random within the MinConcurrentConns window to prevent that expiration.

fraenkel commented 4 years ago

Your selection will have similar issues in practice. Random selection won't optimize network traffic across multiple connections. The underlying locking is a bit loose, and writes currently block reads, which means that depending on how loaded a connection is (number of streams) and how much data is being sent back, the network may appear utilized while you are still dealing with a lot of contention.

Creating a new rand for each request also provides no guarantees about how the selection is distributed across the connections.

ochanism commented 4 years ago

@fraenkel Yes, you're right. Every random algorithm in the real world has that problem. The goal of my proposal is not to utilize network resources optimally. The main issue with the current http2 client is that it uses a small number of TCP connections even under heavy traffic (large payloads, not just frequent requests). I could make it optimal, but I don't want to make it complex. I think the random algorithm performs well for generic use cases; of course, as you mentioned, there could be issues in some edge cases.

cagedmantis commented 4 years ago

/cc @bradfitz @tombergan

ochanism commented 4 years ago

@cagedmantis @bradfitz @tombergan Please give me any feedback. I've been waiting for your review for 1 month.

bradfitz commented 4 years ago

It sounds like a knob we wouldn't want to expose to users.

How would you document how to tune it?

I'd rather just see it be automatically fast instead.

ochanism commented 4 years ago

It sounds like a knob we wouldn't want to expose to users.

How would you document how to tune it?

I'd rather just see it be automatically fast instead.

@bradfitz I agree with your opinion. It would be very nice if the value were adjusted automatically according to the current traffic pattern. But as you know, estimating and understanding traffic patterns is not an easy problem; many variables are involved.

I think the MinConcurrentConns field plays a similar role to the MaxIdleConns field in the HTTP/1.1 transport (not exactly the same, but similar). Normally, most users don't need to care about the MinConcurrentConns value; when MinConcurrentConns = 0, it works as before. But some users will hit the low-throughput problem like I did, and they will need to find an appropriate MinConcurrentConns value for their traffic pattern.

I've been running a server with this library in production, and there was a huge gain in throughput. With the vanilla net/http library, an HTTP client used only one or two TCP connections even under huge traffic. When I set StrictMaxConcurrentStreams to false, I expected the number of TCP connections to increase under burst traffic, but it didn't work as I expected.

What do you think about this? Do you have any idea?
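To make the analogy above concrete, here is roughly how the two transports' knobs compare today. The http.Transport and http2.Transport fields shown are real; the commented-out MinConcurrentConns is the proposed field and does not exist in the library. This is a config fragment for illustration only.

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	// HTTP/1.1: the connection pool size is directly tunable.
	tr1 := &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
	}

	// HTTP/2: x/net/http2 exposes no connection-count knob today.
	// StrictMaxConcurrentStreams only governs behavior once the server's
	// stream limit is reached; it does not grow the pool earlier.
	tr2 := &http2.Transport{
		StrictMaxConcurrentStreams: false,
		// MinConcurrentConns: 4, // proposed field, not in the library
	}
	fmt.Println(tr1.MaxIdleConns, tr2.StrictMaxConcurrentStreams)
}
```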

ncw commented 4 years ago

We've noticed this in rclone too https://github.com/rclone/rclone/issues/3631#issuecomment-685679459

When uploading 4 files at once to Google Drive

Without HTTP/2: 4 connections (per netstat), total speed 320 MB/s
With HTTP/2: 1 connection (per netstat), total speed 160 MB/s

So HTTP/2 runs at 50% of the speed. My conjecture is that it would go faster if we could persuade the HTTP/2 transport to use more TCP connections, but I don't know how to do that.

aojea commented 2 years ago

Without HTTP/2: 4 connections (per netstat), total speed 320 MB/s
With HTTP/2: 1 connection (per netstat), total speed 160 MB/s

I can reproduce similar values with the following snippet by changing the DisableKeepAlives option, which forces a new connection per request when set to true:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "net/http/httptest"
    "strings"
    "sync"
    "time"
)

func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    n := 1 << 20
    chunk := strings.Repeat("x", 1024)
    for i := 0; i < n; i++ {
        _, err := fmt.Fprint(w, chunk)
        if err != nil {
            log.Printf("Error writing %v", err)
        }
    }

    size := float64(n) / 1024
    duration := time.Since(start).Seconds()
    log.Printf("Server: %f MBps", size/duration)
}

func main() {
    sv := httptest.NewUnstartedServer(http.HandlerFunc(handler))
    sv.EnableHTTP2 = true
    sv.StartTLS()

    var wg sync.WaitGroup

    tr, ok := sv.Client().Transport.(*http.Transport)
    if !ok {
        panic("transport unsupported")
    }
    tr.DisableKeepAlives = false // toggle to true to force one connection per request

    workers := 4
    for i := 1; i <= workers; i++ {
        wg.Add(1)
        i := i
        go func() {
            defer wg.Done()
            start := time.Now()

            resp, err := sv.Client().Get(sv.URL + "/")
            if err != nil {
                log.Fatal(err)
            }
            defer resp.Body.Close()

            data, err := ioutil.ReadAll(resp.Body)
            if err != nil {
                log.Printf("Error reading %v", err)
            }

            size := float64(len(data)) / (1024 * 1024)
            duration := time.Since(start).Seconds()

            log.Printf("Client%d: %f MBps ", i, size/duration)
        }()
    }
    wg.Wait()

}
$ go run server.go   # with tr.DisableKeepAlives = true (one connection per request)
2021/11/26 11:19:01 Server: 96.358534 MBps
2021/11/26 11:19:01 Client4: 96.338120 MBps 
2021/11/26 11:19:02 Server: 92.666090 MBps
2021/11/26 11:19:02 Client2: 92.651739 MBps 
2021/11/26 11:19:02 Server: 88.723179 MBps
2021/11/26 11:19:02 Client1: 88.693800 MBps 
2021/11/26 11:19:03 Server: 86.221247 MBps
2021/11/26 11:19:03 Client3: 86.193994 MBps 
$ go run server.go   # with tr.DisableKeepAlives = false (single shared HTTP/2 connection)
2021/11/26 11:19:27 Server: 63.471122 MBps
2021/11/26 11:19:27 Client4: 63.462420 MBps 
2021/11/26 11:19:29 Server: 58.067915 MBps
2021/11/26 11:19:29 Client3: 58.061386 MBps 
2021/11/26 11:19:35 Server: 43.437425 MBps
2021/11/26 11:19:35 Client1: 43.433545 MBps 
2021/11/26 11:19:38 Server: 38.357418 MBps
2021/11/26 11:19:38 Client2: 38.355121 MBps 

Setting tr.DisableKeepAlives = true creates a separate connection per request; with it set to false, I get half the throughput.

Rootax commented 2 years ago

I hope this bug hasn't been forgotten; it's a pretty important one...