golang / go

The Go programming language
https://go.dev

net/http: HTTP/2 client sending RST_STREAM at more than 2500 transactions-per-second (TPS) #52845

Open · en-vee opened this issue 2 years ago

en-vee commented 2 years ago

What version of Go are you using (go version)?

$ go version
go version go1.18.1 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\AAA\AppData\Local\go-build
set GOENV=C:\Users\AAA\AppData\Roaming\go\env
set GOEXE=
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\AAA\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=linux
set GOPATH=C:\Users\AAA\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.18.1
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=0
set GOMOD=NUL
set GOWORK=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-fPIC -m64 -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=C:\Users\AAA~1\AppData\Local\Temp\go-build2512202176=/tmp/go-build -gno-record-gcc-switches

What did you do?

I am using a non-TLS HTTP/2 Go client program which sends more than 2500 POST requests per second to a Netty-based HTTP/2 server. (Note: my build environment is Windows, but I run the program on a 64-bit Linux machine.)

Each request is sent by a separate goroutine, which in turn calls the following ChfConnection.SendUpdate() method (itself invoked in another goroutine).

func (connection *ChfConnection) SendUpdate(sessionId, messageBody string) (string, error) {
    totalN40RequestsTriggered.WithLabelValues("Update").Add(1)
    url := fmt.Sprintf("%s%s", connection.baseUrl, connection.updateUri)
    url = strings.ReplaceAll(url, "{ChargingDataRef}", sessionId)
    alog.Trace("POST ChargingDataRequest-Update =>\n%s", messageBody)
    resp, err := connection.send(messageBody, url, POST)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    respBody, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }

    alog.Trace("ChargingDataResponse-Update =>\n%s", string(respBody))

    if resp.StatusCode != http.StatusOK {
        return string(respBody), errors.New("HTTP request failed")
    }
    return string(respBody), nil
}

func (connection *ChfConnection) send(messageBody string, url string, reqType string) (*http.Response, error) {
    r := strings.NewReader(messageBody)
    req, err := http.NewRequest(reqType, url, r)
    if err != nil {
        return nil, err
    }
    return connection.do(req)
}

func (connection *ChfConnection) do(req *http.Request) (resp *http.Response, err error) {
    req.Header.Add("user-agent", "go-http-client")
    req.Header.Add("content-type", httpContentTypeApplicationJson)
    resp, err = connection.httpClient.Do(req)
    defer func() {
        if resp != nil {
            processedChargingRequestsCounter.WithLabelValues(fmt.Sprintf("%d", resp.StatusCode), req.Method).Add(1)
        }

        if resp == nil {
            unknownFailures.Inc()
        }
    }()
    return
}

When the request rate is as low as about 2000 per second, I do not see my client sending RST_STREAM to the server.

However, when the rate reaches about 2500-3000 per second, I see the following in my program's log file (after enabling GODEBUG="http2debug=2"):

2022/05/10 21:59:12.827474 http2: Framer 0xc23f448000: wrote HEADERS flags=END_HEADERS stream=247467 len=65
2022/05/10 21:59:12.896018 http2: Framer 0xc23f448000: wrote DATA stream=247467 len=75 data="{\n\t\"invocationSequenceNumber\": 1,\n\t\"invocationTimeStamp\": \"2022-05-10T11:59"
2022/05/10 21:59:12.996612 RoundTrip failure: context deadline exceeded
2022/05/10 21:59:12.996692 http2: Framer 0xc23f448000: wrote RST_STREAM stream=247467 len=4 ErrCode=CANCEL

I am confused as to which context deadline is being exceeded here.

The timeout I have set on my HTTP client is 5 seconds, yet based on the log messages above, the time between the last DATA frame being written and the context deadline error is only about 100 ms.

What did you expect to see?

No such errors

What did you see instead?

As mentioned above, I see RoundTrip timeout errors. This causes the achieved message/request rate to be much lower (about 30% lower) than I expect.

en-vee commented 2 years ago

Also, note that although my build environment is Windows, I am cross-compiling to a Linux executable (GOOS=linux). The built application runs on a 64-bit Linux machine with the following configuration:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Stepping:              4
CPU MHz:               2999.994
BogoMIPS:              5999.98
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d
heschi commented 2 years ago

cc @neild

en-vee commented 2 years ago

@bradfitz do you have any input on this?

neild commented 2 years ago

> The timeout I have set on my HTTP client is 5 seconds, yet based on the log messages above, the time between the last DATA frame being written and the context deadline error is only about 100 ms.

A client timeout set with http.Client.Timeout sets a time limit for the full request. If it takes 4900ms to start writing the request (possibly because the write is blocked behind other requests being sent), then there's only another 100ms left for the response to be read.

The http.Transport.ResponseHeaderTimeout setting sets a time limit between the request being fully written and the response headers being read.
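
For illustration, the two limits can be configured separately like this (a minimal sketch with placeholder values, shown on a plain http.Transport):

client := &http.Client{
    // Budget for the whole exchange: queueing behind other requests,
    // writing the request, waiting for headers, and reading the body.
    Timeout: 5 * time.Second,
    Transport: &http.Transport{
        // Time allowed between finishing the request write and
        // receiving the response headers.
        ResponseHeaderTimeout: 2 * time.Second,
    },
}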

en-vee commented 2 years ago

Thanks @neild. Does this mean that a single connection to my endpoint is not enough to send out more than 2500-3000 requests per second? I can try with more than one connection to test this.

I am using the low-level HTTP/2 transport directly because I need to send non-TLS (h2c) HTTP/2 traffic, as shown below. How do I set ResponseHeaderTimeout for the http2.Transport?

if protocol == "http1" {
    httpTransport = &http.Transport{
        MaxConnsPerHost:     config.MaxConnectionsPerHost(),
        MaxIdleConns:        config.MaxIdleConnections(),
        MaxIdleConnsPerHost: config.MaxIdleConnectionsPerHost(),
        //IdleConnTimeout:     10 * time.Second,
    }
} else if protocol == "http2" {
    httpTransport = &http2.Transport{
        AllowHTTP: true,
        DialTLS: func(network, addr string, cfg *tls.Config) (net.Conn, error) {
            return net.Dial(network, addr)
        },
        StrictMaxConcurrentStreams: config.StrictMaxConcurrentStreams(),
    }
}

en-vee commented 2 years ago

@neild I tried with multiple connections to the server, round-robining/distributing all my requests over those connections (roughly like the sketch below), and this now works up to the point where the request rate per endpoint/connection reaches around 2500 per second. Is there any way I can identify why ~2500 is the limit on the machine my client runs on? Is the TCP send queue (Send-Q) filling up for some reason?
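
For reference, the fan-out looks roughly like this (a simplified sketch, not my real code; the type names, connection count, and URL are illustrative placeholders):

// One h2c client per underlying connection, handed out round-robin.
package main

import (
    "crypto/tls"
    "fmt"
    "net"
    "net/http"
    "strings"
    "sync/atomic"

    "golang.org/x/net/http2"
)

type roundRobin struct {
    clients []*http.Client
    next    uint64
}

func newRoundRobin(n int) *roundRobin {
    rr := &roundRobin{}
    for i := 0; i < n; i++ {
        rr.clients = append(rr.clients, &http.Client{
            Transport: &http2.Transport{
                AllowHTTP: true, // h2c: HTTP/2 over cleartext TCP
                DialTLS: func(network, addr string, _ *tls.Config) (net.Conn, error) {
                    return net.Dial(network, addr) // plain TCP despite the field name
                },
            },
        })
    }
    return rr
}

// pick returns the next client in round-robin order; each distinct
// http2.Transport keeps its own connection(s) to the server, so requests
// are spread across n connections.
func (rr *roundRobin) pick() *http.Client {
    i := atomic.AddUint64(&rr.next, 1)
    return rr.clients[i%uint64(len(rr.clients))]
}

func main() {
    rr := newRoundRobin(4) // e.g. 4 connections to the same endpoint
    body := strings.NewReader(`{"invocationSequenceNumber": 1}`)
    // Placeholder URL; the real target is the Netty HTTP/2 server.
    resp, err := rr.pick().Post("http://127.0.0.1:8080/update", "application/json", body)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.StatusCode)
}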

Or are there internal channels in the net/http package that fill up and cause other requests to block?

What is the maximum number of requests that the net/http transport can hold on its inbound side before further calls block?