stokito opened this issue 2 years ago
Sounds like your `MaxConnsPerHost` is configured to a higher number than what the DSP allows? If keep-alive connections are being used properly, I wouldn't expect much time spent on connecting (which is expensive with TLS).
Maybe try adjusting this first? And set `MaxConnWaitTimeout` to something that then makes sense for you.
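A minimal sketch of that suggestion, assuming the DSP's real connection limit is known (the numbers here are illustrative, not recommendations):

```go
package main

import (
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	client := &fasthttp.Client{
		// Cap connections at what the DSP actually accepts
		// (512 is a hypothetical value), instead of 20480.
		MaxConnsPerHost: 512,
		// Keep-alive as in the original setup.
		MaxIdleConnDuration: time.Hour,
		// When all connections are busy, wait briefly for a free one
		// instead of failing and triggering yet another dial.
		MaxConnWaitTimeout: 50 * time.Millisecond,
	}
	_ = client
}
```

With `MaxConnWaitTimeout` set, requests that hit the connection cap queue for a free keep-alive connection rather than immediately erroring out and provoking new dials.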
> Sounds like your `MaxConnsPerHost` is configured to a higher number than what the DSP allows?
That can be the reason, right. The problem here is that `TCPDialer.tryDial()` will treat this as a network failure and try to establish a new connection, which will also fail. It does have a check for a timeout, but the CPU will still be wasted on many connection attempts. E.g. in my case I have a relatively good connection and the cause of most errors is load or limits. Maybe the `Concurrency` option can help here: if I set it to 5, for example, it looks like it will still try to reconnect, but not as fast.
I checked the sources of gRPC and it has a lot of throttles, limiters and jitters. In addition, gRPC uses HTTP/2 as a transport, which uses only one connection for many parallel requests.
It's difficult for me to get that logic. Ideally we need an easy-to-use dialer library with many options.
Anyway, I'm now going to try to write my own dialer to fix at least this one problem. If anyone wishes to participate, we can try to make the dialer together. I see some other things that may be good to make:
Have you tried using `fasthttp.TCPDialer` and setting https://pkg.go.dev/github.com/valyala/fasthttp#TCPDialer.Concurrency?
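For reference, a sketch of wiring `TCPDialer.Concurrency` into the client (the value 1000 is a hypothetical starting point):

```go
package main

import (
	"github.com/valyala/fasthttp"
)

func main() {
	// Limit how many Dial attempts may be in flight at once, so a burst
	// of connection errors can't fan out into thousands of new dials.
	dialer := &fasthttp.TCPDialer{Concurrency: 1000}

	client := &fasthttp.Client{
		// Route all dials through the bounded dialer.
		Dial: dialer.Dial,
	}
	_ = client
}
```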
I'm using fasthttp for an OpenRTB proxy that receives JSON requests from an SSP, changes the incoming JSON and forwards it to other DSPs. Under heavy load I see in the CPU profile that almost all CPU is spent on connecting to DSPs: `Dial`, then `Syscall6` (`Write`). Load average is multiple times higher than the CPU count. The HTTP client returns a lot of `net.OpError` errors. `MaxConnsPerHost` is 20480 and `MaxIdleConnDuration` is 1 hour, so keep-alive should work fine.

As far as I understood, this happens because: we have a load of 10000 QPS but each request is processed in 100ms, i.e. we need 1000 parallel connections to a DSP. But when we reach the connection limit of the DSP itself, our request fails with a `net.OpError`. The HTTP client then keeps trying to establish new connections, and this eats all the CPU.

I tried to implement a simple throttling that introduces a 400ms delay when a connection error occurs. It looks like:
Now the processed QPS has fallen at least twofold, but yes, no more load spikes. Is there anything better than this solution? Maybe I can reuse the `rate.Limiter` from the `golang.org/x/time/rate` package. I see that the `PipelineClient` does have some throttling, so maybe something similar can be added to the `HostClient`? We need something that works smarter and recovers from a heavy load.

Another question: what will happen if the connection and TLS handshake take longer than the `DoTimeout()` timeout? For example, the connection takes 200ms but the request timeout is 100ms. Then it looks like no connection will ever be established.
Do we have any article/documentation on configuring a server for a heavy load? Like increasing the allowed open files in systemd to `DefaultLimitNOFILE=524288`, etc. Can anyone recommend something to read?
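For reference, a sketch of where that setting lives, assuming the manager-wide default was meant (a single service would instead set `LimitNOFILE=` in its own unit):

```ini
# /etc/systemd/system.conf — raises the default for all units
[Manager]
DefaultLimitNOFILE=524288
```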