stokito opened this issue 2 years ago
Sounds like your `MaxConnsPerHost` is configured to a higher number than what the DSP allows? If keep-alive connections are being used properly, I wouldn't expect much time spent on connecting (which is expensive with TLS).
Maybe try adjusting this first? And set `MaxConnWaitTimeout` to something that then makes sense for you.
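A minimal sketch of that suggestion, assuming the DSP's real connection limit is known (the numbers here are illustrative, not recommendations):

```go
package main

import (
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	client := &fasthttp.Client{
		// Cap connections at what the DSP actually accepts
		// (512 is a hypothetical value), instead of 20480.
		MaxConnsPerHost: 512,
		// Keep-alive as in the original setup.
		MaxIdleConnDuration: time.Hour,
		// When all connections are busy, wait briefly for a free one
		// instead of failing and triggering yet another dial.
		MaxConnWaitTimeout: 50 * time.Millisecond,
	}
	_ = client
}
```

With `MaxConnWaitTimeout` set, requests that hit the connection cap queue for a free keep-alive connection rather than immediately erroring out and provoking new dials.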
> Sounds like your `MaxConnsPerHost` is configured to a higher number than what the DSP allows?
That can be the reason, right. The problem here is that `TCPDialer.tryDial()` will treat this as a network failure and try to establish a new connection, which will also fail. It does have a check for a timeout, but the CPU will still be wasted on many connection attempts. E.g. in my case I have a relatively good connection and the cause of most errors is load or limits. Maybe the `Concurrency` option can help here: if I set it to 5, for example, it looks like it will still try to reconnect, but not as fast.
I checked the sources of gRPC and it has a lot of throttles, limiters and jitters. In addition, gRPC uses HTTP/2 as a transport, which uses only one connection for many parallel requests.
It's difficult for me to get that logic. Ideally we need an easy-to-use dialer library with many options.
Anyway, I'm now going to try to write my own dialer to fix at least this one problem. If anyone wishes to participate, we can try to make the dialer together. I see some other things that may be good to make:
Have you tried using `fasthttp.TCPDialer` and setting https://pkg.go.dev/github.com/valyala/fasthttp#TCPDialer.Concurrency?
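For reference, a sketch of wiring `TCPDialer.Concurrency` into the client (the value 1000 is a hypothetical starting point):

```go
package main

import (
	"github.com/valyala/fasthttp"
)

func main() {
	// Limit how many Dial attempts may be in flight at once, so a burst
	// of connection errors can't fan out into thousands of new dials.
	dialer := &fasthttp.TCPDialer{Concurrency: 1000}

	client := &fasthttp.Client{
		// Route all dials through the bounded dialer.
		Dial: dialer.Dial,
	}
	_ = client
}
```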
I'm using fasthttp for an OpenRTB proxy that receives JSON requests from an SSP, changes the incoming JSON and forwards it to other DSPs. Under heavy load I see in the CPU profile that almost all CPU is spent on connecting to DSPs: `Dial`, then `Syscall6` (`Write`). Load average is multiple times higher than the CPU count. The HTTP client returns a lot of `net.OpError` errors. `MaxConnsPerHost` is 20480 and `MaxIdleConnDuration` is 1 hour, so keep-alive should work fine.

As far as I understood, this happens because: we have a load of 10000 QPS but each request is processed in 100ms, i.e. we need 1000 parallel connections to a DSP. But when we reach the connection limit of the DSP itself, our request fails with a `net.OpError`. The HTTP client then keeps trying to establish new connections, and this eats all the CPU.

I tried to implement a simple throttling that introduces a 400ms delay when a connection error occurs. It looks like:
Now the processed QPS has fallen at least twofold, but yes, no more load spikes. Is there anything better than this solution? Maybe I can reuse the `rate.Limiter` from the `golang.org/x/time/rate` package. I see that the `PipelineClient` does have some throttling, so maybe something similar can be added to the `HostClient`? We need something that works smarter and recovers from a heavy load.

Another question: what will happen if the connection and TLS handshake take longer than the `DoTimeout()` timeout? For example, the connection takes 200ms but the request timeout is 100ms. Then it looks like no connection will ever be established.
Do we have any article/documentation on configuring a server for a heavy load? Like increasing the allowed open files in systemd to `DefaultLimitNOFILE=524288`, etc. Can anyone recommend something to read?
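For reference, a sketch of where that setting lives, assuming the manager-wide default was meant (a single service would instead set `LimitNOFILE=` in its own unit):

```ini
# /etc/systemd/system.conf — raises the default for all units
[Manager]
DefaultLimitNOFILE=524288
```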