Closed arouene closed 2 years ago
2.4.9 was released yesterday and contains multiple fixes regarding connection teardown. You may want to give it a try.
Following your advice, we tried version 2.4.9, with no more luck. The problem is still there...
Do you know if it concerns h1 connections, h2 connections, or both? Client or server connections?
We restricted connections to the HTTP/2 protocol only, and then to the HTTP/1 protocols only. It is better with H1 connections, but still not perfect.
The huge number of FIN-WAIT-2 sockets we have is between HAProxy and its backends.
ss -tn | grep FIN-WAIT
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53466
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53783
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.14:61827
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.16:59372
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53585
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53457
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53643
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.14:61905
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53421
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53553
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.11:60750
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.11:60683
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.16:59299
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.15:62992
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.11:60672
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53764
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53587
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.11:60649
...
The backends are IIS 10 servers, and HAProxy uses H1 connections with them.
(The yellow curve is the %Ta time from the HAProxy logs)
To diagnose a little further, we extracted the %TR time from the logs (time to receive the full request, counted from the first byte). Surprisingly, that is where the time is spent (according to HAProxy): some requests take seconds to deliver the full request from the client... For now, I am trying to capture those requests with tcpdump.
Thanks, that could indeed be really helpful. It is surprising, because 2.4 is not so different from 2.3 on request parsing...
Ok, so I may have something, but I don't know how to interpret it.
Here is a request (a view from Datadog, which collects the logs from HAProxy). It says the total time of the request (Ta) is 2.7s, the client request (TR) takes 1.3s, and the backend responds in 1.4s.
Here is a trace of that request from tcpdump (172.16.3.71 is a virtual IP facing the clients, 172.16.3.73 is the server IP used for contacting the backends). The request happens in an already open TCP session, and we can find the 1.3s between the last ACK and the packet carrying the request.
Does that really mean that the request takes 1.3s to be sent? Could it be a problem in HAProxy's logging? I mean, the services don't feel slower when we use them through HAProxy 2.4.
The huge amount of FIN-WAIT-2 we have is between HAProxy and his backends.
ss -tn | grep FIN-WAIT
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53466
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53783
It seems those sockets come in from the frontend, not the backend. 172.16.4.71:443 is what you are binding haproxy to in the frontend.
Yes, you are right, but 172.16.4.10 is a private IP of one of our backends.
I don't know why the backends sometimes use the virtual IP, but now that you mention it, the FIN-WAIT-2 sockets are always on port 443 of the virtual IP.
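To check which local endpoint and which peers hold the FIN-WAIT-2 sockets, the `ss -tn` output can be aggregated instead of read line by line. The following is a minimal sketch, assuming the five-column layout shown in the output above (state, Recv-Q, Send-Q, local address, peer address); the function name `count_fin_wait2` is made up for illustration.

```python
from collections import Counter


def count_fin_wait2(ss_output: str) -> Counter:
    """Count FIN-WAIT-2 sockets per (local endpoint, peer IP)."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "FIN-WAIT-2":
            continue
        local, peer = fields[3], fields[4]
        # Drop the ephemeral peer port so sockets group by peer host.
        peer_ip = peer.rsplit(":", 1)[0]
        counts[(local, peer_ip)] += 1
    return counts


# Three lines taken from the ss output posted earlier in this thread.
sample = """\
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53466
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.10:53783
FIN-WAIT-2 0 0 172.16.4.71:443 172.16.4.14:61827
"""

for (local, peer_ip), n in count_fin_wait2(sample).most_common():
    print(local, peer_ip, n)
```

If every entry shows the frontend bind address as the local endpoint, the sockets belong to client-side connections, whatever the peer IP is.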
I got some information from my team: sometimes our backends go through HAProxy to contact other backends, so you are right, in this case they are the clients!
Here, the timing reported by HAProxy seems correct. The question is why it happens on 2.4.9 and not on 2.3.16. It may be just a log issue, but I'm unable to reproduce this behavior. I must check in the code.
Then, if you see a FIN-WAIT-2 on HAProxy, it means it is waiting for the connection close from the remote endpoint. HAProxy has closed the connection and this was acknowledged, but the peer has not yet closed the connection on its side. Here, the peer is the client application on your backends. This may come from recent changes to shutdowns, but 2.3.16 is very similar to 2.4.9 on this point.
As for the difference in the number of established connections, it may be interesting to compare the connection reuse rate between 2.3.16 and 2.4.9. It may explain the difference.
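One way to compare reuse rates is to read them from the statistics CSV (`show stat`) on each version. The sketch below is only an illustration: it assumes the CSV exposes cumulative `connect` and `reuse` counters on backend lines, which should be checked against the management guide for the versions in use, and it defines the reuse rate as reuse / (connect + reuse). The sample CSV is made up and reduced to four columns.

```python
import csv
import io


def reuse_rates(show_stat_csv: str) -> dict:
    """Return {backend name: reuse / (connect + reuse)} from stats CSV text."""
    # The header line of the stats CSV starts with "# "; strip that prefix.
    text = show_stat_csv.lstrip("# ")
    rates = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row.get("svname") != "BACKEND":
            continue
        # Assumed column names; empty cells are treated as zero.
        connect = int(row.get("connect") or 0)
        reuse = int(row.get("reuse") or 0)
        total = connect + reuse
        if total:
            rates[row["pxname"]] = reuse / total
    return rates


# Hypothetical sample, not taken from the reporter's setup.
sample_csv = (
    "# pxname,svname,connect,reuse\n"
    "web,FRONTEND,,\n"
    "web_back,BACKEND,250,750\n"
)

print(reuse_rates(sample_csv))
```

Running this against the output of both versions under comparable traffic would show whether 2.4 reuses fewer server connections, which could account for fewer established connections and more TIME-WAIT sockets.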
Honestly, I have no explanation about the timing difference you observed between the 2.4.9 and the 2.3.16.
Thanks anyway for looking into it! I have no explanation either... I will keep one or two HAProxy 2.4 instances and stay on 2.3 for now. If I have more information I will update this issue, but honestly I don't know what to look for or where to look...
If you have any questions, don't hesitate.
Ok, I can explain the difference in %TR for HTTP/2 clients. On 2.3 and lower, it is always 0, unless your configuration has an option to wait for the request payload. On these versions, the time to receive and parse the request headers is included in the idle time (%Ti). Since 2.4, for HTTP/2 streams, the idle time is never set because there is no easy way to measure it per stream. %TR is thus the time to receive the request, counted from the moment the stream is created.
I have no explanation for HTTP/1 clients, but I don't know if there is a difference at all. After checking your graphs: the first one, about %Ta, mixes HTTP/2 and HTTP/1 clients, so the difference may be explained by the changes to HTTP/2 only. On the last graph, you compare HTTP/1 and HTTP/2 on 2.4 only. I suspect there is no difference in HTTP/1 between 2.3 and 2.4.
To be sure, you may graph the idle time (%Ti) and the request time (%TR) for HTTP/2 clients. The sum of both should be more or less the same between 2.3 and 2.4. For HTTP/1 clients, the graph of the request time (%TR) is enough to verify whether there is any difference between versions.
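The comparison suggested above can be sketched as follows, assuming the %Ti and %TR values (in milliseconds) have already been extracted from the logs into records tagged with the HAProxy version and the client protocol. The record layout, the function name, and the sample values are hypothetical; treating an unset (negative) timer as zero is also an assumption.

```python
from statistics import mean


def mean_idle_plus_request(records, version, protocol):
    """Mean of %Ti + %TR in ms for one HAProxy version and client protocol."""
    totals = [
        # A negative timer means "not set" here and is counted as zero.
        max(r["Ti"], 0) + max(r["TR"], 0)
        for r in records
        if r["version"] == version and r["protocol"] == protocol
    ]
    return mean(totals) if totals else None


# Made-up records for illustration only.
records = [
    {"version": "2.3", "protocol": "h2", "Ti": 1200, "TR": 0},
    {"version": "2.3", "protocol": "h2", "Ti": 800, "TR": 0},
    {"version": "2.4", "protocol": "h2", "Ti": -1, "TR": 1300},
    {"version": "2.4", "protocol": "h2", "Ti": 0, "TR": 700},
]

for version in ("2.3", "2.4"):
    print(version, mean_idle_plus_request(records, version, "h2"))
```

If the two means are close, the time has only moved from %Ti to %TR between versions, which matches the explanation above rather than an actual slowdown.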
I'm closing this issue with the "works as designed" label because it seems there is no bug here. Feel free to reopen it if I'm wrong.
Detailed Description of the Problem
We recently migrated from HAProxy 2.3.16 to HAProxy 2.4.8 on our infrastructure.
We immediately noticed a degradation in response latency, with HAProxy adding up to 30 seconds of latency.
(The yellow curve is the %Ta time from the HAProxy logs)
This degradation comes with an increase of TCP TimeWait.
(On the left side is HAProxy 2.3.16, on the right side of the graph is HAProxy 2.4.8)
And also a decrease of TCP Established connections.
(On the left side is HAProxy 2.3.16, on the right side of the graph is HAProxy 2.4.8)
We tested versions 2.0.13, 2.3.1 and 2.3.16 without any problem. Without any modification to the config file or the OS, we then tested versions 2.4.0 and 2.4.8, which both showed the problem described above.
Expected Behavior
No more latency than HAProxy 2.3 branch
Steps to Reproduce the Behavior
Migration steps from 2.3 to 2.4
Do you have any idea what may have caused this?
The changelog of 2.4 shows that the handling of TIME-WAIT sockets has been reworked and improved. Maybe this is a side effect, as we noticed an increase of TIME-WAIT sockets on our HAProxy 2.4.
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of haproxy -vv
Last Outputs and Backtraces
No response
Additional Information
OS: Centos8-Stream
Systemd service: