SpoonOne opened this issue 10 months ago
Have you tried updating to the latest 2.4? There is a fix for an HTTP header issue. See http://www.haproxy.org/bugs/bugs-2.4.22.html
Sorry, which specific bug are you referencing? I checked through all of them and didn't see any that seemed relevant to this issue.
The SH flag is documented like this in https://docs.haproxy.org/2.4/configuration.html:
```
SH   The server aborted before sending its full HTTP response headers, or
     it crashed while processing the request. Since a server aborting at
     this moment is very rare, it would be wise to inspect its logs to
     control whether it crashed and why. The logged request may indicate a
     small set of faulty requests, demonstrating bugs in the application.
     Sometimes this might also be caused by an IDS killing the connection
     between HAProxy and the server.
```
I refer to [BUG/MAJOR: http-ana: Get a fresh trash buffer for each header value replacement](http://git.haproxy.org/?p=haproxy-2.4.git;a=commitdiff;h=90f7847), but it's just a wild guess; it could be any of the header fixes in 2.4.24.
Since the Wireshark output doesn't show whether a C-L (Content-Length) header is involved, it could also be one of the C-L header fixes. In any case, the best course is to try the latest released version and see if the issue still exists.
Any chance you could update to the latest 2.8?
Thank you for the bug reference - it's an interesting idea. Do these header errors cause a 502 to be returned before the request is sent to the backend, or after it has been sent, as in my example?
The `X-Correlation-ID` header is already present for all of these types of requests to this particular haproxy, so it's not having to inject it, just capture it for the log. There is no `Content-Length` header (assuming that's what you mean by C-L) for this particular example.
Upgrading will be tricky as we use Ubuntu LTS servers and this version is what comes with 22.04 at the moment.
Actually, because Ubuntu patches security issues, the empty `Content-Length` header issue has already been patched in the version we're running.
If I could verify that this is actually fixed in a later release (we were seeing this on version 2.0 also) then I could make the case internally to run a non-LTS release version.
I'm trying the PPA version of 2.8 in our stage environment and will see if there are any recurrences.
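For anyone following along, installing a newer haproxy on Ubuntu via a PPA looks roughly like this. A sketch only: the PPA name follows the usual `ppa:vbernat/haproxy-X.Y` naming scheme, so verify it on Launchpad before relying on it:

```shell
# Assumed PPA naming scheme; check Launchpad before using in production.
sudo add-apt-repository -y ppa:vbernat/haproxy-2.8
sudo apt-get update
sudo apt-get install -y haproxy
haproxy -v   # confirm the running version afterwards
```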
Hello,
It could be possible that your server responds (and closes) too early on a POST (before the end of the POST). The problem is then purely TCP-specific: subsequent packets in flight after the close will trigger a TCP RST on the server which, when it comes back, will destroy data in flight and can result in a truncated or empty response. The fact that you say it happens rarely could indicate that it only happens when data remain in the server's TCP buffers and are destroyed there, before even being sent over the wire. A network capture between haproxy and the server would definitely help.
What makes me think of this is that responses in close mode without a content-length generally come from simple servers that just close at the end (i.e. plain old CGIs) without taking care of the underlying TCP protocol.
If this is what happens, the only solution will be to make sure the server closes after having drained the request. We know that's not always easy, and maybe you have no way to act on the server's code, I don't know.
Hi Willy,
There's a screenshot of a packet capture taken on the haproxy server in the first post. It's showing both the client and server packets. The requests are GETs to a CGI (kannel in this case) and you can see in the capture that haproxy is sending back a 502 to the client after sending the GET to the server but before the server even responds with any packet. The server's response then comes in later but it's too late at that point.
I updated our stage environment to 2.8.4 via the PPA last week, and over the weekend I still saw a recurrence of this issue. I don't have a packet capture for this, but I might set up a rolling one to try to catch it the next time it happens.
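A rolling capture can be set up with tcpdump's file-rotation options, so it can run until the next 502 without filling the disk. A sketch; the interface, port, file sizes, and path are placeholders to adapt:

```shell
# Rotate through 10 files of 100 MB each, overwriting the oldest, so the
# capture can run indefinitely until the 502 shows up.
# eth0 and port 80 are placeholders for the backend-facing traffic.
sudo tcpdump -i eth0 -s 0 -C 100 -W 10 -w /var/tmp/haproxy-502.pcap port 80
```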
Ah indeed, that's very strange, and the fact that the server responds indicates the request was properly sent over the wire. I'm a bit shocked to see a FIN being emitted to the server, and was wondering if the incoming request had one as well. Unfortunately, Wireshark doesn't display flags when it shows contents, so there is room for doubt. HAProxy may have encountered an error in the lower layers (a syscall, or even an internal error) that is reported like this, but I don't see why the request would have been properly delivered in that case. If you manage to reproduce it (though at 1/100k I doubt it), an strace output could be useful, but I understand that will be quite difficult.
Now that you are using 2.8, you can enable H1 traces at error level. Add this snippet after your global section:
```
ring buf1
    size 104857600 # 100MB
    format timed
    backing-file /tmp/blah

global
    expose-experimental-directives
    trace h1 sink buf1
    trace h1 level error
    trace h1 verbosity complete
    trace h1 start now
```
This will write the H1 error traces to the /tmp/blah file. You can show the traces by running:

```
strings /tmp/blah | less
```
If that does not work, you can use the haring tool from the HAProxy sources. To do so, compile it (`make dev/haring/haring`), then run:

```
./dev/haring/haring -f /tmp/blah | less
```
You can also check your backend stats for the `srv_abrt` (server aborts) and `eresp` (invalid response) counters. The 502 error should increment one of them.
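One way to read those counters is to dump the stats CSV from the admin socket and pick the columns out by header name. A sketch: the socket path is an assumption (it depends on your `stats socket` setting), and the two-line CSV below is fabricated purely to show the shape of the output.

```shell
# On a live haproxy, you would dump the real CSV from the admin socket:
#   echo "show stat" | socat stdio /var/run/haproxy.sock > stats.csv
# Fabricated sample with the same shape, for illustration:
cat > stats.csv <<'EOF'
# pxname,svname,ereq,econ,eresp,srv_abrt
default,backend1,0,0,3,1
default,BACKEND,0,0,3,1
EOF
# Locate the eresp and srv_abrt columns by header name, print them per row:
result=$(awk -F, '
  NR==1 { for (i=1;i<=NF;i++) { if ($i=="eresp") e=i; if ($i=="srv_abrt") s=i }; next }
  { print $1 "/" $2 ": eresp=" $e ", srv_abrt=" $s }' stats.csv)
echo "$result"
```

Matching columns by name rather than position avoids breakage when haproxy adds fields to the CSV between versions.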
Thanks Christopher, I have enabled the above settings and will wait for the next 502 to see if it shows something useful.
I did manage to verify in a packet capture that 2.8 is behaving the same, in that the 502 is returned to the client before the server responds.
Here's an example similar to the one above.
The captured error trace at that time was:
```
<0>2023-12-10T11:39:05.732583+00:00 [01|h1|0|mux_h1.c:1927] message aborted, set error on SC : [B,RUN] [MSG_DONE, MSG_RPBEFORE] - req=(.fl=0x00001511 .curr_len=0 .body_len=0) res=(.fl=0x00001404 .curr_len=0 .body_len=0) - h1c=0x7f3b7a8af280(0x80000100) conn=0x7f3b7a8adf00(0x00040300) h1s=0x7f3b7a8ad840(0x00004010) sd=0x7f3b7a813280(0x05018001) sc=0x7f3b7a8a8d80(0x00001411)
```
Here's a slightly different 502 where haproxy returns the 502 BEFORE sending the request to the server.
The corresponding error is:
```
<0>2023-12-10T19:44:05.767365+00:00 [01|h1|0|mux_h1.c:3781] reporting error to the app-layer stream : [B,RUN] [MSG_DONE, MSG_RPBEFORE] - req=(.fl=0x00001511 .curr_len=0 .body_len=0) res=(.fl=0x00001404 .curr_len=0 .body_len=0) - h1c=0x7f3b6b176000(0x80000200) conn=0x7f3b6b181cc0(0x001c0300) h1s=0x7f3b6b176840(0x00004010) sd=0x7f3b7a8006c0(0x00020001) sc=0x7f3b7a892c40(0x00001811)
```
I'm also probably wrong about the frequency: these are taken from our stage environment, which has hardly any traffic by comparison, so based on the numbers there it would be about 1/1,000 to 1/2,000. In production the frequency relative to volume is far lower, more like my original 1/100k.
Thanks, we'll try to make use of this!
Thanks.
For the first trace, a shutdown for reads was received; the `CO_FL_SOCK_RD_SH` flag is set on the connection. In the second one it is more or less the same, except an error was also reported (`CO_FL_ERROR | CO_FL_SOCK_WR_SH | CO_FL_SOCK_RD_SH`), probably because the shutdown was detected while trying to send the request.
So in both cases, from the HAProxy point of view, returning a 502 Bad Gateway is the expected behavior. However, I'm surprised it is not visible in your captures.
Just thinking out loud: maybe this happened on a reused connection that was closed before the request arrived and was only detected as such a bit late. That would explain why it's not in the trace.
Hmmm, no, it shouldn't happen, because that was a new client connection and there's no `http-reuse always`, so I think the new connection we're seeing in the trace is genuine and really is the one that returned that error. But why? That's a mystery!
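If connection reuse ever needed to be ruled out entirely as a factor, it can be disabled per backend. A diagnostic sketch only (the backend name is a placeholder), at the cost of a fresh server connection per request:

```
backend be_app
    # Diagnostic only: never reuse idle server connections, so a stale
    # server-side close cannot be mistaken for an error on a fresh request.
    http-reuse never
```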
@SpoonOne regarding the traces, did you capture only TCP traffic or all traffic? I'm asking because we could also imagine an ICMP packet from the server reporting an error to the TCP stack.
I captured everything and I only see ICMP echo requests and replies - no other types.
The screenshots above are from me following the TCP stream in Wireshark so it's everything for those streams.
If it's any help, all 3 machines involved are AWS EC2 instances and the errors occur both within the same AZ and cross-AZ. There doesn't seem to be any pattern there that I can deduce.
OK thanks.
Anything else I can do/provide to help here?
For now I don't know, really :-/ It seems to be the only such report, and the network captures don't seem to confirm the reported flags, so it could be a lot of other things related to the environment, but what? That's a mystery. If you spotted another version that doesn't reproduce the issue it could help, but I can also understand that you cannot easily test other versions in production. If you can, though, comparing with 2.8.5 could be helpful; there are more elements in it to help with troubleshooting.
We have the same, or maybe a similar, problem: when haproxy connects to the backend server over h2c or h2, it returns a 502 for roughly 1 in 1,000 requests. In a normal browser the user sees an HTTP2_PROTOCOL_ERROR message, the browser reloads the URL, and everything is alright. When I switch the backend from h2/h2c to the default h1, this error never happens. tcpdump did not show any connection to the backend server; haproxy just returns the 502 in its logs.
We have haproxy listening on 443/80 and connecting locally to an Apache server on 127.0.0.1:80, so there is no network problem involved; we see the same with nginx as a backend in a remote location.
```
Apr 8 12:03:10 hostname haproxy[427122]: 127.0.0.1:39758 [08/Apr/2024:12:03:10.312] ft_owasp default/backend:80 0/0/0/-1/0 502 209 - - SH-- 14/1/0/0/0 0/0 "GET / HTTP/1.0" 0/-/-/-/0 -/-/-
```
I tried every parameter and error code in `retry-on` and it gave the same error; only `retry-on all-retryable-errors` helped, but that's not good for a production server.
Running version 2.8, but I tried 3.0 as well: same result.
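For reference, a retry-on setup is usually combined with a guard against retrying non-idempotent requests, since broad L7 retries risk exactly the double-processing described earlier in this thread. A hedged sketch (the backend name, retry budget, and chosen events are placeholders to adapt):

```
backend be_app
    retries 3
    # Retry only on errors that occur before the server could have
    # processed the request; "all-retryable-errors" is broader and riskier.
    retry-on conn-failure empty-response response-timeout
    # Never L7-retry non-idempotent methods, to avoid double-processing.
    http-request disable-l7-retry if METH_POST
```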
@pa77777, could you share your config, please?
@pa77777, if there is no connection to the backend server, it is not the same issue. Please open a new issue and fill in the template. Thanks.
Detailed Description of the Problem
Occasionally - in the order of 1 in 100,000 - we see haproxy returning a 502 (with the SH flag in the logs) to the client after passing the request through to the backend and closing the backend TCP connection. The backend, however, actually receives the request and processes it, returning a 202 Accepted back to haproxy. The 502 triggers the client to resend the same request later, which is then double-processed. There are no `http-response deny` rules present - not that it matters, given the backend hasn't even responded at the time haproxy sends back the 502. The size of the request doesn't appear to be problematic either, in relation to the buffer size.
Expected Behavior
I would have expected haproxy to wait for the response from the server.
Steps to Reproduce the Behavior
I can't reliably reproduce this behaviour, other than sending hundreds of thousands of requests and waiting.
Do you have any idea what may have caused this?
No response
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of `haproxy -vv`
Last Outputs and Backtraces
No response
Additional Information
No response