haproxy / haproxy

HAProxy Load Balancer's development branch (mirror of git.haproxy.org)
https://git.haproxy.org/

Haproxy randomly returns 502 Bad Gateway #2368

Open SpoonOne opened 10 months ago

SpoonOne commented 10 months ago

Detailed Description of the Problem

Occasionally - on the order of 1 in 100,000 requests - we see haproxy return a 502 (with the SH flag in the logs) to the client after passing the request through to the backend and closing the backend TCP connection. The backend, however, actually receives the request and processes it, returning a 202 Accepted back to haproxy. The 502 then triggers the client to resend the same request later, which gets double-processed.

[screenshot: haproxy_502_bug]

There are no http-response deny rules present - not that it matters, given the backend hasn't even responded at the time haproxy sends back the 502. The size of the request doesn't appear to be problematic either, relative to the buffer size.

Expected Behavior

I would have expected haproxy to wait for the response from the server.

Steps to Reproduce the Behavior

I can't reliably reproduce this behaviour, other than sending hundreds of thousands of requests and waiting.

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

What is your configuration?

global
  chroot  /var/lib/haproxy
  daemon
  external-check
  group  haproxy
  insecure-fork-wanted
  log  /dev/log local0
  maxconn  4098
  stats  socket /var/lib/haproxy/stats level admin
  user  haproxy

defaults
  log  global
  maxconn  8000
  option  redispatch
  retries  3
  timeout  http-request 10s
  timeout  queue 1m
  timeout  connect 10s
  timeout  client 1m
  timeout  server 1m
  timeout  check 10s
  unique-id-format  %{+X}o\ %pid%cp%fp%Ts%ms%rt

frontend myfrontend
  bind :15613
  mode http
  default_backend mybackend
  acl cid_exists req.hdr(X-Correlation-ID) -m found
  http-request set-header X-Correlation-ID %[unique-id] unless cid_exists
  http-request capture hdr(X-Correlation-ID) len 64
  log-format "%ci:%cp [%tr] %ft %b/%s %Th/%Ti/%TR/%Tw/%Tc/%Tr/%Ta %ST %U %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %sslv %sslc %[capture.req.hdr(0)] %{+Q}r"

backend mybackend
  balance roundrobin
  mode http
  option httpchk GET /?port=15600
  server server-a1 10.1.1.1:15613 check port 8980
  server server-a2 10.1.1.2:15613 check port 8980
  server server-b1 10.1.2.1:15613 check port 8980
  server server-b2 10.1.2.2:15613 check port 8980

Output of haproxy -vv

HAProxy version 2.4.22-0ubuntu0.22.04.2 2023/08/14 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.22.html
Running on: Linux 6.2.0-1016-aws #16~22.04.1-Ubuntu SMP Sun Nov  5 20:08:16 UTC 2023 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -O2 -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1 USE_SYSTEMD=1 USE_PROMEX=1
  DEBUG   =

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL +EPOLL -EVPORTS +FUTEX +GETADDRINFO -KQUEUE +LIBCRYPT +LINUX_SPLICE +LINUX_TPROXY +LUA -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PRIVATE_CACHE -PROCCTL +PROMEX -PTHREAD_PSHARED -QUIC +RT +SLZ -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=2).
Built with OpenSSL version : OpenSSL 3.0.2 15 Mar 2022
Running on OpenSSL version : OpenSSL 3.0.2 15 Mar 2022
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.6
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.39 2021-10-29
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 11.4.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2       flags=HTX|CLEAN_ABRT|HOL_RISK|NO_UPG
            fcgi : mode=HTTP       side=BE        mux=FCGI     flags=HTX|HOL_RISK|NO_UPG
              h1 : mode=HTTP       side=FE|BE     mux=H1       flags=HTX|NO_UPG
       <default> : mode=HTTP       side=FE|BE     mux=H1       flags=HTX
            none : mode=TCP        side=FE|BE     mux=PASS     flags=NO_UPG
       <default> : mode=TCP        side=FE|BE     mux=PASS     flags=

Available services : prometheus-exporter
Available filters :
    [SPOE] spoe
    [CACHE] cache
    [FCGI] fcgi-app
    [COMP] compression
    [TRACE] trace

Last Outputs and Backtraces

No response

Additional Information

No response

git001 commented 10 months ago

Have you tried updating to the latest 2.4? There is a fix for HTTP headers - see http://www.haproxy.org/bugs/bugs-2.4.22.html

SpoonOne commented 10 months ago

> Have you tried updating to the latest 2.4? There is a fix for HTTP headers - see http://www.haproxy.org/bugs/bugs-2.4.22.html

Sorry, which specific bug are you referencing? I checked through all of them and didn't see any that seemed relevant to this issue.

git001 commented 10 months ago

The SH flag is documented like this in https://docs.haproxy.org/2.4/configuration.html:

    SH   The server aborted before sending its full HTTP response headers, or
         it crashed while processing the request. Since a server aborting at
         this moment is very rare, it would be wise to inspect its logs to
         control whether it crashed and why. The logged request may indicate a
         small set of faulty requests, demonstrating bugs in the application.
         Sometimes this might also be caused by an IDS killing the connection
         between HAProxy and the server.
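As a side note, since your log-format captures the termination state (%tsc), you can count these events in your logs with something like the following (assuming local0 ends up in /var/log/haproxy.log on your system):

    # count SH terminations; the log path is an assumption
    grep -c ' SH-- ' /var/log/haproxy.log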

I was referring to [BUG/MAJOR: http-ana: Get a fresh trash buffer for each header value replacement](http://git.haproxy.org/?p=haproxy-2.4.git;a=commitdiff;h=90f7847), but it's just a wild guess; it could be any header fix in 2.4.24.

Since the Wireshark output doesn't show whether the Content-Length header is involved, it could also be one of the Content-Length fixes. In any case, the best approach is to try the latest released version and see if the issue still exists.

Any chance you could update to the latest 2.8?

SpoonOne commented 10 months ago

Thank you for the bug reference - it's an interesting idea. Do these header errors cause a 502 to be returned before the request is sent to the backend, or after it's been sent, like in my example?

The X-Correlation-ID header is already present on all of these types of requests to this particular haproxy, so it doesn't have to inject the header, just capture it for the log. There is no Content-Length header (assuming that's what you mean by C-L) in this particular example.

Upgrading will be tricky as we use Ubuntu LTS servers and this version is what comes with 22.04 at the moment.

Actually, because Ubuntu patches security issues, the empty Content-Length header issue has already been fixed in the version we're running.

If I could verify that this is actually fixed in a later release (we were seeing this on version 2.0 as well), then I could make the case internally to run a non-LTS version.

SpoonOne commented 10 months ago

I'm trying the PPA version of 2.8 in our stage environment and will watch for any recurrences.
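For reference, the install went roughly like this (assuming the commonly used community PPA; the PPA name and version pin are assumptions):

    # hypothetical: community PPA providing recent HAProxy builds for Ubuntu
    sudo add-apt-repository ppa:vbernat/haproxy-2.8
    sudo apt-get update
    sudo apt-get install haproxy=2.8.\*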

wtarreau commented 10 months ago

Hello,

It's possible that your server responds (and closes) too early on a POST, before the end of the POST. The problem is then purely TCP-specific: subsequent packets still in flight after the close will trigger a TCP RST on the server, which, when it comes back, destroys data in flight and can result in a truncated or empty response. The fact that you say it happens rarely could indicate that it only happens to data still sitting in the server's TCP buffers, which are destroyed there before even being sent over the wire. A network capture between haproxy and the server would definitely help.
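For example, something along these lines on the haproxy machine would do (the backend address and port come from your config; the interface and output path are assumptions):

    # capture full packets between haproxy and one backend server
    tcpdump -i any -s 0 -w /tmp/haproxy-backend.pcap host 10.1.1.1 and tcp port 15613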

What makes me think of this is that responses in close mode without a content-length generally come from simple servers that just close at the end (i.e., plain old CGIs) without taking care of the underlying TCP protocol.

If this is what's happening, the only solution is to make sure the server closes after having drained the request. We know that's not always easy, and maybe you have no way to act on the server's code, I don't know.

SpoonOne commented 10 months ago

Hi Willy,

There's a screenshot of a packet capture taken on the haproxy server in the first post; it shows both the client- and server-side packets. The requests are GETs to a CGI (Kannel in this case), and you can see in the capture that haproxy sends a 502 back to the client after sending the GET to the server but before the server responds with any packet. The server's response then arrives later, but it's too late at that point.

I updated our stage environment to 2.8.4 via the PPA last week, and over the weekend I still saw a recurrence of the issue. I don't have a packet capture for it, but I might set up a rolling capture to try to catch the next occurrence.
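Something like a ring-buffer capture, perhaps (sizes and paths are guesses):

    # keep a rolling window: rotate files at 100MB, keep at most 48 of them
    tcpdump -i any -s 0 -C 100 -W 48 -w /tmp/haproxy-roll.pcap tcp port 15613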

wtarreau commented 10 months ago

Ah indeed, that's very strange, and the fact that the server responds indicates the request was properly sent over the wire. I'm a bit shocked to see a FIN being emitted to the server, and was wondering if the incoming request had one as well. Unfortunately, Wireshark doesn't display flags when it shows contents, so some doubt remains. HAProxy may have encountered an error in the lower layers (a syscall, or even an internal error) that is reported like this, but I don't see why the request would have been properly delivered in that case. If you manage to reproduce it (though at 1/100k I doubt you will), an strace output of it could be useful, but I understand that will be quite difficult.

capflam commented 10 months ago

Now that you are using 2.8, you can enable H1 traces at the error level. You can add this snippet after your global section:

    ring buf1
        size 104857600   # 100MB
        format timed
        backing-file /tmp/blah

    global
        expose-experimental-directives
        trace h1 sink buf1
        trace h1 level error
        trace h1 verbosity complete
        trace h1 start now

This will write the H1 error traces to the file /tmp/blah. You can view the traces by running:

     strings /tmp/blah | less

If that does not work, you can use the haring tool from the HAProxy sources. To do so, compile it (make dev/haring/haring), then run:

    ./dev/haring/haring -f /tmp/blah | less

You can also check your backend stats for the srv_abrt (server aborts) and eresp (invalid responses) counters. The 502 errors should increment one of them.
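For example, via the stats socket declared in your config (a sketch, assuming socat is installed):

    # dump the stats CSV, then locate the eresp/srv_abrt columns from the header
    echo "show stat" | socat stdio /var/lib/haproxy/stats > /tmp/stats.csv
    head -1 /tmp/stats.csv | tr ',' '\n' | nl | grep -E 'eresp|srv_abrt'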

SpoonOne commented 10 months ago

Thanks Christopher, I have enabled the above settings and will wait for the next 502 to see if it shows something useful.

I did manage to verify in a packet capture that 2.8 is behaving the same, in that the 502 is returned to the client before the server responds.

SpoonOne commented 10 months ago

Here's an example similar to the one above.

[screenshot: haproxy_502_231210_113905]

The captured error trace at that time was:

    <0>2023-12-10T11:39:05.732583+00:00 [01|h1|0|mux_h1.c:1927] message aborted, set error on SC : [B,RUN] [MSG_DONE, MSG_RPBEFORE] - req=(.fl=0x00001511 .curr_len=0 .body_len=0) res=(.fl=0x00001404 .curr_len=0 .body_len=0) - h1c=0x7f3b7a8af280(0x80000100) conn=0x7f3b7a8adf00(0x00040300) h1s=0x7f3b7a8ad840(0x00004010) sd=0x7f3b7a813280(0x05018001) sc=0x7f3b7a8a8d80(0x00001411)

Here's a slightly different 502, where haproxy returns the 502 BEFORE sending the request to the server.

[screenshot: haproxy_502_231210_194405]

The corresponding error is:

    <0>2023-12-10T19:44:05.767365+00:00 [01|h1|0|mux_h1.c:3781] reporting error to the app-layer stream : [B,RUN] [MSG_DONE, MSG_RPBEFORE] - req=(.fl=0x00001511 .curr_len=0 .body_len=0) res=(.fl=0x00001404 .curr_len=0 .body_len=0) - h1c=0x7f3b6b176000(0x80000200) conn=0x7f3b6b181cc0(0x001c0300) h1s=0x7f3b6b176840(0x00004010) sd=0x7f3b7a8006c0(0x00020001) sc=0x7f3b7a892c40(0x00001811)

SpoonOne commented 10 months ago

I'm also probably wrong about the frequency. These examples are from our stage environment, which has hardly any traffic by comparison, so based on the numbers there it would be more like 1/1,000 to 1/2,000. In production the frequency relative to volume is far lower, closer to my original 1/100k.

wtarreau commented 10 months ago

Thanks, we'll try to exploit this!

capflam commented 10 months ago

Thanks.

For the first trace, a shutdown for reads was received: the CO_FL_SOCK_RD_SH flag is set on the connection. In the second one, it is more or less the same, except that an error was also reported (CO_FL_ERROR | CO_FL_SOCK_WR_SH | CO_FL_SOCK_RD_SH), probably because the shutdown was detected while trying to send the request.

So in both cases, from HAProxy's point of view, returning a 502 Bad Gateway is the expected behavior. However, I'm surprised it is not visible in your captures.

wtarreau commented 10 months ago

Just thinking out loud: maybe this happened on a reused connection that was closed before the request arrived, and the closure was only detected a bit late. That would explain why it's not in the trace.

wtarreau commented 10 months ago

Hmmm, no, it shouldn't happen, because that was a new client connection and there's no "http-reuse always", so I think the new connection we're seeing in the trace is genuine and really is the one that returned the error. But why? That's a mystery!

wtarreau commented 10 months ago

@SpoonOne, regarding the traces, did you capture only TCP traffic or all traffic? I'm asking because we could also imagine an ICMP packet from the server reporting an error to the TCP stack.
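If you want to watch for that, something like this would show any non-echo ICMP (the interface is an assumption):

    # show ICMP other than echo request/reply (e.g. destination-unreachable)
    tcpdump -ni any 'icmp and icmp[icmptype] != icmp-echo and icmp[icmptype] != icmp-echoreply'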

SpoonOne commented 10 months ago

I captured everything and I only see ICMP echo requests and replies - no other types.

The screenshots above are from me following the TCP stream in Wireshark so it's everything for those streams.

If it's any help, all 3 machines involved are AWS EC2 instances and the errors occur both within the same AZ and cross-AZ. There doesn't seem to be any pattern there that I can deduce.

wtarreau commented 10 months ago

OK thanks.

SpoonOne commented 8 months ago

Anything else I can do/provide to help here?

wtarreau commented 8 months ago

For now I really don't know :-/ It seems to be the only such report, and the network captures don't seem to confirm the reported flags, so it could be a lot of other things related to the environment - but what? That's a mystery. Maybe if you spotted another version that doesn't reproduce this issue it could help, though I understand you cannot easily test another version in production. If you can, comparing with 2.8.5 could be helpful; there are more elements there to help with troubleshooting.

pa77777 commented 6 months ago

We have the same, or maybe a similar, problem: when my backend connects to the server over the h2c or h2 protocol, haproxy returns a 502 for roughly 1 in 1,000 requests. In a normal browser the user sees an HTTP2_PROTOCOL_ERROR message, then the browser reloads the URL and everything is fine. When I switch the backend server from h2/h2c back to the default h1, this error never happens. tcpdump did not show me any connection to the backend server, just the 502 in the haproxy logs.

We have haproxy listening on 443/80 and connecting locally to an Apache server on 127.0.0.1:80, so there is no network problem involved; we see the same with nginx as a backend in a remote location.

Apr 8 12:03:10 hostname haproxy[427122]: 127.0.0.1:39758 [08/Apr/2024:12:03:10.312] ft_owasp default/backend:80 0/0/0/-1/0 502 209 - - SH-- 14/1/0/0/0 0/0 "GET / HTTP/1.0" 0/-/-/-/0 -/-/-

I tried every parameter and error code in retry-on and got the same error; only retry-on all-retryable-errors helped, but that's not good for a production server.
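For reference, the workaround in question looks roughly like this (a sketch with hypothetical names; retrying may re-send requests the server has already processed, which is the production concern):

    backend bk_apache
        mode http
        retries 3
        retry-on all-retryable-errors
        server local 127.0.0.1:80 proto h2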

We're running version 2.8, but I tried 3.0 as well: same result.

capflam commented 6 months ago

@pa77777, could you share your config, please?

capflam commented 6 months ago

@pa77777, if there is no connection to the backend server, it is not the same issue. Please open a new issue and fill in the template. Thanks.