haproxy / haproxy

HAProxy Load Balancer's development branch (mirror of git.haproxy.org)
https://git.haproxy.org/
Other
4.93k stars 795 forks source link

2.8.x: Peers segfault: A bogus APPCTX [0x7f53dc05c5b0] is spinning at 3620630 calls per second and refuses to die, aborting now #2373

Closed idl0r closed 10 months ago

idl0r commented 10 months ago

Detailed Description of the Problem

Hi,

looks like in 2.8(.4) peers is having some issues. We were running 2.7.x for quite some time and had no issues. Recently I upgraded all to 2.8.4 and we've got a first segfault since then.

msg = 0x7f53dc0974e0 "A bogus APPCTX [0x7f53dc05c5b0] is spinning at 3620630 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7f53dc05bf70,43000 src=192.168.255.27 fe=n095138 be=n095138 dst=<PEER> txn=(nil),0 txn.req=-,0 txn.rsp=-,0 rqf=848000 rqa=0 rpf=80048000 rpa=0 scf=0x7f53dc05bf00,CLO,30444 scb=0x7f53dc05c4f0,EST,1c041 af=(nil),0 sab=0x7f53dc05c5b0,7 cof=0x7f549005bba0,1c0000:PASS(0x7f53dc05be10)/NONE((nil))/NONE(-1) cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1) filters={} applet=0x558f154120a0(main+0x2c7870) handler=0x558f1525ffb0(main+0x115780)]\n"

Expected Behavior

No segfault

Steps to Reproduce the Behavior

N/A

Do you have any idea what may have caused this?

Peers

Do you have an idea how to solve the issue?

No response

What is your configuration?

We have like 53 peers at the moment. If more / specific config parts are helpful, I can provide them.
We have 4 stick-tables we use those peers for.

Output of haproxy -vv

HAProxy version 2.8.4-a4ebf9d 2023/11/17 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2028.
Known bugs: http://www.haproxy.org/bugs/bugs-2.8.4.html
Running on: Linux 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment
  OPTIONS = USE_LIBCRYPT=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB= USE_SLZ=1 USE_NS= USE_SYSTEMD=1 USE_PROMEX=1 USE_PCRE= USE_PCRE_JIT= USE_PCRE2=1 USE_PCRE2_JIT=
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER -NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_WOLFSSL -OT -PCRE +PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=48).
Built with OpenSSL version : OpenSSL 1.1.1w  11 Sep 2023
Running on OpenSSL version : OpenSSL 1.1.1w  11 Sep 2023
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with the Prometheus exporter as a service
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.36 2020-12-04
PCRE2 library supports JIT : no (USE_PCRE2_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 10.2.1 20210110

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
    [BWLIM] bwlim-in
    [BWLIM] bwlim-out
    [CACHE] cache
    [COMP] compression
    [FCGI] fcgi-app
    [SPOE] spoe
    [TRACE] trace

Last Outputs and Backtraces

https://gist.github.com/idl0r/186210587f0162f7a4ba4c20e6e4ad25

Additional Information

Please let me know whom I may send the complete coredump, if required :)

wtarreau commented 10 months ago

Hi Christian,

would you by change have the rest of the stderr output of this dump ? It contains precious flags that will help figure what happened. It's still short, in 2.9 we have significantly improved it, but will already be better than nothing. Thanks!

idl0r commented 10 months ago

Oh, sure:

Dec  2 07:10:35.625 n095138 haproxy[27188]: [NOTICE]   (27188) : New worker (27279) forked
Dec  2 07:10:35.626 n095138 haproxy[27188]: [NOTICE]   (27188) : Loading success.
...
Dec  2 07:10:39.000 n095138 haproxy[27279]: A bogus APPCTX [0x7f53dc05c5b0] is spinning at 3620630 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7f53dc05bf70,43000 src=192.168.255.27 fe=n095138 be=n095138 dst=<PEER> txn=(nil),0 txn.req=-,0 txn.rsp=-,0 rqf=848000 rqa=0 rpf=80048000 rpa=0 scf=0x7f53dc05bf00,CLO,30444 scb=0x7f53dc05c4f0,EST,1c041 af=(nil),0 sab=0x7f53dc05c5b0,7 cof=0x7f549005bba0,1c0000:PASS(0x7f53dc05be10)/NONE((nil))/NONE(-1) cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1) filters={} applet=0x558f154120a0(main+0x2c7870) handler=0x558f1525ffb0(main+0x115780)]
Dec  2 07:10:39.086 n095138 haproxy[27279]: [ALERT]    (27279) : A bogus APPCTX [0x7f53dc05c5b0] is spinning at 3620630 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7f53dc05bf70,43000 src=192.168.255.27 fe=n095138 be=n095138 dst=<PEER> txn=(nil),0 txn.req=-,0 txn.rsp=-,0 rqf=848000 rqa=0 rpf=80048000 rpa=0 scf=0x7f53dc05bf00,CLO,30444 scb=0x7f53dc05c4f0,EST,1c041 af=(nil),0 sab=0x7f53dc05c5b0,7 cof=0x7f549005bba0,1c0000:PASS(0x7f53dc05be10)/NONE((nil))/NONE(-1) cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1) filters={} applet=0x558f154120a0(main+0x2c7870) handler=0x558f1525ffb0(main+0x115780)]
Dec  2 07:10:39.086 n095138 haproxy[27279]:   call trace(10):
Dec  2 07:10:39.086 n095138 haproxy[27279]:   | 0x558f151f5298 [0f 0b 66 0f 1f 44 00 00]: stream_dump_and_crash+0x228/0x37a
Dec  2 07:10:39.086 n095138 haproxy[27279]:   | 0x558f1532278a [66 0f 1f 44 00 00 48 8b]: task_run_applet+0x29a/0xc18
Dec  2 07:10:39.086 n095138 haproxy[27279]:   | 0x558f152d65cb [48 89 c3 eb 0e 4c 89 ce]: run_tasks_from_lists+0x33b/0x8eb
Dec  2 07:10:39.086 n095138 haproxy[27279]:   | 0x558f152d6f06 [29 44 24 18 8b 54 24 18]: process_runnable_tasks+0x386/0x6f9
Dec  2 07:10:39.086 n095138 haproxy[27279]:   | 0x558f152a315a [83 3d eb ac 1c 00 01 0f]: run_poll_loop+0x13a/0x55d
Dec  2 07:10:39.086 n095138 haproxy[27279]:   | 0x558f152a3789 [48 8b 1d f0 97 17 00 4c]: main+0x158f59
Dec  2 07:10:39.086 n095138 haproxy[27279]:   | 0x7f56beab0ea7 [64 48 89 04 25 30 06 00]: libpthread:+0x7ea7
Dec  2 07:10:39.086 n095138 haproxy[27279]:   | 0x7f56be9d0a2f [48 89 c7 b8 3c 00 00 00]: libc:clone+0x3f/0x5a
Dec  2 07:10:39.728 n095138 haproxy[27188]: [NOTICE]   (27188) : haproxy version is 2.8.4-a4ebf9d
Dec  2 07:10:39.728 n095138 haproxy[27188]: [NOTICE]   (27188) : path to executable is /usr/sbin/haproxy
Dec  2 07:10:39.728 n095138 haproxy[27188]: [ALERT]    (27188) : Current worker (27279) exited with code 132 (Illegal instruction)
Dec  2 07:10:39.728 n095138 haproxy[27188]: [ALERT]    (27188) : exit-on-failure: killing every processes with SIGTERM
Dec  2 07:10:39.728 n095138 haproxy[27188]: [WARNING]  (27188) : All workers exited. Exiting... (132)
Dec  2 07:10:43.909 n095138 haproxy[27808]: [NOTICE]   (27808) : New worker (27885) forked
Dec  2 07:10:43.909 n095138 haproxy[27808]: [NOTICE]   (27808) : Loading success.
wtarreau commented 10 months ago

Thank you! So what we can see is that hte peers connection was aborted and closed. Maybe there's a race somewhere between error handling and message creation in peers, we need to investigate.

capflam commented 10 months ago

Well reading the code, I'm able to reproduce something similar. I don't know if it is exactly the same bug because my crash happens when the PEER applet is the client. But peer_recv_msg() has a flaw. When a message is truncated on the message length, data are not consumed, the applet waits for more data regardless there is a pending shutdown or not.

@idl0r, if it is possible, you can try the following patch on top of the 2.8.4:

diff --git a/src/peers.c b/src/peers.c
index c4760e68e..306e237ba 100644
--- a/src/peers.c
+++ b/src/peers.c
@@ -2403,7 +2403,7 @@ static inline int peer_recv_msg(struct appctx *appctx, char *msg_head, size_t ms
        return 1;

  incomplete:
-       if (reql < 0) {
+       if (reql < 0 || (sc->flags & (SC_FL_SHUT_DONE|SC_FL_SHUT_WANTED))) {
                /* there was an error or the message was truncated */
                appctx->st0 = PEER_SESS_ST_END;
                return -1;
capflam commented 10 months ago

BTW, I will push this fix because it remains valid. It is most probably your bug. @wtarreau, still good for the 2.9.0 ?

wtarreau commented 10 months ago

Yep, hurry up :-)

capflam commented 10 months ago

too much pressure.... Pushed !

idl0r commented 10 months ago

Well reading the code, I'm able to reproduce something similar. I don't know if it is exactly the same bug because my crash happens when the PEER applet is the client. But peer_recv_msg() has a flaw. When a message is truncated on the message length, data are not consumed, the applet waits for more data regardless there is a pending shutdown or not.

@idl0r, if it is possible, you can try the following patch on top of the 2.8.4:

diff --git a/src/peers.c b/src/peers.c
index c4760e68e..306e237ba 100644
--- a/src/peers.c
+++ b/src/peers.c
@@ -2403,7 +2403,7 @@ static inline int peer_recv_msg(struct appctx *appctx, char *msg_head, size_t ms
        return 1;

  incomplete:
-       if (reql < 0) {
+       if (reql < 0 || (sc->flags & (SC_FL_SHUT_DONE|SC_FL_SHUT_WANTED))) {
                /* there was an error or the message was truncated */
                appctx->st0 = PEER_SESS_ST_END;
                return -1;

Sure, can do. I just doubt I can provide good feedback. It crashed one of ~54 LBs and only once.

capflam commented 10 months ago

Honestly, I pretty sure it is your bug. It is why I marked it has fixed.

idl0r commented 10 months ago

Alright. Thanks! :)

wtarreau commented 10 months ago

Then if it's so rare, don't worry, chances are low that you'll see it again before the next update and the fix will be there. Thanks a lot for your report as usual!

idl0r commented 10 months ago

Then if it's so rare, don't worry, chances are low that you'll see it again before the next update and the fix will be there. Thanks a lot for your report as usual!

Very welcome and thank you guys! :)