haproxy / haproxy

HAProxy Load Balancer's development branch (mirror of git.haproxy.org)
https://git.haproxy.org/

Encountered SIGABRT for HAProxy 2.2.24 randomly in dns_process_resolvers #1957

Closed pranikum closed 7 months ago

pranikum commented 1 year ago

Detailed Description of the Problem

We see random SIGABRT crashes in HAProxy 2.2.24 during dns_process_resolvers. We run HAProxy on a number of nodes, and on a few of them HAProxy goes down randomly. We enabled core dumps for the HAProxy process; the backtrace is shared below.

Expected Behavior

HAProxy should not be going down.

Steps to Reproduce the Behavior

NA

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

What is your configuration?

Some values are removed.

global
    maxconn     30000
    daemon
    tune.bufsize 32678  
    spread-checks 5
    user <>
    log 127.0.0.1 local0 info
    h1-case-adjust-file ENV_ROOT/etc/headers_adjust.map
    set-dumpable

defaults
    mode        http
    timeout client 300s
    timeout connect 1000ms
    timeout http-keep-alive 300s
    option http-server-close
    option redispatch
    option splice-auto

    option httplog

    errorfile 403 ENV_ROOT/etc/es_403.json
    errorfile 500 ENV_ROOT/etc/es_500.json
    errorfile 502 ENV_ROOT/etc/es_502.json
    errorfile 503 ENV_ROOT/etc/es_503.json
    errorfile 504 ENV_ROOT/etc/es_504.json

resolvers local_res
    nameserver dns1 xxx.xxx.xxx.xxx:53
    timeout resolve 30s
    hold valid 60s

frontend single_box
    bind 0.0.0.0:443 interface eth0 SSL_CERT
    bind :::443 interface eth0 SSL_CERT
    bind 0.0.0.0:443 interface lo SSL_CERT
    bind :::443 interface lo SSL_CERT

    maxconn 7000

    http-request add-header <> 1
    http-request del-header <>

    option h1-case-adjust-bogus-client
    acl h_xff_exists req.hdr(X-Forwarded-For) -m found
    http-request replace-header X-Forwarded-For .*,(.*) \1 if h_xff_exists
    http-response del-header Server

    acl allowed_meth method GET HEAD POST PUT DELETE OPTIONS PATCH
    acl is_healthcheck path /some-path
    acl is_proxy_stats path_beg /some-path
    acl is_proxy_health path_beg /some-path
    acl is_<> path_beg /
    acl is_internal_auth path_sub <>
    http-request deny if <>
    http-request set-nice -1024 if is_healthcheck

    acl <> capture.req.uri -m beg /<>
    http-request set-var(txn.https) hdr(X-Forwarded-Proto)
    acl is_https var(txn.https) -m str https
    http-response add-header Strict-Transport-Security "max-age=31536000" if <> is_https
    use_backend dne unless allowed_meth
    use_backend <> if is_healthcheck
    use_backend <> if is_proxy_stats
    use_backend <> if is_proxy_health
    use_backend <> if <>

    default_backend <>
    option httplog
    capture request header <> len 50
    capture request header Host len 50
    capture request header <> len 50

Output of haproxy -vv

HA-Proxy version 2.2.24 2022/05/13 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.24.html
Running on: Linux 5.4.209-116.363.amzn2.x86_64 #1 SMP Wed Aug 10 21:19:18 UTC 2022 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = /usr/bin/gcc
  CFLAGS  = -pthread -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -fdiagnostics-color=auto -O3 -march=core2 -msse -mfpmath=sse -g -fPIC -Wchar-subscripts -Wcomment -Wformat -Winit-self -Wmain -Wmissing-braces -Wno-pragmas -Wparentheses -Wreturn-type -Wsequence-point -Wstrict-aliasing -Wswitch -Wtrigraphs -Wuninitialized -Wunknown-pragmas -Wunused-label -Wunused-variable -Wunused-value -Wno-error=missing-include-dirs -Wno-error=narrowing -Wpointer-sign -Wimplicit -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-stringop-overflow -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1
  DEBUG   = 

Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM -ZLIB +SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=2).
Built with OpenSSL version : OpenSSL 1.0.2zf  21 Jun 2022
Running on OpenSSL version : OpenSSL 1.0.2zf  21 Jun 2022
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.4.3
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.33 2019-04-16
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 7.3.1 20180712 (Red Hat 7.3.1-6)

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
              h2 : mode=HTTP       side=FE|BE     mux=H2
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services : none

Available filters :
    [SPOE] spoe
    [COMP] compression
    [TRACE] trace
    [CACHE] cache
    [FCGI] fcgi-app

Last Outputs and Backtraces

Excess command line arguments ignored. (t ...)
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-36.amzn2.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./haproxy...done.
[New LWP 8914]
[New LWP 8915]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `<path> -p'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fa7276f8ca0 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7fa729390240 (LWP 8914))]
Missing separate debuginfos, use: debuginfo-install <>
(gdb) t a a bt full

Thread 2 (Thread 0x7fa7276c4700 (LWP 8915)):
#0  0x000000000057c8be in dns_resolve_recv (dgram=<optimized out>) at src/dns.c:2310
        __pl_l = 0x23d4060
        __pl_r = <optimized out>
        ns = <optimized out>

#1  0x00000000005d2193 in dgram_fd_handler (fd=<optimized out>) at src/dgram.c:25
        dgram = 0x2486580
#2  0x00000000004285de in fd_update_events (evts=<optimized out>, fd=16) at include/haproxy/fd.h:415
        old = <optimized out>
        new = <optimized out>
        new_flags = <optimized out>
        must_stop = <optimized out>
#3  _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:251
        ev = {events = 2752111, data = {ptr = 0x2a7e667b00000000, fd = 0, u32 = 0, u64 = 3061997475125526528}}
        n = <optimized out>
        e = <optimized out>
        status = <optimized out>
        fd = 16
        count = <optimized out>
        updt_idx = <optimized out>
        old_fd = <optimized out>
#4  0x000000000055db2d in run_poll_loop () at src/haproxy.c:3024
        next = <optimized out>
        wake = <optimized out>
#5  0x000000000055df68 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3142
        ptaf = <optimized out>
        ptif = <optimized out>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
          __align = 0}
        init_cond = {__data = {{__wseq = 3, __wseq32 = {__low = 3, __high = 0}}, {__g1_start = 1, __g1_start32 = {__low = 1, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 4, 
            __wrefs = 0, __g_signals = {0, 0}}, __size = "\003\000\000\000\000\000\000\000\001", '\000' <repeats 23 times>, "\004", '\000' <repeats 14 times>, __align = 3}
#6  0x00007fa7286fe44b in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x00007fa7277b456f in clone () from /lib64/libc.so.6
No symbol table info available.
---Type <return> to continue, or q <return> to quit---

Thread 1 (Thread 0x7fa729390240 (LWP 8914)):
#0  0x00007fa7276f8ca0 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fa7276fa148 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000000000553d70 in ha_panic () at src/debug.c:271
No locals.
#3  0x00000000005c3b4b in wdt_handler (sig=14, si=<optimized out>, arg=<optimized out>) at src/wdt.c:121
        p = 123396768590
        thr = 0
#4  <signal handler called>
No symbol table info available.
#5  0x0000000000577c4a in dns_process_resolvers (t=t@entry=0x24866a0, context=context@entry=0x23c4730, state=state@entry=524) at src/dns.c:2411
        resolvers = 0x23c4730
        res = <optimized out>
        resback = <optimized out>
#6  0x00000000005a880c in run_tasks_from_lists (budgets=budgets@entry=0x7fffd15cb124) at src/task.c:519
        process = 0x5779d0 <dns_process_resolvers>

#7  0x00000000005a9335 in process_runnable_tasks () at src/task.c:713
        tt = 0x9d6580 <task_per_thread>

#8  0x000000000055db88 in run_poll_loop () at src/haproxy.c:2977
        next = <optimized out>
        wake = <optimized out>
#9  0x000000000055df68 in run_thread_poll_loop (data=data@entry=0x0) at src/haproxy.c:3142
        ptaf = <optimized out>
        ptif = <optimized out>
        ptdf = <optimized out>
        ptff = <optimized out>
        init_left = 0
        init_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
          __align = 0}
        init_cond = {__data = {{__wseq = 3, __wseq32 = {__low = 3, __high = 0}}, {__g1_start = 1, __g1_start32 = {__low = 1, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, __g1_orig_size = 4, 
            __wrefs = 0, __g_signals = {0, 0}}, __size = "<>' <repeats 23 times>, "\004", '\000' <repeats 14 times>, __align = 3}
#10 0x00000000004253a5 in main (argc=<optimized out>, argv=0x7fffd15cb598) at src/haproxy.c:3870
        blocked_sig = {__val = {
        i = 2
---Type <return> to continue, or q <return> to quit---
        err = <optimized out>
        retry = <optimized out>
        limit = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        errmsg = <->
        pidfd = <optimized out>
        intovf = <optimized out>

Additional Information

No response

pranikum commented 1 year ago

Is this issue already fixed as part of 2.2.25, or is this a new issue?

wtarreau commented 1 year ago

It looks like an AB/BA locking condition in the resolvers code (i.e. one code path holds lock A and waits for lock B to be released while the other code path does the opposite). It doesn't remind me of anything specific, but the resolvers code is tentacular and a small number of fixes were applied between 2.2.24 and 2.2.25; in any case you should update, at least to benefit from the latest fixes.

Darlelet commented 1 year ago

While there are some architectural changes between the two versions, could it be a similar contention issue around DNS response handling in both 2.2 and 2.4?

See: https://github.com/haproxy/haproxy/issues/1952

In 2.4 (#1952): resolv_process_responses holds the resolvers lock for too long, probably because processing many packets from the same batch prevents recv from returning EAGAIN for a long time, keeping it in the loop. According to the gdb trace, the stuck thread is in dns_recv_nameserver() while the others are waiting on the resolvers lock.

In 2.2: dns_resolve_recv() (the legacy resolv_process_responses function) also seems to hold the resolvers lock for too long. In @pranikum's trace, though, according to gdb the stuck thread (thread 2) is waiting on the server lock at line 2310 while holding the resolvers lock, supporting @wtarreau's hypothesis.

But based on the trace and a quick code inspection, I couldn't find where this server lock could be held, either by thread 2 (itself) or by thread 1 (currently stuck at the very beginning of dns_process_resolvers).

wtarreau commented 1 year ago

It's not the same problem because, as Christopher pointed out to me, these are two different locks held by the two threads, so by definition it cannot be a matter of time; it simply means these threads are definitely blocked. And since there are no other threads, it implies that each lock is already held by the other thread, hence the deadlock.

Regarding the fact that you didn't find where the locks are taken, I don't know either, and it could indeed be that, due to a bug, another one was not released. Note that there's still a bug in 2.2.24 where a removal from the ebtree is missing, which can lead to a use-after-free, and possibly indirectly to such a case as well. And I agree that it's theoretically possible that both threads are merely stuck in a loop and were caught just at the moment they were taking a lock, except that when a thread is alone on a lock, taking it is a single instruction and very fast, so the probability of landing on such an instruction without contention is almost zero. And here we caught two spinlocks at once. Of course, looking at more traces from other crashes could confirm or refute this, but I'd say the one-in-a-million chance of catching a thread on a non-contended lock is rare enough not to take it as the first hypothesis here.

pranikum commented 1 year ago

@wtarreau ... Yes, there is a plan to update to 2.2.25. However, I wanted to be sure whether this issue is already fixed; it looks like it's still there. I will try to collect more dumps and add them to the thread.

wtarreau commented 1 year ago

The response is: we don't know. Several stability bugs affecting the resolvers area were addressed between .24 and .25, and this code is extremely sensitive: what you fix often has no immediate relation to what you observe, so it's really impossible to say whether you're experiencing one of the possible side effects of these bugs. All I can say is that some nasty bugs were fixed there, and another one was even fixed after .25 (and will be in the next .26 once released). If you're building yourself, you could even be interested in taking the latest 2.2-stable, which contains all the fixes we're aware of for 2.2 and will become 2.2.26.

capflam commented 1 year ago

2.2.27 has been released. You should try it.

capflam commented 1 year ago

Any news about this issue?

pranikum commented 1 year ago

We are still in the process of moving to the latest version. We have not seen this issue very frequently on 2.2.24. Will update once we have completed the migration.

capflam commented 1 year ago

Copy that. Thanks!

capflam commented 1 year ago

@pranikum, it's me again. Did you get a chance to test a newer version?

pranikum commented 1 year ago

Yes, we have tested the newer version and have started migrating to it. So far we have not seen the issue with it. We will keep monitoring.

capflam commented 7 months ago

I'm closing this now, as it seems fixed; at least it hasn't been reported by anyone in a while :) Thanks!