haproxy / haproxy

HAProxy Load Balancer's development branch (mirror of git.haproxy.org)
https://git.haproxy.org/

HAproxy services have some sticky table issues due to which HAproxy services go down automatically #2025

Closed sona0108 closed 1 year ago

sona0108 commented 1 year ago

Detailed Description of the Problem

This continues the discussion from the issue we raised previously, https://github.com/haproxy/haproxy/issues/1720. We have 4 HAProxy servers in our environment, and they show the following stick-table issues:

1. Every one or two months, the stick-table graphs climb sharply.
2. As the attached screenshot shows, the two stick tables are not aligned, even though they should increase and decrease in parallel.

As a result, all the services go down automatically and our customer faces downtime. Can you please help us resolve this stick-table problem?

We are raising this request because we think this is likely a bug in HAProxy: it still happens even after we upgraded HAProxy to 2.4.18.

[Screenshot: sticky tables]

Expected Behavior

We have upgraded our HAProxy environment to version 2.4.18, and we think it should not behave like this.

Steps to Reproduce the Behavior

NA

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

What is your configuration?

global

    log /dev/log local0
    tune.ssl.default-dh-param 4096
    maxconn 40000
    stats socket /var/run/haproxy.sock mode 600 level admin
    stats timeout 2m

defaults
    log global
    option httplog

    mode http
    option dontlognull
    option redispatch
    option contstats
    option forwardfor
    backlog 10000
    retries 3
    timeout server 2m
    timeout client 2m
    http-reuse safe
    timeout connect 10s
    timeout tunnel 120s
    timeout http-keep-alive 2m
    timeout http-request 15s
    timeout queue 30s
    timeout tarpit 60s
    stats enable
    stats uri /stats
    stats auth *******:********

http-errors custerrors
    errorfile 504 /etc/haproxy/errors/504-json.http

Output of haproxy -vv

HAProxy version 2.4.18-1d80f18 2022/07/27 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.18.html
Running on: Linux 4.18.0-448.el8.x86_64 #1 SMP Wed Jan 18 15:02:46 UTC 2023 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE2=1 USE_LINUX_TPROXY=1 USE_CRYPT_H=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1 USE_SYSTEMD=1 USE_PROMEX=1
  DEBUG   =

Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM -ZLIB +SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL -PROCCTL +THREAD_DUMP -EVPORTS -OT -QUIC +PROMEX -MEMORY_PROFILING

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.1.1k  FIPS 25 Mar 2021
Running on OpenSSL version : OpenSSL 1.1.1k  FIPS 25 Mar 2021
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.4
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.32 2018-09-10
PCRE2 library supports JIT : no (USE_PCRE2_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 8.5.0 20210514 (Red Hat 8.5.0-4)

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2       flags=HTX|CLEAN_ABRT|HOL_RISK|NO_UPG
            fcgi : mode=HTTP       side=BE        mux=FCGI     flags=HTX|HOL_RISK|NO_UPG
       <default> : mode=HTTP       side=FE|BE     mux=H1       flags=HTX
              h1 : mode=HTTP       side=FE|BE     mux=H1       flags=HTX|NO_UPG
       <default> : mode=TCP        side=FE|BE     mux=PASS     flags=
            none : mode=TCP        side=FE|BE     mux=PASS     flags=NO_UPG

Available services : prometheus-exporter
Available filters :
        [SPOE] spoe
        [CACHE] cache
        [FCGI] fcgi-app
        [COMP] compression
        [TRACE] trace

Last Outputs and Backtraces

HAProxy logs from when we saw the stick tables go higher:

Jan 30 13:22:57 ip-10-53-167-219 haproxy[1243]: ::ffff:10.53.1.23:61400 [30/Jan/2023:13:22:57.427] main~ backend_decision/service_decision7 0/0/0/20/20 200 124 - - ---- 358/358/14/0/0 0/0 {lbg:872251814|} "GET /artim-decision/rs/interaction/lbg:872251814/TargetedMessages;channel=004;inp=lmapp1?trace=0&version=2 HTTP/1.1"
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: Thread 2 is about to kill the process.
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: >Thread 1 : id=0x7f41569240c0 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             stuck=1 prof=0 harmless=0 wantrdv=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cpu_ns: poll=45338175014787 now=45340268664342 diff=2093649555
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             curr_task=0x5606b27dc360 (task) calls=101145680 last=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:               fct=0x5606b152d1f0(process_table_expire) ctx=0x5606b272b020
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             call trace(15):
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b159f970 [85 c0 75 2c 48 8b 84 24]: ha_dump_backtrace+0x40/0x310
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a0335 [eb 8f 66 0f 1f 84 00 00]: ha_thread_dump+0x275/0x2be
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a03ee [48 8b 05 e3 e3 3b 00 48]: debug_handler+0x6e/0x10e
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x7f4155ec6cf0 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12cf0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15e878d [44 8b 4f 10 8b 47 14 4c]: pool_free_nocache+0xd/0x41
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15e8888 [41 8b 84 24 90 00 00 00]: pool_evict_from_local_cache+0xb8/0xf5
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15e8a53 [64 48 8b 04 25 88 7e ff]: pool_put_to_cache+0xd3/0xfb
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b152d406 [8b 0c 24 e9 72 ff ff ff]: process_table_expire+0x216/0x23c
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: *>Thread 2 : id=0x7f4152d47700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             stuck=1 prof=0 harmless=0 wantrdv=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cpu_ns: poll=30843526634061 now=30845648546409 diff=2121912348
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             curr_task=0x7f414c118250 (task) calls=2107810504 last=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:               fct=0x5606b15e9080(task_run_applet) ctx=0x7f414c0a4460(<PEER>)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             strm=0x7f414c02ce10,0 src=10.53.38.114 fe=ip-10-53-167-219.eu-west-1.compute.internal be=ip-10-53-167-219.eu-west-1.compute.internal dst=<PEER>
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             txn=(nil),0 txn.req=-,0 txn.rsp=-,0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             rqf=848202 rqa=0 rpf=80048202 rpa=0 sif=EST,200048 sib=EST,204058
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             af=(nil),0 csf=0x7f414c04c5c0,8200
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             ab=0x7f414c0a4460,7 csb=(nil),0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cof=0x5606b2a59530,1300:PASS(0x7f414c13cbd0)/RAW((nil))/tcpv4(141)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             call trace(16):
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15ff557 [eb a7 e8 62 78 e3 ff 66]: wdt_handler+0x137/0x13e
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x7f4155ec6cf0 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12cf0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b154335a [e9 51 fd ff ff 90 48 89]: main+0x10ad7a
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15e9195 [f6 45 04 10 0f 84 91 01]: task_run_applet+0x115/0x652
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: >Thread 3 : id=0x7f4150e1f700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             stuck=1 prof=0 harmless=0 wantrdv=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cpu_ns: poll=8060319683089 now=8062436351937 diff=2116668848
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             curr_task=0x7f4146cf70f0 (task) calls=3 last=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:               fct=0x5606b14ca3b0(process_stream) ctx=0x7f4144222fd0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             strm=0x7f4144222fd0,10084f src=::ffff:10.53.128.16 fe=main be=backend_celebrus dst=service_col4
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             txn=0x7f41288954d0,40000 txn.req=MSG_DONE,d txn.rsp=MSG_BODY,d
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             rqf=48840000 rqa=8000 rpf=c0008002 rpa=1800000 sif=EST,200028 sib=EST,200138
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             af=(nil),0 csf=0x7f41474c8bf0,104000
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             ab=(nil),0 csb=0x7f4146728160,4000
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cof=0x7f412b366610,80001300:H1(0x7f40ff110540)/SSL(0x7f4103bbda20)/tcpv6(436)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cob=0x7f4111f880e0,2300:H1(0x7f413b25d5d0)/SSL(0x5606e54d6930)/tcpv4(271)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             call trace(13):
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b159f970 [85 c0 75 2c 48 8b 84 24]: ha_dump_backtrace+0x40/0x310
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a0335 [eb 8f 66 0f 1f 84 00 00]: ha_thread_dump+0x275/0x2be
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a03ee [48 8b 05 e3 e3 3b 00 48]: debug_handler+0x6e/0x10e
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x7f4155ec6cf0 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12cf0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b152fb9a [eb bc 0f 1f 40 00 f3 0f]: stktable_set_entry+0x6a/0x6c
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14c2338 [49 8b 75 00 48 89 c5 48]: main+0x89d58
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14cbe21 [85 c0 0f 85 8e ef ff ff]: process_stream+0x1a71/0x4530
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: >Thread 4 : id=0x7f414bfff700 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             stuck=1 prof=0 harmless=0 wantrdv=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cpu_ns: poll=8171314029163 now=8173429514745 diff=2115485582
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             curr_task=0x7f412c292d70 (task) calls=1 last=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:               fct=0x5606b14ca3b0(process_stream) ctx=0x7f412c68e8d0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             strm=0x7f412c68e8d0,808 src=::ffff:10.53.128.16 fe=main be=backend_celebrus dst=unknown
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             txn=0x5606ff7b19b0,40000 txn.req=MSG_BODY,d txn.rsp=MSG_RPBEFORE,0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             rqf=4d08002 rqa=a000 rpf=80000000 rpa=0 sif=EST,200020 sib=INI,30
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             af=(nil),0 csf=0x7f40f78460e0,108000
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             ab=(nil),0 csb=(nil),0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cof=0x7f4137133f60,80001300:H1(0x7f412ec069a0)/SSL(0x7f413b3d3740)/tcpv6(206)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             call trace(13):
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b159f970 [85 c0 75 2c 48 8b 84 24]: ha_dump_backtrace+0x40/0x310
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a0335 [eb 8f 66 0f 1f 84 00 00]: ha_thread_dump+0x275/0x2be
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a03ee [48 8b 05 e3 e3 3b 00 48]: debug_handler+0x6e/0x10e
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x7f4155ec6cf0 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12cf0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b152eee2 [eb ac 66 66 2e 0f 1f 84]: stktable_lookup_key+0x72/0x74
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14c267c [48 89 c1 48 85 c0 74 1d]: main+0x8a09c
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14ccceb [85 c0 0f 85 61 ed ff ff]: process_stream+0x293b/0x4530
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: >Thread 5 : id=0x7f414b7fe700 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             stuck=1 prof=0 harmless=0 wantrdv=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cpu_ns: poll=8079876370243 now=8081970928430 diff=2094558187
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             curr_task=0x7f413c8f1cd0 (task) calls=3 last=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:               fct=0x5606b14ca3b0(process_stream) ctx=0x7f413c7a0a70
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             strm=0x7f413c7a0a70,10084f src=::ffff:10.53.128.16 fe=main be=backend_celebrus dst=service_col2
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             txn=0x7f411108d930,40000 txn.req=MSG_DONE,d txn.rsp=MSG_BODY,d
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             rqf=48840000 rqa=8000 rpf=c0008002 rpa=1800000 sif=EST,200028 sib=EST,200138
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             af=(nil),0 csf=0x7f4107e8e260,104000
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             ab=(nil),0 csb=0x7f413e58a990,4000
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cof=0x7f40ff272fb0,80001300:H1(0x5606d1b6ec60)/SSL(0x7f41464faf60)/tcpv6(266)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cob=0x7f413a1600d0,2300:H1(0x7f40fd6b8050)/SSL(0x7f413a6e5220)/tcpv4(120)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             call trace(13):
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b159f970 [85 c0 75 2c 48 8b 84 24]: ha_dump_backtrace+0x40/0x310
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a0335 [eb 8f 66 0f 1f 84 00 00]: ha_thread_dump+0x275/0x2be
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a03ee [48 8b 05 e3 e3 3b 00 48]: debug_handler+0x6e/0x10e
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x7f4155ec6cf0 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12cf0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b152fb9a [eb bc 0f 1f 40 00 f3 0f]: stktable_set_entry+0x6a/0x6c
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14c2338 [49 8b 75 00 48 89 c5 48]: main+0x89d58
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14cbe21 [85 c0 0f 85 8e ef ff ff]: process_stream+0x1a71/0x4530
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: >Thread 6 : id=0x7f414affd700 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             stuck=1 prof=0 harmless=0 wantrdv=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cpu_ns: poll=8210502541366 now=8212588702519 diff=2086161153
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             curr_task=0x7f4143f9b8d0 (task) calls=1 last=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:               fct=0x5606b14ca3b0(process_stream) ctx=0x7f41401d5880
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             strm=0x7f41401d5880,808 src=::ffff:10.53.128.16 fe=main be=backend_celebrus dst=unknown
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             txn=0x7f40eeaf2df0,40000 txn.req=MSG_BODY,d txn.rsp=MSG_RPBEFORE,0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             rqf=4d08002 rqa=a000 rpf=80000000 rpa=0 sif=EST,200020 sib=INI,30
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             af=(nil),0 csf=0x7f40f43991d0,108000
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             ab=(nil),0 csb=(nil),0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cof=0x7f40ed7d9ec0,80001300:H1(0x56070050dd60)/SSL(0x7f40f4a88270)/tcpv6(393)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             call trace(13):
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b159f970 [85 c0 75 2c 48 8b 84 24]: ha_dump_backtrace+0x40/0x310
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a0335 [eb 8f 66 0f 1f 84 00 00]: ha_thread_dump+0x275/0x2be
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a03ee [48 8b 05 e3 e3 3b 00 48]: debug_handler+0x6e/0x10e
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x7f4155ec6cf0 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12cf0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b152eee2 [eb ac 66 66 2e 0f 1f 84]: stktable_lookup_key+0x72/0x74
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14c267c [48 89 c1 48 85 c0 74 1d]: main+0x8a09c
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14ccceb [85 c0 0f 85 61 ed ff ff]: process_stream+0x293b/0x4530
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: >Thread 7 : id=0x7f414a7fc700 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             stuck=1 prof=0 harmless=0 wantrdv=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cpu_ns: poll=8220943689419 now=8223053497186 diff=2109807767
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             curr_task=0x5606b27d3880 (task) calls=777252934 last=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:               fct=0x5606b1544ea0(process_peer_sync) ctx=0x5606b2721570
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             call trace(11):
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b159f970 [85 c0 75 2c 48 8b 84 24]: ha_dump_backtrace+0x40/0x310
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a0335 [eb 8f 66 0f 1f 84 00 00]: ha_thread_dump+0x275/0x2be
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a03ee [48 8b 05 e3 e3 3b 00 48]: debug_handler+0x6e/0x10e
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x7f4155ec6cf0 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12cf0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15450b7 [e9 2c fe ff ff 80 e6 08]: process_peer_sync+0x217/0x8ea
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]: >Thread 8 : id=0x7f4149ffb700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             stuck=1 prof=0 harmless=0 wantrdv=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cpu_ns: poll=7978544254803 now=7980673840428 diff=2129585625
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             curr_task=0x7f40fd2157e0 (task) calls=3 last=0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:               fct=0x5606b14ca3b0(process_stream) ctx=0x7f41382cb3d0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             strm=0x7f41382cb3d0,10084f src=::ffff:10.53.128.16 fe=main be=backend_celebrus dst=service_col2
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             txn=0x5606cf37f880,40000 txn.req=MSG_DONE,d txn.rsp=MSG_BODY,d
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             rqf=48840000 rqa=8000 rpf=c0008002 rpa=1800000 sif=EST,200028 sib=EST,200138
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             af=(nil),0 csf=0x7f40ff794620,104000
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             ab=(nil),0 csb=0x7f40ff9b91e0,4000
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cof=0x5606ff419b40,80001300:H1(0x7f414ce519d0)/SSL(0x7f413c381810)/tcpv6(55)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             cob=0x7f412b4b9700,2300:H1(0x5606bf1ebd80)/SSL(0x56072d3d2080)/tcpv4(273)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             call trace(13):
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b159f970 [85 c0 75 2c 48 8b 84 24]: ha_dump_backtrace+0x40/0x310
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a0335 [eb 8f 66 0f 1f 84 00 00]: ha_thread_dump+0x275/0x2be
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b15a03ee [48 8b 05 e3 e3 3b 00 48]: debug_handler+0x6e/0x10e
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x7f4155ec6cf0 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12cf0
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b152fb9a [eb bc 0f 1f 40 00 f3 0f]: stktable_set_entry+0x6a/0x6c
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14c2338 [49 8b 75 00 48 89 c5 48]: main+0x89d58
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1243]:             | 0x5606b14cbe21 [85 c0 0f 85 8e ef ff ff]: process_stream+0x1a71/0x4530
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1183]: [NOTICE]   (1183) : haproxy version is 2.4.18-1d80f18
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1183]: [NOTICE]   (1183) : path to executable is /usr/sbin/haproxy
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1183]: [ALERT]    (1183) : Current worker #1 (1243) exited with code 134 (Aborted)
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1183]: [ALERT]    (1183) : exit-on-failure: killing every processes with SIGTERM
Jan 30 13:22:59 ip-10-53-167-219 haproxy[1183]: [WARNING]  (1183) : All workers exited. Exiting... (134)

Additional Information

We did not apply any patch. There was no change in the environment, and nothing happened that could have caused the issue.

wtarreau commented 1 year ago

Hello,

Thanks for your report! It seems that there's a bug causing an infinite loop in the peers applet (which means that your config is larger and has a "peers" section). By the way, it's fine to trim the config for privacy purposes, please just think about mentioning a few points such as "in addition there's a peers section with 2 peers" or "we have 12 frontends and 40 backends" etc, that does help eliminate some possibilities sometimes.

Could you please retrieve the core file and open it with haproxy under gdb, then issue "t a a bt full" (short for "thread apply all bt full") so that we know where the threads were running? I'm particularly interested in resolving that address 0x5606b154335a to a line number. Just out of curiosity, is it a distro package or is it a version that you built internally?

Thanks!

sona0108 commented 1 year ago

Hello,

It is a distro package, rebuilt around the latest stable release, which isn't available from any distro: it is the standard stable tgz pulled from haproxy and substituted into the distro rpm (with the 1 patch removed, as it was already present in the source).

We also enlarged the stick tables to give us a longer run time before failure. We found that we could run a 'reload' and it would briefly spike (likely the result of the old process copying stick tables into the new process) and then return to normal operation and clean up the old entries.

I will share the core file under gdb later.

wtarreau commented 1 year ago

OK. Be careful not to share the core file itself, just the gdb's output!

wtarreau commented 1 year ago

@sona0108, in ticket #2034 @alekseyp-amzn is right and found a bug in the expiration algorithm that will trigger every 49.7 days. In addition, all expired entries are purged in one loop, so I suspect that the issue of extraneous elements rotting in the table can cause a violent purge at one moment and trigger the watchdog. We're currently working on a patch (tested on 2.4 and 2.8-dev, now needs to be finished, cleaned and merged). I think I'll also see how to implement some limits to avoid purging millions of entries at once. No need for the core anymore :-)
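For readers wondering where 49.7 days comes from: it is the wrap period of a 32-bit millisecond counter. The sketch below is only an illustration of that general class of problem, not HAProxy's code and not the actual defect from #2034; the function names are hypothetical, and it assumes expiry dates are stored as 32-bit millisecond ticks.

```python
TICK_BITS = 32
TICK_MOD = 1 << TICK_BITS            # number of distinct tick values
MS_PER_DAY = 24 * 60 * 60 * 1000     # milliseconds in one day


def wrap_period_days() -> float:
    """Days until a 32-bit millisecond counter wraps back to zero."""
    return TICK_MOD / MS_PER_DAY


def naive_is_expired(expire: int, now: int) -> bool:
    """Plain comparison: gives wrong answers around the wrap point."""
    return expire <= now


def wrap_safe_is_expired(expire: int, now: int) -> bool:
    """Compare via the signed 32-bit difference, which survives wrapping."""
    diff = (expire - now) & (TICK_MOD - 1)
    if diff >= TICK_MOD // 2:        # reinterpret as a negative 32-bit value
        diff -= TICK_MOD
    return diff <= 0


print(round(wrap_period_days(), 1))  # 49.7

# One second before the wrap, an entry that should live 5 more seconds
# gets an expiry tick that has already wrapped around to a small value.
now = TICK_MOD - 1000
expire = (now + 5000) % TICK_MOD
print(naive_is_expired(expire, now))      # True  (wrong: still valid)
print(wrap_safe_is_expired(expire, now))  # False (correct)

# Three seconds after the wrap, an entry that expired 5 seconds ago
# still looks like it lies far in the future to the plain comparison.
now = 3000
expire = TICK_MOD - 2000
print(naive_is_expired(expire, now))      # False (wrong: entry lingers)
print(wrap_safe_is_expired(expire, now))  # True  (correct)
```

The second case shows how entries could linger in a table after a wrap if a comparison is not wrap-aware, which is the kind of build-up that a later bulk purge would then have to clear in one go.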

sona0108 commented 1 year ago

Hello,

Thank you for the update, I think it does make sense. Please let me know once you have the new patch.

wtarreau commented 1 year ago

We currently have the patch in 2.8-dev and 2.7-maint. It still needs to be backported to older releases. I don't think it will trivially apply to 2.4 so I need to check first. The last occurrence of the wrapping was on Jan 30th and the next one is for Mar 20th 6:10pm so we still have a bit of margin.
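The two dates can be reproduced with a little arithmetic. This is a sketch under one assumption that the thread does not state: that the tick counter is the low 32 bits of the Unix wall-clock time in milliseconds, so it wraps whenever that time crosses a multiple of 2^32 ms. The helper name is hypothetical, and the crash time from the log is treated as UTC purely for illustration.

```python
from datetime import datetime, timezone

TICK_MOD_MS = 1 << 32  # a 32-bit millisecond counter wraps every 2**32 ms


def wrap_instants_around(when: datetime) -> tuple[datetime, datetime]:
    """Return the last wrap at or before `when` and the first one after it."""
    now_ms = int(when.timestamp() * 1000)
    k = now_ms // TICK_MOD_MS  # number of complete wraps since the epoch

    def to_utc(ms: int) -> datetime:
        # Truncate to whole seconds to keep the output readable.
        return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)

    return to_utc(k * TICK_MOD_MS), to_utc((k + 1) * TICK_MOD_MS)


# Watchdog timestamp from the log above, assumed to be UTC.
crash = datetime(2023, 1, 30, 13, 22, 59, tzinfo=timezone.utc)
previous_wrap, next_wrap = wrap_instants_around(crash)
print(previous_wrap.isoformat())  # 2023-01-30T00:07:25+00:00
print(next_wrap.isoformat())      # 2023-03-20T17:10:12+00:00
```

Under that assumption the previous wrap falls on Jan 30th and the next on Mar 20th at 17:10 UTC, which is 6:10pm CET and so agrees with the dates quoted above.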

sona0108 commented 1 year ago

Our HAProxy version is currently 2.4.18. Do you think the new patch will work for us once it is ready for 2.4, or do we need to upgrade to the latest version in which the patch is already available?

wtarreau commented 1 year ago

Yes it will work, as I could already test it there when I first reproduced the problem. No need to rush an upgrade yet ;-)

sona0108 commented 1 year ago

Sure, thank you very much!

sona0108 commented 1 year ago

Since you are still working on the patch, we will not upgrade anything until then. One more thing: is the same patch available for version 2.6? We are planning to upgrade HAProxy in the future, so please let me know your thoughts.

wtarreau commented 1 year ago

Yes the patch is available here if you want: http://git.haproxy.org/?p=haproxy-2.6.git;a=commitdiff_plain;h=75cf53393

But we'll issue another 2.6 tomorrow due to a security issue, so I'd suggest waiting for the next version (that shouldn't prevent you from trying the patch above on your side though).

sona0108 commented 1 year ago

What will the next version be?

wtarreau commented 1 year ago

Unless I'm mistaken it should be 2.6.9.

sona0108 commented 1 year ago

OK, then I will suggest that my team wait for 2.6.9. Do you have an ETA?

wtarreau commented 1 year ago

As I said, it's tomorrow. We planned 5pm CET.

sona0108 commented 1 year ago

Thank you very much! Please let me know once it is ready.

wtarreau commented 1 year ago

It will be announced like the other ones anyway.

wtarreau commented 1 year ago

2.6.9 is out now.

wtarreau commented 1 year ago

I'm marking the problem as fixed and we can wait a day or two before closing.

sona0108 commented 1 year ago

Thank you very much for the update! Please do not close the case yet; give us some days and we will update you here.

Darlelet commented 1 year ago

Hi @sona0108, were you able to try out 2.6.9 or later to confirm that the issue is gone? Thanks

capflam commented 1 year ago

A gentle ping

capflam commented 1 year ago

2.6.14 has been released. I'm closing the issue because a fix was provided. Feel free to reopen it and add more info if the issue is not fixed. Thanks!