haproxy / haproxy

HAProxy Load Balancer's development branch (mirror of git.haproxy.org)
https://git.haproxy.org/

haproxy 2.6.16 : connections stuck in close_wait state #2520

Closed DumitruNiculai closed 6 months ago

DumitruNiculai commented 7 months ago

Detailed Description of the Problem

We observe a steady growth of sockets in CLOSE_WAIT state from the moment HAProxy 2.6.16 starts. It continues until the total number of sockets reaches maxconn, at which point the problem gets worse: new legitimate connections are no longer accepted because the CLOSE_WAIT sockets occupy all the slots.

haproxy -version

HAProxy version 2.6.16-c6a7346 2023/12/13 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2027.
Known bugs: http://www.haproxy.org/bugs/bugs-2.6.16.html
Running on: Linux 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023 x86_64

netstat -a -n -o -l -p | grep CLOSE_WAIT | grep 1037/haproxy | wc -l

436

Expected Behavior

No accumulation of CLOSE_WAIT sockets.

Steps to Reproduce the Behavior

  1. Start HAProxy
  2. Observe the number of CLOSE_WAIT sockets

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

What is your configuration?

# cat haproxy.cfg
global
    log         127.0.0.1 local0
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     50000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats

defaults
    mode                    tcp
    log                     global
    retries                 3
    timeout queue           1m
    timeout connect         10s
    timeout client          7200m
    timeout server          7200m
    timeout check           10s
    maxconn                 50000

frontend psql-in1
    mode tcp
    bind *:5400
#    option tcplog
    default_backend             psql-back1

frontend psql-in2
    mode tcp
    bind *:5500
#    option tcplog
    default_backend             psql-back2

frontend psql-in3
    mode tcp
    bind *:5432
#    option tcplog
    default_backend             psql-back3

backend psql-back1
    mode        tcp
#    option      tcplog
    option      httpchk
    http-check  expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server 10.172.235.23 10.172.235.23:5432 maxconn 10000 check check-ssl verify none port 8008 crt /etc/ssl/certs/patroni-api-prod.pem
    server 10.172.235.24 10.172.235.24:5432 maxconn 10000 check check-ssl verify none port 8008 crt /etc/ssl/certs/patroni-api-prod.pem

backend psql-back2
    mode        tcp
#    option      tcplog
    option      httpchk
    http-check  expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server 10.172.235.25 10.172.235.25:5432 maxconn 5000 check check-ssl verify none port 8008 crt /etc/ssl/certs/patroni-api-prod.pem
    server 10.172.235.26 10.172.235.26:5432 maxconn 5000 check check-ssl verify none port 8008 crt /etc/ssl/certs/patroni-api-prod.pem

backend psql-back3
    mode        tcp
#    option      tcplog
    option      httpchk
    http-check  expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server 10.172.235.23 10.172.235.23:5432 maxconn 1000 check check-ssl verify none port 8008 crt /etc/ssl/certs/patroni-api-prod.pem
    server 10.172.235.24 10.172.235.24:5432 maxconn 1000 check check-ssl verify none port 8008 crt /etc/ssl/certs/patroni-api-prod.pem

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /haproxy-stats

Output of haproxy -vv

HAProxy version 2.6.16-c6a7346 2023/12/13 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2027.
Known bugs: http://www.haproxy.org/bugs/bugs-2.6.16.html
Running on: Linux 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment
  OPTIONS = USE_OPENSSL=yes
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE +LIBCRYPT +LINUX_SPLICE +LINUX_TPROXY -LUA -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OT -PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL -PROMEX -QUIC +RT +SLZ -STATIC_PCRE -STATIC_PCRE2 -SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=2).
Built with OpenSSL version : OpenSSL 1.1.1k  FIPS 25 Mar 2021
Running on OpenSSL version : OpenSSL 1.1.1k  FIPS 25 Mar 2021
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with network namespace support.
Support for malloc_trim() is enabled.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built without PCRE or PCRE2 support (using libc's regex instead)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 8.5.0 20210514 (Red Hat 8.5.0-10)

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : none

Available filters :
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace

Last Outputs and Backtraces

No response

Additional Information

No response

capflam commented 7 months ago

Could you share the output of the "show fd" and "show sess all" CLI commands, please? Are these sockets on the frontend or the backend side?
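For reference, a minimal way to capture these dumps from the runtime API, assuming socat is installed and using the "stats socket /var/lib/haproxy/stats" path from the configuration posted above (the output file names are just illustrative):

# dump the file descriptor table
echo "show fd" | socat stdio unix-connect:/var/lib/haproxy/stats > show-fd.txt

# dump all streams with full details
echo "show sess all" | socat stdio unix-connect:/var/lib/haproxy/stats > show-sess.txt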

wtarreau commented 7 months ago

Another point: you're running with a large pair of timeout values of 2 hours. Do these sockets continue to accumulate past the two hours, or do they remain stable? It is possible that these are "just" the result of some clients disappearing from the net after having sent only a FIN after their request.

Do you have any firewall anywhere in the chain, e.g. on the other side of these CLOSE_WAIT sockets? What could happen is that a client closes with a shutdown, the shutdown is passed to the other side and triggers a lower timeout on a firewall, which quickly closes the connection while, for various reasons, the server doesn't receive it. You could quickly end up with a FIN_WAIT1 on one side and a CLOSE_WAIT on the other side, each waiting for the first timeout to trigger.

"option abortonclose" could be used to terminate half-closed connections, though this might or might not be what you want on TCP communications. Alternately you may also set "timeout client-fin" and "timeout server-fin" to much lower values to shorten the timeouts once a FIN was transmitted, in order to better deal with vanishing machines.

DumitruNiculai commented 6 months ago

Good day Willy.

The issue has been resolved.

As you suggested, we set these parameters, which got rid of the stuck CLOSE_WAIT connections: "timeout client-fin 5s" and "timeout server-fin 5s".
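In terms of the configuration posted above, the change amounts to adding these two lines (shown here in the defaults section; that placement is an assumption, as they could equally go in the individual frontend and backend sections):

defaults
    timeout client-fin      5s
    timeout server-fin      5s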

We are running high-availability PostgreSQL clusters in our production system, based on Patroni, with HAProxy as the proxy for user/application connections. We don't have any firewalls on the server side. However, the various applications run in Kubernetes pods that sometimes fail, and these failures create the stuck CLOSE_WAIT connections.

Thank you, Willy, for your help and support.