haproxy / haproxy

HAProxy Load Balancer's development branch (mirror of git.haproxy.org)
https://git.haproxy.org/

observe on-error is inconsistent with redispatch #2679

Open windymindy opened 3 months ago

windymindy commented 3 months ago

Detailed Description of the Problem

Hi. While debugging different failure scenarios, I observed that 'on-error' actions only apply to the backend server that was initially chosen for a request. If the request is redispatched and another server also fails to respond successfully, no action is taken for that server.

Expected Behavior

I expect the 'observe', 'error-limit' and 'on-error' configuration to act as a deterministic state machine that changes a backend server's state based on the observed communication with it.
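
For context, these are the directives involved, restated as a minimal sketch from the full configuration further below; the comments paraphrase their documented behaviour in HAProxy 2.8 and are not part of the reporter's file:

backend backend_test_1
    # 'observe layer7' infers server health from the HTTP responses of live traffic,
    # 'error-limit 1' triggers the 'on-error' action after a single observed error,
    # 'on-error sudden-death' simulates a pre-fatal failed health check, so one more
    # failed check marks the server down
    default-server observe layer7 error-limit 1 on-error sudden-death

    # when delivery to the chosen server fails, allow the last retry to be
    # redispatched to another server of the backend
    option redispatch
    retries 2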

Steps to Reproduce the Behavior

  1. Start an environment with two or more backend servers that report healthy in health checks.
  2. Request a faulty endpoint that replies with something that fails the 'observe' verification; I used a 503 status code. (A hypothetical stand-in for such a backend is sketched after this list.)
  3. Observe that the mark-down, sudden-death or fastinter action is applied to the initially chosen backend server only.
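
For step 2, the faulty endpoint can be anything that keeps the health check passing while failing real requests. A hypothetical stand-in, written as an HAProxy listener only for convenience (the name and port are made up; any web server behaving this way will do):

listen faulty_backend_standin
    mode http
    bind 127.0.0.1:8080
    # keep answering 200 on the health-check path so the server stays UP
    http-request return status 200 if { path_end /health }
    # answer 503 to everything else, which 'observe layer7' counts as an error
    http-request return status 503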

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

What is your configuration?

global
    node ***
    description haproxy 1

    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    user haproxy
    group haproxy

    zero-warning

    log stderr format raw daemon info

    maxconn 4000
    #fd-hard-limit
    #ulimit-n

    #cache

    #cluster-secret secret

    ssl-default-bind-options ssl-min-ver TLSv1.2
    ssl-load-extra-del-ext
    ssl-load-extra-files key

    #stats socket /var/lib/haproxy/stats

defaults defaults
    log stdout format raw daemon info
    option httplog
    #option dontlognull
    option log-health-checks

    maxconn 3000

    option http-keep-alive
    option redispatch
    retries 2

    option tcpka
    option clitcpka

    timeout check 10s
    timeout client 1m
    timeout client-fin 10s
    timeout connect 10s
    timeout http-keep-alive 1m
    timeout http-request 10s
    timeout queue 1m
    timeout server 1m
    timeout server-fin 10s
    #timeout tarpit 1m
    timeout tunnel 24h

    compression direction response
    #compression direction both
    compression algo-res gzip
    #compression algo-req gzip

defaults http_reverse_proxy from defaults
    mode http
    option forwarded
    option forwardfor
    http-request set-header X-Forwarded-Host %[req.hdr(host)]
    http-request set-header X-Forwarded-Scheme https
    http-request set-header X-Forwarded-Proto https
    #http-request set-header X-Forwarded-Port %[dst_port]
    http-request set-header X-Real-IP %[src]

defaults

#cache

#http-errors
    #errorfile

frontend main from defaults
    mode http
    bind *:443 no-alpn ssl crt ***

    filter compression

    use_backend backend_test_1 if { path_beg /test_1 }

    default_backend not_found

    stats enable
    stats uri /haproxy
    stats auth ***
    stats refresh 10s
    stats admin if TRUE
    stats show-node
    stats show-desc
    stats show-legends
    stats show-modules

frontend redirect_insecure from defaults
    mode http
    bind *:80
    http-request redirect scheme https code 301 #unless { ssl_fc }

backend not_found from defaults
    mode http
    http-request deny deny_status 404

backend backend_test_1 from http_reverse_proxy
    default-server check inter 15s fastinter 1s fall 2 rise 2 observe layer7 error-limit 1 on-error sudden-death
    server server_1 server-01.local.lab:80
    server server_2 server-02.local.lab:80

    option httpchk
    http-check send meth GET uri /test_1/***/health hdr Authorization "Bearer ***"
    http-check expect status 200

    retry-on all-retryable-errors

Output of haproxy -vv

HAProxy version 2.8.6-f6bd011 2024/02/15 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2028.
Known bugs: http://www.haproxy.org/bugs/bugs-2.8.6.html
Running on: Linux 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64
Build options :
  TARGET  = linux-musl
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment
  OPTIONS = USE_PTHREAD_EMULATION=1 USE_LINUX_TPROXY=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1 USE_TFO=1 USE_QUIC=1 USE_PROMEX=1 USE_PCRE2=1 USE_PCRE2_JIT=1 USE_QUIC_OPENSSL_COMPAT=1
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : -51DEGREES +ACCEPT4 -BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_WOLFSSL -OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX +PTHREAD_EMULATION +QUIC +QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 -SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=4).
Built with OpenSSL version : OpenSSL 3.1.4 24 Oct 2023
Running on OpenSSL version : OpenSSL 3.1.4 24 Oct 2023
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
OpenSSL providers loaded : default
Built with Lua version : Lua 5.4.6
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.42 2022-12-11
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 12.2.1 20220924

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
       quic : mode=HTTP  side=FE     mux=QUIC  flags=HTX|NO_UPG|FRAMED
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
        [BWLIM] bwlim-in
        [BWLIM] bwlim-out
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace

Last Outputs and Backtraces

[WARNING]  (40) : Health check for server backend_test_1/server_2 failed, reason: Health analyze, info: "Detected 1 consecutive errors, last one was: Wrong http response", status: 1/2 UP.
Health check for server backend_test_1/server_2 failed, reason: Health analyze, info: "Detected 1 consecutive errors, last one was: Wrong http response", status: 1/2 UP.
172.16.2.110:56300 [09/Aug/2024:17:16:49.862] main~ backend_test_1/server_2 0/20/0/7/27 503 191 - - ---- 4/4/0/0/+2 0/0 "GET /test_1/*** HTTP/1.1"
Health check for server backend_test_1/server_2 succeeded, reason: Layer7 check passed, code: 200, check duration: 7ms, status: 2/2 UP.
[WARNING]  (40) : Health check for server backend_test_1/server_2 succeeded, reason: Layer7 check passed, code: 200, check duration: 7ms, status: 2/2 UP.

Additional Information

No response

wtarreau commented 3 months ago

Hello,

Thanks for reporting. I'm a little bit confused because the traces above seem to indicate it worked (server_2 was marked as failed and the request was redispatched), but you could very well be facing a more subtle corner case. At the moment I have no idea what could cause this since the counters are normally per server. Maybe the event is not detected after a redispatch, I don't know.

windymindy commented 3 months ago

> Thanks for reporting. I'm a little bit confused because the traces above seem to indicate it worked (server_2 was marked as failed and the request was redispatched), but you could very well be facing a more subtle corner case. At the moment I have no idea what could cause this since the counters are normally per server. Maybe the event is not detected after a redispatch, I don't know.

server_2 was the initial backend server that the request was sent to. The endpoint is implemented to always return an HTTP error in both instances of the service, and the proxy is configured to redispatch. So what happened is that both backend servers were queried and both returned an error, but only one was marked as down. I encourage you to test this yourself.

Let me know if a sequence diagram is needed.
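
For reference, the redispatch in this scenario comes from the following directives of the configuration above, restated here with comments that paraphrase the HAProxy 2.8 documentation:

    # allow the last retry to be sent to another server of the backend
    option redispatch
    # up to two retries after the first failed attempt
    retries 2
    # retry on every error class considered safe to retry, which includes
    # connection failures and 500/502/503/504 responses; this is why the 503
    # is re-attempted on the second server at all
    retry-on all-retryable-errors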

capflam commented 3 weeks ago

I checked, and indeed there is a difference between L4 and L7 errors. The servers' health status is adjusted before the L4 retries but after the L7 retries. So, with 'observe layer4', the status of all tested servers is adjusted; with 'observe layer7', only the status of the last tested server is adjusted.
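
In configuration terms, and only as a sketch of the behaviour described in this comment (prior to the fix), the two variants differ as follows:

    # layer4 observation: the health of every server tried for the request is adjusted
    default-server observe layer4 error-limit 1 on-error sudden-death

    # layer7 observation: only the health of the last server tried is adjusted
    default-server observe layer7 error-limit 1 on-error sudden-death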

capflam commented 1 week ago

It should be fixed now. Thanks!