Closed felipewd closed 2 years ago
So, now we have a little more pressure:
14.44% haproxy [.] run_tasks_from_lists
9.83% haproxy [.] __tasklet_wakeup_on
4.92% haproxy [.] sc_notify
4.91% haproxy [.] _do_poll
3.93% haproxy [.] conn_subscribe
3.73% haproxy [.] wake_expired_tasks
2.83% haproxy [.] h1_wake_stream_for_send.part.0
2.65% haproxy [.] sc_conn_send
2.32% haproxy [.] clock_update_date
1.87% haproxy [.] sc_conn_process
1.87% haproxy [.] h1_io_cb
1.80% haproxy [.] h1_process
1.78% haproxy [.] h1_snd_buf
1.66% [kernel] [k] copy_user_enhanced_fast_string
1.57% haproxy [.] sc_conn_io_cb
1.29% haproxy [.] h1_refresh_timeout
1.26% haproxy [.] h1_subscribe
1.23% haproxy [.] h1_send
1.05% haproxy [.] sc_conn_recv
0.85% [kernel] [k] check_preemption_disabled
0.84% [kernel] [k] syscall_return_via_sysret
0.84% [vdso] [.] __vdso_clock_gettime
0.80% haproxy [.] stream_release_buffers
0.75% haproxy [.] sock_check_events
0.69% [kernel] [k] _raw_spin_lock_irqsave
0.66% haproxy [.] eb32_lookup_ge
0.66% [kernel] [k] _raw_spin_lock
0.61% [vdso] [.] 0x0000000000000695
0.54% [kernel] [k] __fget_light
0.53% [kernel] [k] tasklet_action_common.constprop.0
with a pretty htop picture to show the trouble :-)
The new show activity:
thread_id: 15 (1..35)
date_now: 1661466496.509509
ctxsw: 1372606352 [ 2751990889 4052939029 1702347210 3001496378 2846162639 2209864929 3640547537 2435428430 3014082503 3206866113 3087953261 2816838642 2506207637 2421360451 2764658636 1212046041 3086260444 2743796466 1215782719 2809535673 2568010263 3874744991 2141278815 3143812904 1950730873 2085440325 3316869333 1998315654 2640787318 2550662231 3555974515 4028018279 3335187923 1929122843 3216764970 ]
tasksw: 273160 [ 6737 9324 8626 8001 7358 5918 7827 6928 8471 6426 7318 9301 10579 6687 8303 9563 8359 6169 6343 8625 7205 8544 8414 8027 6731 7480 9752 7494 6727 7644 8146 7751 8557 6242 7583 ]
empty_rq: 797207 [ 18780 17141 78206 14702 20003 18246 20337 15846 16812 34001 13511 34372 18908 16540 15641 13332 14553 13238 13237 72837 15764 55942 20115 14062 13735 18588 17023 17412 50423 16736 14659 12791 17375 14769 17570 ]
long_rq: 2088406030 [ 55022487 81041033 34026069 60009097 56901709 44179823 72792750 48690003 60265178 64122176 61741539 56318214 50104004 48408759 55276882 110124431 61709408 54860430 110203726 56168929 51339409 77480686 42804293 62859978 38997682 41688548 66317272 39946804 52796660 50995232 71099449 80544922 66686593 38563298 64318557 ]
loops: 2100263178 [ 55346638 81371678 34464563 60389153 57295848 44502777 73132515 49029408 60571971 64419788 62063392 56676328 50469490 48751487 55580501 110421742 62003259 55148278 110420869 56621388 51723021 77787285 43193752 63160689 39306974 42059047 66692510 40303808 53181633 51324293 71466523 80833854 67006215 38911871 64630630 ]
wake_tasks: 2088406216 [ 55022493 81041038 34026075 60009103 56901717 44179828 72792755 48690008 60265185 64122181 61741544 56318218 50104010 48408763 55276886 110124435 61709413 54860435 110203732 56168934 51339414 77480692 42804298 62859983 38997687 41688552 66317276 39946808 52796666 50995238 71099457 80544926 66686599 38563304 64318563 ]
wake_signal: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_io: 15848925 [ 422789 468268 524592 491679 479408 404930 459685 432436 437285 430888 422784 444899 514116 410054 389513 568838 387446 361619 383142 580167 523031 466742 451104 417192 465245 461995 462440 437869 529134 418724 485545 437786 408989 406508 462083 ]
poll_exp: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_drop_fd: 59319 [ 1395 2079 1816 1680 1499 1229 1716 1520 1813 1395 1718 1971 2425 1427 1777 2229 1781 1295 1245 1942 1521 1936 1825 1757 1474 1656 2043 1640 1396 1696 1811 1681 1932 1318 1681 ]
poll_skip_fd: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
conn_dead: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
stream_calls: 159429 [ 3964 5396 5029 4696 4322 3498 4529 4083 4953 3722 4257 5404 6146 3914 4876 5489 4896 3633 3640 5006 4239 4970 4915 4678 3960 4371 5678 4399 3977 4471 4759 4516 4968 3679 4396 ]
pool_fail: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
buf_wait: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_tot: 631 [ 23 3 22 18 39 37 25 24 3 5 17 22 34 9 9 39 20 16 6 5 5 21 5 32 3 27 24 7 3 20 2 23 38 23 22 ]
cpust_ms_1s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_15s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
avg_loop_us: 16 [ 22 23 24 24 6 22 22 22 6 21 23 22 5 22 22 5 24 5 5 22 24 6 23 6 6 22 23 21 5 6 6 22 6 22 5 ]
accepted: 82841 [ 4 664 41417 6 4 5 3 12 3 2632 5 885 120 4 4 7 2 6 49 20324 89 10372 27 214 16 4 7 475 5446 9 9 5 3 3 6 ]
accq_pushed: 82841 [ 2000 2863 2620 2411 2183 1753 2391 2067 2567 1925 2298 2831 3284 2033 2509 2940 2511 1861 1802 2670 2153 2621 2586 2451 2030 2295 2950 2286 1991 2324 2479 2336 2644 1876 2300 ]
accq_full: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
accq_ring: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
fd_takeover: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
I'll post fd.log here as an attachment: fd.log
I also did a show threads to try and help here:
Thread 1 : id=0x7f743382cfc0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/1 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=539791111616 now=539791128097 diff=16481
curr_task=0
Thread 2 : id=0x7f74338096c0 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/2 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=682584850854 now=682584865013 diff=14159
curr_task=0x7f742c5b7b30 (tasklet) calls=15789944
fct=0x49fc30(h1_io_cb) ctx=0x7f742c6abd30
Thread 3 : id=0x7f74315346c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/3 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=526722981077 now=526723071838 diff=90761
curr_task=0
* Thread 4 : id=0x7f741ad4b6c0 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=0 rqsz=2
1/4 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=559794720313 now=559794871342 diff=151029
curr_task=0x7f7404368270 (task) calls=2 last=0
fct=0x5bfcc0(task_run_applet) ctx=0x7f73fc0a4160(<CLI>)
strm=0x7f73fc3d3a80,8 src=unix fe=GLOBAL be=GLOBAL dst=<CLI>
txn=(nil),0 txn.req=-,0 txn.rsp=-,0
rqf=c0c023 rqa=0 rpf=80008000 rpa=0
scf=0x7f73fc2379d0,EST,0 scb=0x7f73fc2b9510,EST,1
af=(nil),0 sab=0x7f73fc0a4160,9
cof=0x7f7404926c60,40300:PASS(0x7f73fc31c800)/RAW((nil))/unix_stream(541)
cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
Thread 5 : id=0x7f741a54a6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/5 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=493712370361 now=493712525740 diff=155379
curr_task=0
Thread 6 : id=0x7f7419d496c0 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/6 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=485399557521 now=485399707714 diff=150193
curr_task=0x7f74001f1610 (tasklet) calls=32978096
fct=0x562140(sc_conn_io_cb) ctx=0x7f74001ef790
strm=0x7f74005ab990,43c0a src=170.80.58.66 fe=ssl be=hyper_streaming dst=v01
txn=0x7f74001e8880,3000 txn.req=MSG_DONE,4c txn.rsp=MSG_DATA,d
rqf=4804e060 rqa=0 rpf=8004c060 rpa=0
scf=0x7f74001ef790,EST,0 scb=0x7f7400134f00,CLO,219
af=(nil),0 sab=(nil),0
cof=0x7f73b0787ed0,801c0300:H1(0x7f73b023e8e0)/RAW((nil))/tcpv4(2585)
cob=0x7f74044c9110,10080300:H1(0x7f73b00ee7b0)/RAW((nil))/unix_stream(2821)
Thread 7 : id=0x7f74195486c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=2
1/7 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=559474594077 now=559474756463 diff=162386
curr_task=0
Thread 8 : id=0x7f7418d476c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/8 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=541279191436 now=541279356512 diff=165076
curr_task=0
Thread 9 : id=0x7f74185466c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/9 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=638118984122 now=638119157680 diff=173558
curr_task=0
Thread 10: id=0x7f7417d456c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/10 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=495671580046 now=495671757255 diff=177209
curr_task=0
Thread 11: id=0x7f74175446c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=3
1/11 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=726126873197 now=726127050381 diff=177184
curr_task=0
Thread 12: id=0x7f7416d436c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/12 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=540441479324 now=540441682578 diff=203254
curr_task=0
Thread 13: id=0x7f74165426c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=4
1/13 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=602301788886 now=602301971826 diff=182940
curr_task=0
Thread 14: id=0x7f7415d416c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/14 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=372565060411 now=372565256775 diff=196364
curr_task=0
Thread 15: id=0x7f74155406c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/15 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=431565543435 now=431565724915 diff=181480
curr_task=0
Thread 16: id=0x7f7414d3f6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/16 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=832710319322 now=832710645390 diff=326068
curr_task=0
Thread 17: id=0x7f741453e6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/17 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=540287123702 now=540287358250 diff=234548
curr_task=0
Thread 18: id=0x7f7413d3d6c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/18 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=736694801720 now=736695008381 diff=206661
curr_task=0
Thread 19: id=0x7f741353c6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/19 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=1000267779058 now=1000268006877 diff=227819
curr_task=0
Thread 20: id=0x7f7412d3b6c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/20 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=554938771371 now=554938983617 diff=212246
curr_task=0
Thread 21: id=0x7f741253a6c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/21 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=597860267189 now=597860478353 diff=211164
curr_task=0x7f73f04e8a50 (tasklet) calls=25659714
fct=0x562140(sc_conn_io_cb) ctx=0x7f73f0145ef0
strm=0x7f73f01b2b50,43c0a src=170.80.58.66 fe=ssl be=hyper_streaming dst=v01
txn=0x7f73f04d06a0,3000 txn.req=MSG_DONE,4c txn.rsp=MSG_DATA,d
rqf=4804e060 rqa=0 rpf=8004c060 rpa=0
scf=0x7f73f0145ef0,EST,0 scb=0x7f73f04e45a0,CLO,219
af=(nil),0 sab=(nil),0
cof=0x7f73b04ee570,801c0300:H1(0x7f7404114f80)/RAW((nil))/tcpv4(3134)
cob=0x7f73f04e8790,10080300:H1(0x7f74045eb220)/RAW((nil))/unix_stream(3286)
Thread 22: id=0x7f7411d396c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/22 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=812174704897 now=812174942077 diff=237180
curr_task=0
Thread 23: id=0x7f74115386c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/23 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=347139385924 now=347139623569 diff=237645
curr_task=0
Thread 24: id=0x7f7410d376c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/24 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=713553442243 now=713553677946 diff=235703
curr_task=0
Thread 25: id=0x7f74105366c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=2
1/25 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=527119604979 now=527119844516 diff=239537
curr_task=0x7f73c4185a20 (tasklet) calls=3986372
fct=0x49fc30(h1_io_cb) ctx=0x7f73c418c5b0
Thread 26: id=0x7f740fd356c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/26 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=565362961004 now=565363251370 diff=290366
curr_task=0
Thread 27: id=0x7f740f5346c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/27 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=471811449531 now=471811741222 diff=291691
curr_task=0
Thread 28: id=0x7f740ed336c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/28 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=422315547211 now=422315865362 diff=318151
curr_task=0
Thread 29: id=0x7f740e5326c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/29 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=513688550811 now=513688860417 diff=309606
curr_task=0
Thread 30: id=0x7f740dd316c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/30 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=485147869408 now=485148188583 diff=319175
curr_task=0
Thread 31: id=0x7f740d5306c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/31 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=580571993751 now=580572318316 diff=324565
curr_task=0
Thread 32: id=0x7f740cd2f6c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/32 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=651206521989 now=651206836749 diff=314760
curr_task=0
Thread 33: id=0x7f740c52e6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/33 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=509814804991 now=509815117988 diff=312997
curr_task=0
Thread 34: id=0x7f740bd2d6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/34 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=450111160223 now=450111503420 diff=343197
curr_task=0
Thread 35: id=0x7f740b52c6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/35 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=461606548056 now=461606890596 diff=342540
curr_task=0
Here is a flamegraph for one of the hot threads:
There seems to be substantial contention:
Samples: 120K of event 'cycles', Event count (approx.): 85267650382
Children Self Command Shared Object Symbol
- 87.53% 0.47% haproxy haproxy [.] run_poll_loop
- 87.06% run_poll_loop
- 63.03% process_runnable_tasks
- 62.45% run_tasks_from_lists <================================================.
- 28.32% sc_conn_io_cb |
- 27.99% sc_conn_io_cb |
- 17.54% sc_conn_send |
- 12.66% h1_subscribe |
- 11.42% conn_subscribe |
- 9.63% _tasklet_wakeup_on (inlined) |
6.23% __tasklet_wakeup_on |
0.65% sock_check_events |
2.26% h1_snd_buf |
- 8.51% sc_conn_process |
|
|
| activity[tid].ctxsw++; |
0.76 |130: mov tid@@Base+0x68a9c0,%rsi |
| budgets[queue]--; |
0.00 | sub $0x1,%edx |
| process = t->process; |
0.01 | mov -0x10(%r11),%r9 |
| t = (struct task *)LIST_ELEM(tl_queues[queue].n, struct tasklet *, list); |
0.75 | lea -0x18(%r11),%rbp |
| budgets[queue]--; |
0.81 | mov %edx,(%rcx) |
| ctx = t->context; |
0.02 | mov -0x8(%r11),%rcx |
| activity[tid].ctxsw++; |
0.06 | mov %fs:(%rsi),%edx |
0.80 | shl $0x7,%rdx |
1.68 | addl $0x1,0x7315e4(%rdx) |
| t->calls++; |
0.73 | addl $0x1,-0x14(%r11) |
| th_ctx->flags &= ~TH_FL_STUCK; // this thread is still running |
0.88 | andl $0xfffffffe,0x34(%rax) |
| th_ctx->current = t; |
0.00 | mov %rbp,0x20(%rax) |
| _HA_ATOMIC_DEC(&th_ctx->rq_total); |
36.32 | lock subl $0x1,0xd0(%rax) <=================================+
| LIST_DEL_INIT(&((struct tasklet *)t)->list); |
0.07 | mov (%r11),%rdx |
0.01 | mov 0x8(%r11),%rax |
| if (t->state & TASK_F_TASKLET) { |
1.67 | testb $0x80,-0x17(%r11) |
| LIST_DEL_INIT(&((struct tasklet *)t)->list); |
| mov %rax,0x8(%rdx) |
0.00 | mov %rdx,(%rax) |
| movq %r11,%xmm0 |
1.67 | punpcklqdq %xmm0,%xmm0 |
| movups %xmm0,(%r11) |
| if (t->state & TASK_F_TASKLET) { |
| je 230 |
| if (unlikely(task_profiling_mask & tid_bit)) { |
0.00 | mov tid_bit@@Base+0x68ce00,%rdi |
1.75 | mov task_profiling_mask,%rax |
| and %fs:(%rdi),%rax |
| mov %rax,0x10(%rsp) |
| jne 348 |
| state = _HA_ATOMIC_FETCH_AND(&t->state, TASK_PERSISTENT); |
1.79 |1ae: mov -0x18(%r11),%eax |
0.00 |1b2: mov %eax,%esi |
| mov %eax,%edx |
| and $0x180b8,%esi |
35.36 | lock cmpxchg %esi,0x0(%rbp) <=================================='
0.08 | jne 1b2
| done++;
1.47 | add $0x1,%r12d
| if (likely(!(state & TASK_KILLED))) {
0.00 | test $0x20,%dl
| jne 310
| process(t, ctx, state);
0.01 | mov %rbp,%rdi
0.01 | mov %rcx,%rsi
1.59 | call *%r9
| if (unlikely(task_profiling_mask & tid_bit)) {
0.08 | mov tid_bit@@Base+0x68ce00,%rdi
0.62 | mov task_profiling_mask,%rax
0.02 | mov th_ctx@@Base+0x68ce08,%r10
1.26 | and %fs:(%rdi),%rax
| jne 4e0
| th_ctx->current = NULL;
0.07 |1f8: mov %fs:(%r10),%rax
1.27 |1fc: movq $0x0,0x20(%rax)
Thanks for the details. It looks like there's something causing wakeups in a loop. This could be a mishandled event that we disable then re-enable, or something like this. It looks related to h1_wake_stream_for_send() but that could also just be an artefact. However this one does have the tasklet_wakeup() call, so there really seems to be something there.
Do you remember which last version didn't exhibit this bad behavior ? Was this 2.6.something or 2.4 ? Maybe one new error check somewhere cannot escape from a bad situation.
I might ask you to try to produce traces, but this can be heavy (especially in such a loop) so we may have to check the best option. One such option could be that I backport the file-backed rings to 2.6 because that significantly eases it.
Hi @wtarreau thanks for the reply.
This is a new deploy, and because we needed 2.6-only features this is a fresh setup...so nothing to really compare it to.
Right, we're around ~10Gbps of traffic with all 35 threads at 100%. (We're supposed to increase to ~55-60Gbps on this machine, but we're hanging on until this is fixed).
We're happy to collect any metrics you think might help.
Wow, 100% on 35 threads for "only" 10G is a lot :-(
Then we could start with something non-intrusive. On the CLI, please send: set profiling tasks on
(it could slightly increase CPU usage by ~1% but in your case that will not be noticeable). Please wait 10 seconds or so and issue show profiling tasks. Then you can send set profiling tasks off.
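For reference, these commands are issued on the stats socket. A minimal helper, sketched under the assumption that a Unix stats socket is configured (the socket path below is a placeholder, not taken from this thread):

```python
import socket

def haproxy_cli(cmd, sock_path='/var/run/haproxy.sock'):
    """Send one command to the HAProxy stats socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall((cmd + '\n').encode())
        chunks = []
        while True:          # HAProxy closes the connection after replying
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks).decode()

# Sketch of the sequence described above:
#   haproxy_cli('set profiling tasks on')
#   time.sleep(10)
#   print(haproxy_cli('show profiling tasks'))
#   haproxy_cli('set profiling tasks off')
```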
This will indicate the various function names that are task handlers and that are woken up. The counts will hopefully help us. I have some improvements in development (not finished yet) that I hope to land soon, which will indicate who wakes up each function. But in your case it seems to be h1_wake_some_streams_for_send() or something like this.
Also, thinking about something, as you're saying it's a new deployment, could you please check with top/htop/taskset/whatever that haproxy is properly running on a single physical CPU socket so that we can be certain there are no atomic ops shared over the QPI bus ? I'm seeing 35 threads but if they're neither bound by taskset nor by a cpu-map directive, this could be devastating. Note that with such a setup you could be interested in giving 2.7-dev a try from time to time, but then I'll guide you through this.
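For reference, binding can be done externally with taskset or in the configuration with the cpu-map directive mentioned above. A hypothetical fragment (the core numbers are an assumption; check the machine's topology with lscpu first):

```
global
    nbthread 35
    # pin thread N to CPU N-1, keeping all threads on one physical socket
    # (core IDs are illustrative)
    cpu-map auto:1/1-35 0-34
```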
Right, pretty simple, here you go:
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
sc_conn_io_cb 90154616 1.698m 1.130us - -
h1_io_cb 89799037 2.554m 1.706us - -
process_stream 8558 93.00ms 10.87us 1.702s 198.8us
h1_timeout_task 6337 8.274us 1.000ns 374.4ms 59.08us
accept_queue_process 5822 39.18ms 6.729us - -
task_run_applet 37 718.3us 19.41us 967.1us 26.14us
main+0xfbf00 18 23.19us 1.288us 1.759ms 97.70us
srv_cleanup_idle_conns 4 10.81us 2.701us 313.2us 78.30us
srv_cleanup_toremove_conns 1 37.51us 37.51us 23.99us 23.99us
This was with set profiling tasks on; sleep 10; show profiling tasks :-)
I did not pin specifically to a single socket, but I can do it, sure. Since this behavior started happening at around ~800Mbps, I didn't consider this as a possibility. I'll do it now.
Ok, so now I've bound all threads to the same physical CPU using taskset, but there's no change in the behavior.
I repeated the profiling from before; now it looks like this:
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
sc_conn_io_cb 265481197 1.279m 288.0ns - -
h1_io_cb 263855587 1.225m 278.0ns - -
process_stream 28627 259.4ms 9.060us 129.1ms 4.507us
h1_timeout_task 21405 13.88us - 221.6ms 10.35us
accept_queue_process 19497 113.3ms 5.809us - -
task_run_applet 155 1.787ms 11.53us 777.1us 5.013us
ssl_sock_io_cb 102 13.37ms 131.0us - -
main+0xfbf00 53 40.59us 765.0ns 534.6us 10.09us
srv_cleanup_idle_conns 10 5.810us 581.0ns 95.13us 9.512us
This is the current htop scenario:
@wtarreau Also, not sure if it's relevant/helps, but on a similar machine (better CPU) we get way less %usr and way more %sys on ~85Gbps of real traffic.
But all our other machines are on 2.4 branch. This new one is our 1st real-traffic-attempt to dip our toes on 2.6 :-)
The only news on the 2.6 conf is that we've transferred JWT validation from our applications to haproxy, the rest is pretty much the same. The traffic pattern is also the same.
Thanks. So that's 9 million calls per second to the I/O handler, which sounds like quite a lot! At 10 Gbps, that's roughly one call per 100 bytes; that cannot match realistic traffic, so something must be causing spurious wakeups.
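The arithmetic behind that estimate, as a quick sketch (the call count is taken from the first profiling dump above; the 10-second window is an approximation):

```python
# Rough sanity check of the wakeup rate.
calls = 90_154_616            # sc_conn_io_cb calls in the profiling dump
window_s = 10                 # approximate profiling window
rate = calls / window_s       # ~9 million calls per second

gbps = 10                     # reported traffic level
bytes_per_s = gbps * 1e9 / 8  # 1.25 GB/s
bytes_per_call = bytes_per_s / rate

print(round(rate))            # 9015462
print(round(bytes_per_call))  # 139
```

At ~139 bytes of payload per handler call, almost every wakeup must be doing no useful work.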
I've extracted some stats from your fd.log that might help us, I'm pasting them here so that it's easier to get back to them:
$ < /dev/shm/fd.log sed -ne 's,^.*\(back=[^ ]*\).*\(cflg=[^ ]*\).*\(req.state=[^ ]*\).*\(res.state=[^ ]*\).*\(sd.flg=[^ ]*\).*,\1 \2 \3 \4 \5,p' | sort |uniq -c|sort -nr | column -t
1167 back=1 cflg=0x10000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x009c0001
1074 back=0 cflg=0x80000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02004001
93 back=0 cflg=0x80000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02005001
14 back=1 cflg=0x10080300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x009c0a01
9 back=0 cflg=0x801c0300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x0200c001
6 back=0 cflg=0x80040300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x0200c001
2 back=1 cflg=0x10000300 req.state=MSG_DONE res.state=MSG_RPBEFORE sd.flg=0x02800001
2 back=1 cflg=0x10000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02840001
2 back=0 cflg=0x80000300 req.state=MSG_DONE res.state=MSG_RPBEFORE sd.flg=0x02804001
2 back=0 cflg=0x80000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02804001
1 back=1 cflg=0x10080300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02840a01
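The same aggregation can be sketched in Python for anyone replaying this against their own fd.log (the regex mirrors the five fields captured by the sed expression above):

```python
import re
from collections import Counter

# Extract the same five fields as the sed pipeline and count how many
# fd.log lines share each combination, most frequent first.
FIELDS = re.compile(
    r'(back=\S+).*(cflg=\S+).*(req\.state=\S+).*(res\.state=\S+).*(sd\.flg=\S+)'
)

def summarize(lines):
    counts = Counter()
    for line in lines:
        m = FIELDS.search(line)
        if m:
            counts[' '.join(m.groups())] += 1
    return counts.most_common()

# Usage: for combo, n in summarize(open('fd.log')): print(n, combo)
```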
A quick look shows that the 14 backend conns have sent the write shutdown, which is unusual enough to raise some awareness here, and the 9 frontend ones are shut in both directions, and again that's suspicious enough to see so many in ratio. I suspect that the wakeups come from read shutdowns that are reported by the poller and ignored for whatever reason by the mux because something would prevent it from handling them. That's very strange as we haven't had such an issue in a long time now. CCing @capflam so that he can also follow and chime in in case he has any idea.
I'm pretty sure your CPU binding is unrelated for now. By default, it's supposed to automatically bind to a single NUMA node but I asked as I wanted to be certain you didn't have something odd.
Thanks for the info. JWT should be totally unrelated. Here I'm really suspecting a logic error in dealing with a shutdown report.
By the way, that would explain the stream wakeups. Maybe the mux receives the shutdown, doesn't disable reading, but wakes the stream up. Past a certain load, the short time during which the event may trigger again starts to count and could matter. That might explain why it could have remained undetected.
BTW, out of curiosity and totally unrelated, what NIC are you using for your 85G ? An mlx connectx-4 or 5 I guess, or may be an intel ICE ?
The 2.6 machine is with a Mellanox connectx-4, our heavier ones (2.4) are with connectx-5..
A closer look by focusing on connections whose flags are 0x801c0300 shows several following exactly this pattern:
2449 : st=0x001d22(cl HEOpI W:sRa R:sRa) tmask=0x20000000 umask=0x0 owner=0x7f7394446640 iocb=0x5adbe0(sock_conn_iocb) back=0 cflg=0x801c0300 fam=ipv4 lport=80 fe=ssl mux=H1 ctx=0x7f739470c050 h1c.flg=0x102202 .sub=0 .ibuf=2@0x7f73c86df380+593/65536 .obuf=0@(nil)+0/0 h1s=0x7f739470d5d0 h1s.flg=0x104048 .sd.flg=0x200c001 .req.state=MSG_DONE .res.state=MSG_DATA .meth=GET status=206 .sd.flg=0x0200c001 .sc.flg=0x00000000 .sc.app=0x7f73c825077e .subs=0x7f73c861a350(ev=2 tl=0x7f73c824d1e0 tl.calls=178435763 tl.ctx=0x7f73c861a340 tl.fct=sc_conn_io_cb) xprt=RAW !
I.e.:
The most suspicious ones are easy to spot using grep 'tl.calls=[0-9]\{8\}' fd.log.
OK I managed to reproduce it by hand. It's a bit complicated to describe as it's very timing sensitive, I'll check this tomorrow with Christopher.
I'm currently bisecting it, because it doesn't affect 2.4.0 but does affect latest 2.4, so please do not rush on updating your 2.4 yet as you could face the problem as well.
So that's caused by this one in 2.4.16, which originated in 2.6 as ba25116c1:
commit b3666ef018a94a5cbcc3a143702a7b3d0b108f4c
Author: Willy Tarreau <w@1wt.eu>
Date: Thu Mar 17 16:19:09 2022 +0100
BUG/MEDIUM: stream-int: do not rely on the connection error once established
We're trading a bug for another one :-(
It seems to indicate that there's one remaining case in the H1 mux where a connection error is not reported as its equivalent stream error. We'll check that tomorrow.
@felipewd I could find and test a solution to this bug, but I'm not merging it yet because it gave me the opportunity to analyze something possibly deeper, something that's just suboptimal in error processing but that I want to sort out to understand the whole picture. In the meantime, I'm attaching the patch here, which works on 2.7 and should trivially apply to 2.6 as well, so feel free to apply it to your 2.6 branch.
0001-BUG-MEDIUM-mux-h1-do-not-refrain-from-signaling-erro.patch.txt
Hi @wtarreau great, thanks!
We've deployed it on the machine; let's observe. But a quick profiling afterwards shows it's really promising:
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
sc_conn_io_cb 3396744 11.06s 3.254us - -
h1_io_cb 458179 345.4ms 753.0ns - -
process_stream 15897 361.4ms 22.73us 16.87ms 1.061us
h1_timeout_task 7177 - - 113.4ms 15.80us
accept_queue_process 4593 29.00ms 6.313us - -
task_run_applet 95 1.437ms 15.12us 514.4us 5.414us
ssl_sock_io_cb 52 6.730ms 129.4us - -
main+0xfbef0 31 25.00us 806.0ns 97.66us 3.150us
srv_cleanup_idle_conns 6 7.271us 1.211us 33.27us 5.544us
I'll report back after a few hours.
Ah much better indeed, thank you for the quick feedback! It's safe to keep this patch for now, I'm just not certain whether it will still be needed or not once I get down to related issues I found at lower layers, which is why I prefer not to merge it yet.
So finally I pushed it. The other one I thought it depended on was of much less importance and not directly related. You can definitely keep running on that one until 2.6.5.
@wtarreau so, after 3 hours, the differences are staggering: the machine can handle a lot more traffic than before with way less pressure. The %usr went from around 30,000% to push 10Gbps of data down to around 405% to handle 17Gbps currently. This is the profiling for the higher traffic:
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
sc_conn_io_cb 2942999 10.30s 3.499us - -
h1_io_cb 397679 303.4ms 762.0ns - -
process_stream 11834 267.6ms 22.61us 14.53ms 1.228us
h1_timeout_task 5211 - - 86.26ms 16.55us
accept_queue_process 3383 20.83ms 6.157us - -
task_run_applet 44 612.4us 13.92us 196.7us 4.471us
ssl_sock_io_cb 22 3.457ms 157.2us - -
main+0xfbef0 14 14.08us 1.005us 104.0us 7.431us
srv_cleanup_idle_conns 3 2.132us 710.0ns 3.553us 1.184us
So as far as our side goes, this is a complete success. Thanks a lot! :-)
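To put a number on that, a back-of-the-envelope sketch using the figures quoted above (~30,000% usr for 10Gbps before the patch, ~405% for 17Gbps after):

```python
# CPU cost per Gbps, before and after the fix.
before = 30000 / 10   # 3000 %CPU per Gbps
after = 405 / 17      # ~23.8 %CPU per Gbps
improvement = before / after

print(round(after, 1))     # 23.8
print(round(improvement))  # 126: roughly 126x less CPU per unit of traffic
```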
Here's how (drastic) it looks; it should be pretty easy to spot when the patched version was rolled out:
Thanks a lot guys for the great feedback. Indeed the difference on the graph is impressive. The peak in the middle was due to my bug :-) I've merged the fix in mainline. Just reopening the issue so that we can track the fact that the fix needs to be backported as far as 2.5 and will close once done.
Ah Lukas just beat me to retagging. Thanks Lukas!
By the way I'll have to retest more carefully with 2.4 because I'm pretty sure I originally managed to reproduce it on 2.4 and even mentioned it above but yesterday I failed at it. So maybe we'll have to backport all this as far as 2.0 in the end. Let's not close this once it lands in 2.5.
Hi @wtarreau for what it's worth, the graph posted by @rnsanchez is on 2.4.
We had downgraded from 2.6 to 2.4 using our usual production conf in the hope this was contained to 2.6... but as you can see, 2.6 with the fix is way better than 2.4-latest. It's clearly much more difficult to trigger this bug on 2.4, but the %usr usage seems to indicate it's possible.
Ah so you mean that the period on the left was 2.4, then the middle was 2.6 and finally on the right was 2.6+fix ? Or even maybe the opposite for the first two, i.e. 2.6, 2.4 then 2.6+fix ?
The latter. Over the weekend we switched back to 2.4 (disabling JWT validation altogether) until we got the patch. So it was 2.6, then 2.4 from Friday evening, until 2.6+fix finally yesterday.
So the entire graph (pre-patch) is 2.4
OK so it confirms my previous observation that 2.4-latest had the issue as well. I don't know why I didn't reproduce it lately, I need to recheck more carefully.
OK, now understood. The binary I was running for 2.4 during the last test was the one resulting from my bisect session, just before the faulty backport. I can now confirm that 2.4-latest is bogus, which conforms to your observation. 2.3 and older do not have the problem. I'm tagging for 2.4 as well.
@wtarreau Great! Since we have a fairly high-traffic setup here, if you want us to give 2.4 a spin on a proposed patch, we're happy to test it.
Yeah good point, I'll try to provide you with a backport shortly so that you can clean up your production (I'm on several issues in parallel at the moment). But I'm confident that my test was sufficient to confirm it worked there anyway since it worked on 2.6 already.
FYI, the fix was backported to 2.4.
Detailed Description of the Problem
Hello,
We've started experiencing high CPU loads (%usr) on haproxy 2.6.4.
The machine is an 80-core machine delivering just over 3Gbps of traffic over HTTP/1.1 with no TLS, so nothing major.
Here's what we could gather from the machine:
show activity shows:
I have show fd as well; don't know if it's relevant (1009 conns), but I can post it here.
Expected Behavior
No loop :-)
Steps to Reproduce the Behavior
Well, not sure: we can go from 1-3 cores at 100% to a little over 20 cores at 100%, so... don't know.
Do you have any idea what may have caused this?
Not really.
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of haproxy -vv