Closed felipewd closed 2 years ago
So, now we have a little more pressure:
14.44% haproxy [.] run_tasks_from_lists
9.83% haproxy [.] __tasklet_wakeup_on
4.92% haproxy [.] sc_notify
4.91% haproxy [.] _do_poll
3.93% haproxy [.] conn_subscribe
3.73% haproxy [.] wake_expired_tasks
2.83% haproxy [.] h1_wake_stream_for_send.part.0
2.65% haproxy [.] sc_conn_send
2.32% haproxy [.] clock_update_date
1.87% haproxy [.] sc_conn_process
1.87% haproxy [.] h1_io_cb
1.80% haproxy [.] h1_process
1.78% haproxy [.] h1_snd_buf
1.66% [kernel] [k] copy_user_enhanced_fast_string
1.57% haproxy [.] sc_conn_io_cb
1.29% haproxy [.] h1_refresh_timeout
1.26% haproxy [.] h1_subscribe
1.23% haproxy [.] h1_send
1.05% haproxy [.] sc_conn_recv
0.85% [kernel] [k] check_preemption_disabled
0.84% [kernel] [k] syscall_return_via_sysret
0.84% [vdso] [.] __vdso_clock_gettime
0.80% haproxy [.] stream_release_buffers
0.75% haproxy [.] sock_check_events
0.69% [kernel] [k] _raw_spin_lock_irqsave
0.66% haproxy [.] eb32_lookup_ge
0.66% [kernel] [k] _raw_spin_lock
0.61% [vdso] [.] 0x0000000000000695
0.54% [kernel] [k] __fget_light
0.53% [kernel] [k] tasklet_action_common.constprop.0
with a pretty htop picture to show the trouble :-)
The new show activity:
thread_id: 15 (1..35)
date_now: 1661466496.509509
ctxsw: 1372606352 [ 2751990889 4052939029 1702347210 3001496378 2846162639 2209864929 3640547537 2435428430 3014082503 3206866113 3087953261 2816838642 2506207637 2421360451 2764658636 1212046041 3086260444 2743796466 1215782719 2809535673 2568010263 3874744991 2141278815 3143812904 1950730873 2085440325 3316869333 1998315654 2640787318 2550662231 3555974515 4028018279 3335187923 1929122843 3216764970 ]
tasksw: 273160 [ 6737 9324 8626 8001 7358 5918 7827 6928 8471 6426 7318 9301 10579 6687 8303 9563 8359 6169 6343 8625 7205 8544 8414 8027 6731 7480 9752 7494 6727 7644 8146 7751 8557 6242 7583 ]
empty_rq: 797207 [ 18780 17141 78206 14702 20003 18246 20337 15846 16812 34001 13511 34372 18908 16540 15641 13332 14553 13238 13237 72837 15764 55942 20115 14062 13735 18588 17023 17412 50423 16736 14659 12791 17375 14769 17570 ]
long_rq: 2088406030 [ 55022487 81041033 34026069 60009097 56901709 44179823 72792750 48690003 60265178 64122176 61741539 56318214 50104004 48408759 55276882 110124431 61709408 54860430 110203726 56168929 51339409 77480686 42804293 62859978 38997682 41688548 66317272 39946804 52796660 50995232 71099449 80544922 66686593 38563298 64318557 ]
loops: 2100263178 [ 55346638 81371678 34464563 60389153 57295848 44502777 73132515 49029408 60571971 64419788 62063392 56676328 50469490 48751487 55580501 110421742 62003259 55148278 110420869 56621388 51723021 77787285 43193752 63160689 39306974 42059047 66692510 40303808 53181633 51324293 71466523 80833854 67006215 38911871 64630630 ]
wake_tasks: 2088406216 [ 55022493 81041038 34026075 60009103 56901717 44179828 72792755 48690008 60265185 64122181 61741544 56318218 50104010 48408763 55276886 110124435 61709413 54860435 110203732 56168934 51339414 77480692 42804298 62859983 38997687 41688552 66317276 39946808 52796666 50995238 71099457 80544926 66686599 38563304 64318563 ]
wake_signal: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_io: 15848925 [ 422789 468268 524592 491679 479408 404930 459685 432436 437285 430888 422784 444899 514116 410054 389513 568838 387446 361619 383142 580167 523031 466742 451104 417192 465245 461995 462440 437869 529134 418724 485545 437786 408989 406508 462083 ]
poll_exp: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_drop_fd: 59319 [ 1395 2079 1816 1680 1499 1229 1716 1520 1813 1395 1718 1971 2425 1427 1777 2229 1781 1295 1245 1942 1521 1936 1825 1757 1474 1656 2043 1640 1396 1696 1811 1681 1932 1318 1681 ]
poll_skip_fd: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
conn_dead: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
stream_calls: 159429 [ 3964 5396 5029 4696 4322 3498 4529 4083 4953 3722 4257 5404 6146 3914 4876 5489 4896 3633 3640 5006 4239 4970 4915 4678 3960 4371 5678 4399 3977 4471 4759 4516 4968 3679 4396 ]
pool_fail: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
buf_wait: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_tot: 631 [ 23 3 22 18 39 37 25 24 3 5 17 22 34 9 9 39 20 16 6 5 5 21 5 32 3 27 24 7 3 20 2 23 38 23 22 ]
cpust_ms_1s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_15s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
avg_loop_us: 16 [ 22 23 24 24 6 22 22 22 6 21 23 22 5 22 22 5 24 5 5 22 24 6 23 6 6 22 23 21 5 6 6 22 6 22 5 ]
accepted: 82841 [ 4 664 41417 6 4 5 3 12 3 2632 5 885 120 4 4 7 2 6 49 20324 89 10372 27 214 16 4 7 475 5446 9 9 5 3 3 6 ]
accq_pushed: 82841 [ 2000 2863 2620 2411 2183 1753 2391 2067 2567 1925 2298 2831 3284 2033 2509 2940 2511 1861 1802 2670 2153 2621 2586 2451 2030 2295 2950 2286 1991 2324 2479 2336 2644 1876 2300 ]
accq_full: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
accq_ring: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
fd_takeover: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
I'll post fd.log here as an attachment: fd.log
I also did a show threads to try and help here:
Thread 1 : id=0x7f743382cfc0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/1 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=539791111616 now=539791128097 diff=16481
curr_task=0
Thread 2 : id=0x7f74338096c0 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/2 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=682584850854 now=682584865013 diff=14159
curr_task=0x7f742c5b7b30 (tasklet) calls=15789944
fct=0x49fc30(h1_io_cb) ctx=0x7f742c6abd30
Thread 3 : id=0x7f74315346c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/3 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=526722981077 now=526723071838 diff=90761
curr_task=0
* Thread 4 : id=0x7f741ad4b6c0 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=0 rqsz=2
1/4 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=559794720313 now=559794871342 diff=151029
curr_task=0x7f7404368270 (task) calls=2 last=0
fct=0x5bfcc0(task_run_applet) ctx=0x7f73fc0a4160(<CLI>)
strm=0x7f73fc3d3a80,8 src=unix fe=GLOBAL be=GLOBAL dst=<CLI>
txn=(nil),0 txn.req=-,0 txn.rsp=-,0
rqf=c0c023 rqa=0 rpf=80008000 rpa=0
scf=0x7f73fc2379d0,EST,0 scb=0x7f73fc2b9510,EST,1
af=(nil),0 sab=0x7f73fc0a4160,9
cof=0x7f7404926c60,40300:PASS(0x7f73fc31c800)/RAW((nil))/unix_stream(541)
cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
Thread 5 : id=0x7f741a54a6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/5 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=493712370361 now=493712525740 diff=155379
curr_task=0
Thread 6 : id=0x7f7419d496c0 act=1 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/6 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=485399557521 now=485399707714 diff=150193
curr_task=0x7f74001f1610 (tasklet) calls=32978096
fct=0x562140(sc_conn_io_cb) ctx=0x7f74001ef790
strm=0x7f74005ab990,43c0a src=170.80.58.66 fe=ssl be=hyper_streaming dst=v01
txn=0x7f74001e8880,3000 txn.req=MSG_DONE,4c txn.rsp=MSG_DATA,d
rqf=4804e060 rqa=0 rpf=8004c060 rpa=0
scf=0x7f74001ef790,EST,0 scb=0x7f7400134f00,CLO,219
af=(nil),0 sab=(nil),0
cof=0x7f73b0787ed0,801c0300:H1(0x7f73b023e8e0)/RAW((nil))/tcpv4(2585)
cob=0x7f74044c9110,10080300:H1(0x7f73b00ee7b0)/RAW((nil))/unix_stream(2821)
Thread 7 : id=0x7f74195486c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=2
1/7 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=559474594077 now=559474756463 diff=162386
curr_task=0
Thread 8 : id=0x7f7418d476c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/8 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=541279191436 now=541279356512 diff=165076
curr_task=0
Thread 9 : id=0x7f74185466c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/9 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=638118984122 now=638119157680 diff=173558
curr_task=0
Thread 10: id=0x7f7417d456c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/10 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=495671580046 now=495671757255 diff=177209
curr_task=0
Thread 11: id=0x7f74175446c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=3
1/11 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=726126873197 now=726127050381 diff=177184
curr_task=0
Thread 12: id=0x7f7416d436c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/12 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=540441479324 now=540441682578 diff=203254
curr_task=0
Thread 13: id=0x7f74165426c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=4
1/13 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=602301788886 now=602301971826 diff=182940
curr_task=0
Thread 14: id=0x7f7415d416c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/14 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=372565060411 now=372565256775 diff=196364
curr_task=0
Thread 15: id=0x7f74155406c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/15 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=431565543435 now=431565724915 diff=181480
curr_task=0
Thread 16: id=0x7f7414d3f6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/16 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=832710319322 now=832710645390 diff=326068
curr_task=0
Thread 17: id=0x7f741453e6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/17 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=540287123702 now=540287358250 diff=234548
curr_task=0
Thread 18: id=0x7f7413d3d6c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/18 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=736694801720 now=736695008381 diff=206661
curr_task=0
Thread 19: id=0x7f741353c6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/19 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=1000267779058 now=1000268006877 diff=227819
curr_task=0
Thread 20: id=0x7f7412d3b6c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/20 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=554938771371 now=554938983617 diff=212246
curr_task=0
Thread 21: id=0x7f741253a6c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/21 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=597860267189 now=597860478353 diff=211164
curr_task=0x7f73f04e8a50 (tasklet) calls=25659714
fct=0x562140(sc_conn_io_cb) ctx=0x7f73f0145ef0
strm=0x7f73f01b2b50,43c0a src=170.80.58.66 fe=ssl be=hyper_streaming dst=v01
txn=0x7f73f04d06a0,3000 txn.req=MSG_DONE,4c txn.rsp=MSG_DATA,d
rqf=4804e060 rqa=0 rpf=8004c060 rpa=0
scf=0x7f73f0145ef0,EST,0 scb=0x7f73f04e45a0,CLO,219
af=(nil),0 sab=(nil),0
cof=0x7f73b04ee570,801c0300:H1(0x7f7404114f80)/RAW((nil))/tcpv4(3134)
cob=0x7f73f04e8790,10080300:H1(0x7f74045eb220)/RAW((nil))/unix_stream(3286)
Thread 22: id=0x7f7411d396c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/22 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=812174704897 now=812174942077 diff=237180
curr_task=0
Thread 23: id=0x7f74115386c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/23 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=347139385924 now=347139623569 diff=237645
curr_task=0
Thread 24: id=0x7f7410d376c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/24 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=713553442243 now=713553677946 diff=235703
curr_task=0
Thread 25: id=0x7f74105366c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=2
1/25 stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=527119604979 now=527119844516 diff=239537
curr_task=0x7f73c4185a20 (tasklet) calls=3986372
fct=0x49fc30(h1_io_cb) ctx=0x7f73c418c5b0
Thread 26: id=0x7f740fd356c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/26 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=565362961004 now=565363251370 diff=290366
curr_task=0
Thread 27: id=0x7f740f5346c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/27 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=471811449531 now=471811741222 diff=291691
curr_task=0
Thread 28: id=0x7f740ed336c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/28 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=422315547211 now=422315865362 diff=318151
curr_task=0
Thread 29: id=0x7f740e5326c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/29 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=513688550811 now=513688860417 diff=309606
curr_task=0
Thread 30: id=0x7f740dd316c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/30 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=485147869408 now=485148188583 diff=319175
curr_task=0
Thread 31: id=0x7f740d5306c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/31 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=580571993751 now=580572318316 diff=324565
curr_task=0
Thread 32: id=0x7f740cd2f6c0 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=0 rqsz=1
1/32 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=651206521989 now=651206836749 diff=314760
curr_task=0
Thread 33: id=0x7f740c52e6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/33 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=509814804991 now=509815117988 diff=312997
curr_task=0
Thread 34: id=0x7f740bd2d6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/34 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=450111160223 now=450111503420 diff=343197
curr_task=0
Thread 35: id=0x7f740b52c6c0 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
1/35 stuck=0 prof=0 harmless=1 wantrdv=0
cpu_ns: poll=461606548056 now=461606890596 diff=342540
curr_task=0
Here is a flamegraph for one of the hot threads:
There seems to be substantial contention:
Samples: 120K of event 'cycles', Event count (approx.): 85267650382
Children Self Command Shared Object Symbol
- 87.53% 0.47% haproxy haproxy [.] run_poll_loop
- 87.06% run_poll_loop
- 63.03% process_runnable_tasks
- 62.45% run_tasks_from_lists <================================================.
- 28.32% sc_conn_io_cb |
- 27.99% sc_conn_io_cb |
- 17.54% sc_conn_send |
- 12.66% h1_subscribe |
- 11.42% conn_subscribe |
- 9.63% _tasklet_wakeup_on (inlined) |
6.23% __tasklet_wakeup_on |
0.65% sock_check_events |
2.26% h1_snd_buf |
- 8.51% sc_conn_process |
|
|
| activity[tid].ctxsw++; |
0.76 |130: mov tid@@Base+0x68a9c0,%rsi |
| budgets[queue]--; |
0.00 | sub $0x1,%edx |
| process = t->process; |
0.01 | mov -0x10(%r11),%r9 |
| t = (struct task *)LIST_ELEM(tl_queues[queue].n, struct tasklet *, list); |
0.75 | lea -0x18(%r11),%rbp |
| budgets[queue]--; |
0.81 | mov %edx,(%rcx) |
| ctx = t->context; |
0.02 | mov -0x8(%r11),%rcx |
| activity[tid].ctxsw++; |
0.06 | mov %fs:(%rsi),%edx |
0.80 | shl $0x7,%rdx |
1.68 | addl $0x1,0x7315e4(%rdx) |
| t->calls++; |
0.73 | addl $0x1,-0x14(%r11) |
| th_ctx->flags &= ~TH_FL_STUCK; // this thread is still running |
0.88 | andl $0xfffffffe,0x34(%rax) |
| th_ctx->current = t; |
0.00 | mov %rbp,0x20(%rax) |
| _HA_ATOMIC_DEC(&th_ctx->rq_total); |
36.32 | lock subl $0x1,0xd0(%rax) <=================================+
| LIST_DEL_INIT(&((struct tasklet *)t)->list); |
0.07 | mov (%r11),%rdx |
0.01 | mov 0x8(%r11),%rax |
| if (t->state & TASK_F_TASKLET) { |
1.67 | testb $0x80,-0x17(%r11) |
| LIST_DEL_INIT(&((struct tasklet *)t)->list); |
| mov %rax,0x8(%rdx) |
0.00 | mov %rdx,(%rax) |
| movq %r11,%xmm0 |
1.67 | punpcklqdq %xmm0,%xmm0 |
| movups %xmm0,(%r11) |
| if (t->state & TASK_F_TASKLET) { |
| je 230 |
| if (unlikely(task_profiling_mask & tid_bit)) { |
0.00 | mov tid_bit@@Base+0x68ce00,%rdi |
1.75 | mov task_profiling_mask,%rax |
| and %fs:(%rdi),%rax |
| mov %rax,0x10(%rsp) |
| jne 348 |
| state = _HA_ATOMIC_FETCH_AND(&t->state, TASK_PERSISTENT); |
1.79 |1ae: mov -0x18(%r11),%eax |
0.00 |1b2: mov %eax,%esi |
| mov %eax,%edx |
| and $0x180b8,%esi |
35.36 | lock cmpxchg %esi,0x0(%rbp) <=================================='
0.08 | jne 1b2
| done++;
1.47 | add $0x1,%r12d
| if (likely(!(state & TASK_KILLED))) {
0.00 | test $0x20,%dl
| jne 310
| process(t, ctx, state);
0.01 | mov %rbp,%rdi
0.01 | mov %rcx,%rsi
1.59 | call *%r9
| if (unlikely(task_profiling_mask & tid_bit)) {
0.08 | mov tid_bit@@Base+0x68ce00,%rdi
0.62 | mov task_profiling_mask,%rax
0.02 | mov th_ctx@@Base+0x68ce08,%r10
1.26 | and %fs:(%rdi),%rax
| jne 4e0
| th_ctx->current = NULL;
0.07 |1f8: mov %fs:(%r10),%rax
1.27 |1fc: movq $0x0,0x20(%rax)
Thanks for the details. It looks like there's something causing wakeups in a loop. This could be a mishandled event that we disable then re-enable, or something like this. It looks related to h1_wake_stream_for_send() but that could also just be an artefact. However this one does have the tasklet_wakeup() call, so there really seems to be something there.
Do you remember which last version didn't exhibit this bad behavior ? Was this 2.6.something or 2.4 ? Maybe one new error check somewhere cannot escape from a bad situation.
I might ask you to try to produce traces, but this can be heavy (especially in such a loop) so we may have to check the best option. One such option could be that I backport the file-backed rings to 2.6 because that significantly eases it.
Hi @wtarreau thanks for the reply.
This is a new deploy, and because we needed 2.6-only features this is a fresh setup...so nothing to really compare it to.
Right, we're around ~10Gbps of traffic with all 35 threads at 100%. (We're supposed to increase to ~55-60Gbps on this machine, but we're hanging on until this is fixed).
We're happy to collect any metrics you think might help.
Wow, 100% on 35 threads for "only" 10G is a lot :-(
Then we could start with something non-intrusive. On the CLI, please send: set profiling tasks on
(it could slightly increase CPU usage by ~1% but in your case that will not be noticeable). Please wait 10 seconds or so and issue show profiling tasks. Then you can send set profiling tasks off.
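For reference, these commands are issued on the stats socket. A minimal helper, sketched under the assumption that a Unix stats socket is configured (the socket path below is a placeholder, not taken from this thread):

```python
import socket

def haproxy_cli(cmd, sock_path='/var/run/haproxy.sock'):
    """Send one command to the HAProxy stats socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall((cmd + '\n').encode())
        chunks = []
        while True:          # HAProxy closes the connection after replying
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks).decode()

# Sketch of the sequence described above:
#   haproxy_cli('set profiling tasks on')
#   time.sleep(10)
#   print(haproxy_cli('show profiling tasks'))
#   haproxy_cli('set profiling tasks off')
```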
This will indicate the various function names that are task handlers and that are woken up. The counts will hopefully help us. I have some improvements in development (not finished yet) that I hope to land soon, which will indicate who wakes up each function. But in your case it seems to be h1_wake_some_streams_for_send() or something like this.
Also, thinking about something, as you're saying it's a new deployment, could you please check with top/htop/taskset/whatever that haproxy is properly running on a single physical CPU socket so that we can be certain there are no atomic ops shared over the QPI bus ? I'm seeing 35 threads but if they're neither bound by taskset nor by a cpu-map directive, this could be devastating. Note that with such a setup you could be interested in giving 2.7-dev a try from time to time, but then I'll guide you through this.
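For reference, binding can be done externally with taskset or in the configuration with the cpu-map directive mentioned above. A hypothetical fragment (the core numbers are an assumption; check the machine's topology with lscpu first):

```
global
    nbthread 35
    # pin thread N to CPU N-1, keeping all threads on one physical socket
    # (core IDs are illustrative)
    cpu-map auto:1/1-35 0-34
```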
Right, pretty simple, here you go:
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
sc_conn_io_cb 90154616 1.698m 1.130us - -
h1_io_cb 89799037 2.554m 1.706us - -
process_stream 8558 93.00ms 10.87us 1.702s 198.8us
h1_timeout_task 6337 8.274us 1.000ns 374.4ms 59.08us
accept_queue_process 5822 39.18ms 6.729us - -
task_run_applet 37 718.3us 19.41us 967.1us 26.14us
main+0xfbf00 18 23.19us 1.288us 1.759ms 97.70us
srv_cleanup_idle_conns 4 10.81us 2.701us 313.2us 78.30us
srv_cleanup_toremove_conns 1 37.51us 37.51us 23.99us 23.99us
This was with set profiling tasks on; sleep 10; show profiling tasks :-)
I did not pin specifically to a single socket, but I can do it, sure. Since this behavior started happening at around ~800Mbps, I didn't consider this as a possibility. I'll do it now.
Ok, so now I've bound all threads to the same physical CPU using taskset, but there's no change in the behavior.
I repeated the profiling from before; now it looks like this:
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
sc_conn_io_cb 265481197 1.279m 288.0ns - -
h1_io_cb 263855587 1.225m 278.0ns - -
process_stream 28627 259.4ms 9.060us 129.1ms 4.507us
h1_timeout_task 21405 13.88us - 221.6ms 10.35us
accept_queue_process 19497 113.3ms 5.809us - -
task_run_applet 155 1.787ms 11.53us 777.1us 5.013us
ssl_sock_io_cb 102 13.37ms 131.0us - -
main+0xfbf00 53 40.59us 765.0ns 534.6us 10.09us
srv_cleanup_idle_conns 10 5.810us 581.0ns 95.13us 9.512us
This is the current htop scenario:
@wtarreau Also, not sure if it's relevant/helps, but on a similar machine (better CPU) we get way less %usr and way more %sys on ~85Gbps of real traffic.
But all our other machines are on 2.4 branch. This new one is our 1st real-traffic-attempt to dip our toes on 2.6 :-)
The only news on the 2.6 conf is that we've transferred JWT validation from our applications to haproxy, the rest is pretty much the same. The traffic pattern is also the same.
Thanks. So that's 9 million calls per second to the I/O handler, which sounds like quite a lot! At 10 Gbps, that's roughly one call per 100 bytes; that cannot match realistic traffic, so something must be causing spurious wakeups.
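The arithmetic behind that estimate, as a quick sketch (the call count is taken from the first profiling dump above; the 10-second window is an approximation):

```python
# Rough sanity check of the wakeup rate.
calls = 90_154_616            # sc_conn_io_cb calls in the profiling dump
window_s = 10                 # approximate profiling window
rate = calls / window_s       # ~9 million calls per second

gbps = 10                     # reported traffic level
bytes_per_s = gbps * 1e9 / 8  # 1.25 GB/s
bytes_per_call = bytes_per_s / rate

print(round(rate))            # 9015462
print(round(bytes_per_call))  # 139
```

At ~139 bytes of payload per handler call, almost every wakeup must be doing no useful work.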
I've extracted some stats from your fd.log that might help us, I'm pasting them here so that it's easier to get back to them:
$ < /dev/shm/fd.log sed -ne 's,^.*\(back=[^ ]*\).*\(cflg=[^ ]*\).*\(req.state=[^ ]*\).*\(res.state=[^ ]*\).*\(sd.flg=[^ ]*\).*,\1 \2 \3 \4 \5,p' | sort |uniq -c|sort -nr | column -t
1167 back=1 cflg=0x10000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x009c0001
1074 back=0 cflg=0x80000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02004001
93 back=0 cflg=0x80000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02005001
14 back=1 cflg=0x10080300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x009c0a01
9 back=0 cflg=0x801c0300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x0200c001
6 back=0 cflg=0x80040300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x0200c001
2 back=1 cflg=0x10000300 req.state=MSG_DONE res.state=MSG_RPBEFORE sd.flg=0x02800001
2 back=1 cflg=0x10000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02840001
2 back=0 cflg=0x80000300 req.state=MSG_DONE res.state=MSG_RPBEFORE sd.flg=0x02804001
2 back=0 cflg=0x80000300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02804001
1 back=1 cflg=0x10080300 req.state=MSG_DONE res.state=MSG_DATA sd.flg=0x02840a01
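The same aggregation can be sketched in Python for anyone replaying this against their own fd.log (the regex mirrors the five fields captured by the sed expression above):

```python
import re
from collections import Counter

# Extract the same five fields as the sed pipeline and count how many
# fd.log lines share each combination, most frequent first.
FIELDS = re.compile(
    r'(back=\S+).*(cflg=\S+).*(req\.state=\S+).*(res\.state=\S+).*(sd\.flg=\S+)'
)

def summarize(lines):
    counts = Counter()
    for line in lines:
        m = FIELDS.search(line)
        if m:
            counts[' '.join(m.groups())] += 1
    return counts.most_common()

# Usage: for combo, n in summarize(open('fd.log')): print(n, combo)
```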
A quick look shows that the 14 backend conns have sent the write shutdown, which is unusual enough to raise some awareness here, and the 9 frontend ones are shut in both directions, and again that's suspicious enough to see so many in ratio. I suspect that the wakeups come from read shutdowns that are reported by the poller and ignored for whatever reason by the mux because something would prevent it from handling them. That's very strange as we haven't had such an issue in a long time now. CCing @capflam so that he can also follow and chime in in case he has any idea.
I'm pretty sure your CPU binding is unrelated for now. By default, it's supposed to automatically bind to a single NUMA node but I asked as I wanted to be certain you didn't have something odd.
Thanks for the info. JWT should be totally unrelated. Here I'm really suspecting a logic error in dealing with a shutdown report.
By the way, that would explain the stream wakeups. Maybe the mux receives the shutdown, doesn't disable reading, but wakes the stream up. Past a certain load, the short time during which the event may trigger again starts to count and could matter. That might explain why it could have remained undetected.
BTW, out of curiosity and totally unrelated, what NIC are you using for your 85G ? An mlx connectx-4 or 5 I guess, or may be an intel ICE ?
The 2.6 machine is with a Mellanox connectx-4, our heavier ones (2.4) are with connectx-5..
A closer look by focusing on connections whose flags are 0x801c0300 shows several following exactly this pattern:
2449 : st=0x001d22(cl HEOpI W:sRa R:sRa) tmask=0x20000000 umask=0x0 owner=0x7f7394446640 iocb=0x5adbe0(sock_conn_iocb) back=0 cflg=0x801c0300 fam=ipv4 lport=80 fe=ssl mux=H1 ctx=0x7f739470c050 h1c.flg=0x102202 .sub=0 .ibuf=2@0x7f73c86df380+593/65536 .obuf=0@(nil)+0/0 h1s=0x7f739470d5d0 h1s.flg=0x104048 .sd.flg=0x200c001 .req.state=MSG_DONE .res.state=MSG_DATA .meth=GET status=206 .sd.flg=0x0200c001 .sc.flg=0x00000000 .sc.app=0x7f73c825077e .subs=0x7f73c861a350(ev=2 tl=0x7f73c824d1e0 tl.calls=178435763 tl.ctx=0x7f73c861a340 tl.fct=sc_conn_io_cb) xprt=RAW !
I.e.:
The most suspicious ones are easy to spot using grep 'tl.calls=[0-9]\{8\}' fd.log.
OK I managed to reproduce it by hand. It's a bit complicated to describe as it's very timing sensitive, I'll check this tomorrow with Christopher.
I'm currently bisecting it, because it doesn't affect 2.4.0 but does affect latest 2.4, so please do not rush on updating your 2.4 yet as you could face the problem as well.
So that's caused by this one in 2.4.16, which originated in 2.6 as ba25116c1:
commit b3666ef018a94a5cbcc3a143702a7b3d0b108f4c
Author: Willy Tarreau <w@1wt.eu>
Date: Thu Mar 17 16:19:09 2022 +0100
BUG/MEDIUM: stream-int: do not rely on the connection error once established
We're trading a bug for another one :-(
It seems to indicate that there's one remaining case in the H1 mux where a connection error is not reported as its equivalent stream error. We'll check that tomorrow.
@felipewd I could find and test a solution to this bug, but I'm not merging it yet because it gave me the opportunity to analyze something possibly deeper, something that's just suboptimal in error processing but that I want to sort out to understand the whole picture. In the meantime, I'm attaching the patch here, which works on 2.7 and should trivially apply to 2.6 as well, so feel free to apply it to your 2.6 branch.
0001-BUG-MEDIUM-mux-h1-do-not-refrain-from-signaling-erro.patch.txt
Hi @wtarreau great, thanks!
We've deployed it on the machine; let's observe. But a quick profiling afterwards shows it's really promising:
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
sc_conn_io_cb 3396744 11.06s 3.254us - -
h1_io_cb 458179 345.4ms 753.0ns - -
process_stream 15897 361.4ms 22.73us 16.87ms 1.061us
h1_timeout_task 7177 - - 113.4ms 15.80us
accept_queue_process 4593 29.00ms 6.313us - -
task_run_applet 95 1.437ms 15.12us 514.4us 5.414us
ssl_sock_io_cb 52 6.730ms 129.4us - -
main+0xfbef0 31 25.00us 806.0ns 97.66us 3.150us
srv_cleanup_idle_conns 6 7.271us 1.211us 33.27us 5.544us
I'll report back after a few hours.
Ah much better indeed, thank you for the quick feedback! It's safe to keep this patch for now, I'm just not certain whether it will still be needed or not once I get down to related issues I found at lower layers, which is why I prefer not to merge it yet.
So finally I pushed it. The other one I thought it depended on was of much less importance and not directly related. You can definitely keep running on that one until 2.6.5.
@wtarreau so, after 3 hours, the differences are staggering: the machine can handle a lot more traffic than before with way less pressure. The %usr went from around 30,000% to push 10Gbps of data down to around 405% to handle 17Gbps currently. This is the profiling for the higher traffic:
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
sc_conn_io_cb 2942999 10.30s 3.499us - -
h1_io_cb 397679 303.4ms 762.0ns - -
process_stream 11834 267.6ms 22.61us 14.53ms 1.228us
h1_timeout_task 5211 - - 86.26ms 16.55us
accept_queue_process 3383 20.83ms 6.157us - -
task_run_applet 44 612.4us 13.92us 196.7us 4.471us
ssl_sock_io_cb 22 3.457ms 157.2us - -
main+0xfbef0 14 14.08us 1.005us 104.0us 7.431us
srv_cleanup_idle_conns 3 2.132us 710.0ns 3.553us 1.184us
So as far as our side goes, this is a complete success. Thanks a lot! :-)
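To put a number on that, a back-of-the-envelope sketch using the figures quoted above (~30,000% usr for 10Gbps before the patch, ~405% for 17Gbps after):

```python
# CPU cost per Gbps, before and after the fix.
before = 30000 / 10   # 3000 %CPU per Gbps
after = 405 / 17      # ~23.8 %CPU per Gbps
improvement = before / after

print(round(after, 1))     # 23.8
print(round(improvement))  # 126: roughly 126x less CPU per unit of traffic
```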
Here's how (drastic) it looks; it should be pretty easy to spot when the patched version was rolled out:
Thanks a lot guys for the great feedback. Indeed the difference on the graph is impressive. The peak in the middle was due to my bug :-) I've merged the fix in mainline. Just reopening the issue so that we can track the fact that the fix needs to be backported as far as 2.5 and will close once done.
Ah Lukas just beat me to retagging. Thanks Lukas!
By the way I'll have to retest more carefully with 2.4 because I'm pretty sure I originally managed to reproduce it on 2.4 and even mentioned it above but yesterday I failed at it. So maybe we'll have to backport all this as far as 2.0 in the end. Let's not close this once it lands in 2.5.
Hi @wtarreau for what it's worth, the graph posted by @rnsanchez is on 2.4.
We had downgraded from 2.6 to 2.4 using our usual production conf in the hope this was contained to 2.6... but as you can see, 2.6 with the fix is way better than 2.4-latest. It's clearly much more difficult to trigger this bug on 2.4, but the %usr usage seems to indicate it's possible.
Ah so you mean that the period on the left was 2.4, then the middle was 2.6 and finally on the right was 2.6+fix ? Or even maybe the opposite for the first two, i.e. 2.6, 2.4 then 2.6+fix ?
The latter. Over the weekend we switched back to 2.4 (disabling JWT validation altogether) until we got the patch. So it was 2.6, then 2.4 from Friday evening, until 2.6+fix finally yesterday.
So the entire graph (pre-patch) is 2.4
OK so it confirms my previous observation that 2.4-latest had the issue as well. I don't know why I didn't reproduce it lately, I need to recheck more carefully.
OK, now understood. The binary I was running for 2.4 during the last test was the one resulting from my bisect session, just before the faulty backport. I can now confirm that 2.4-latest is bogus, which conforms to your observation. 2.3 and older do not have the problem. I'm tagging for 2.4 as well.
@wtarreau Great! Since we have a fairly high-traffic setup here, if you want us to give 2.4 a spin on a proposed patch, we're happy to test it.
Yeah good point, I'll try to provide you with a backport shortly so that you can clean up your production (I'm on several issues in parallel at the moment). But I'm confident that my test was sufficient to confirm it worked there anyway since it worked on 2.6 already.
FYI, the fix was backported to 2.4.
Detailed Description of the Problem
Hello,
We've started experiencing high CPU loads (%usr) on haproxy 2.6.4.
The machine is an 80-core machine delivering just over 3Gbps of traffic over HTTP/1.1 with no TLS, so nothing major.
Here's what we could gather from the machine:
show activity shows:
I have show fd as well; don't know if it's relevant (1009 conns), but I can post it here.
Expected Behavior
No loop :-)
Steps to Reproduce the Behavior
Well, not sure: we can go from 1-3 cores at 100% to a little over 20 cores at 100%, so... don't know.
Do you have any idea what may have caused this?
Not really.
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of haproxy -vv