medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

Haproxy stops routing during stress test #8130

Closed: dianabarsan closed this issue 1 year ago

dianabarsan commented 1 year ago

Describe the bug
During e2e tests, I have frequently noticed the haproxy container reaching absurd CPU numbers on my machine (an AMD Ryzen 9 5900X). Very frequently, haproxy would take up to 100% of one CPU core.


Then, during a stress test on an AWS-hosted distributed setup, every time I loaded the instance with more than 100 users, the suite failed because haproxy had stopped responding. The container was neither killed nor restarted, but the instance was unreachable and API reported that it could not connect to haproxy.

Upon inspecting haproxy logs, at first I noticed:

[NOTICE]   (1) : haproxy version is 2.6.9-3a3700a
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[ALERT]    (1) : Current worker (25) exited with code 137 (Killed)
[ALERT]    (1) : exit-on-failure: killing every processes with SIGTERM

On a subsequent retry, I saw:

Thread 2 is about to kill the process.
  Thread 1 : id=0x7fb8dce74f40 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=21 rqsz=760
      1/1    stuck=0 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=20789220579 now=21014641652 diff=225421073
             curr_task=0x55705bb4b020 (task) calls=1 last=0
               fct=0x557059fe3950(process_stream) ctx=0x55709982b210
             strm=0x55709982b210,c00 src=10.0.1.4 fe=http-in be=http-in dst=unknown
             txn=0x7fb7fbea2830,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=40d08002 rqa=30 rpf=80000000 rpa=0
             scf=0x55705c4590f0,EST,20 scb=0x55705bb660e0,INI,21
             af=(nil),0 sab=(nil),0
             cof=0x7fb85b606e50,80000300:H1(0x55705c46e3b0)/RAW((nil))/tcpv4(5635)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
             Current executing Lua from a stream analyser -- stack traceback:

*>Thread 2 : id=0x7fb8dce69700 act=1 glob=1 wq=1 rq=1 tl=1 tlsz=2 rqsz=493
      1/2    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=19770093994 now=86782386464 diff=67012292470
             curr_task=0x7fb7d77d19d0 (task) calls=1 last=0
               fct=0x557059fe3950(process_stream) ctx=0x7fb7d77d15e0
             strm=0x7fb7d77d15e0,c00 src=10.0.1.4 fe=http-in be=http-in dst=unknown
             txn=0x7fb7d77d1c30,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=40d08002 rqa=30 rpf=80000000 rpa=0
             scf=0x7fb813f12be0,EST,20 scb=0x7fb7d77d1af0,INI,21
             af=(nil),0 sab=(nil),0
             cof=0x7fb893e46040,80000300:H1(0x7fb8da8cd700)/RAW((nil))/tcpv4(11218)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
             Current executing Lua from a stream analyser -- stack traceback:

             call trace(22):
             | 0x55705a0aefcb [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x55705a0afa66 [48 8b 05 e3 aa 1d 00 48]: debug_handler+0x66/0x10b
             | 0x7fb8dd601140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x7fb8dd600fe1 [48 8b 84 24 08 01 00 00]: libpthread:raise+0x141/0x179
             | 0x55705a0ae3f7 [64 48 8b 53 10 64 48 8b]: main+0x165437
             | 0x55705a0ae448 [0f 1f 84 00 00 00 00 00]: main+0x165488
             | 0x7fb8dd601140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x557059f82403 [48 85 d0 75 20 48 89 f0]: lua_take_global_lock+0x23/0x4c
             | 0x557059f8842f [e9 f3 f9 ff ff 0f 1f 40]: main+0x3f46f
             | 0x55705a01e534 [85 c0 74 69 48 8b 5d 20]: sample_process+0x54/0x107
             | 0x55705a01f342 [48 85 c0 74 59 64 48 63]: sample_fetch_as_type+0x42/0xc1
             | 0x55705a03b50b [48 89 c6 48 85 c0 0f 84]: sess_build_logline+0x248b/0x40f3
             | 0x55705a098ff4 [4c 8b a3 88 00 00 00 48]: main+0x150034
             | 0x557059ff92f9 [4c 8b 0c 24 48 8b 4c 24]: main+0xb0339
             | 0x557059ffb9d6 [4c 8b 44 24 18 83 f8 07]: http_process_req_common+0xa6/0x1035
             | 0x557059fe58b8 [85 c0 0f 85 fc f7 ff ff]: process_stream+0x1f68/0x35e9
 >Thread 3 : id=0x7fb8d3924700 act=1 glob=1 wq=1 rq=1 tl=1 tlsz=2 rqsz=498
      1/3    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=20556386735 now=89985480430 diff=69429093695
             curr_task=0x7fb833fd1390 (task) calls=1 last=0
               fct=0x557059fe3950(process_stream) ctx=0x7fb833fd0fa0
             strm=0x7fb833fd0fa0,c00 src=10.0.1.4 fe=http-in be=http-in dst=unknown
             txn=0x7fb833fd15f0,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=40d08002 rqa=30 rpf=80000000 rpa=0
             scf=0x7fb833fd0f30,EST,20 scb=0x7fb833fd14b0,INI,21
             af=(nil),0 sab=(nil),0
             cof=0x7fb83741b110,80000300:H1(0x7fb85544fe40)/RAW((nil))/tcpv4(11845)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
             Current executing Lua from a stream analyser -- 
             call trace(19):
             | 0x55705a0aefcb [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x55705a0afa66 [48 8b 05 e3 aa 1d 00 48]: debug_handler+0x66/0x10b
             | 0x7fb8dd601140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x557059f8242a [eb d4 0f 1f 40 00 41 54]: lua_take_global_lock+0x4a/0x4c
             | 0x557059f85d97 [e9 f2 fe ff ff 0f 1f 40]: hlua_ctx_init+0x167/0x1e4
             | 0x557059f8840c [85 c0 0f 84 bc 00 00 00]: main+0x3f44c
             | 0x55705a01e534 [85 c0 74 69 48 8b 5d 20]: sample_process+0x54/0x107
             | 0x55705a01f342 [48 85 c0 74 59 64 48 63]: sample_fetch_as_type+0x42/0xc1
             | 0x55705a03b50b [48 89 c6 48 85 c0 0f 84]: sess_build_logline+0x248b/0x40f3
             | 0x55705a098ff4 [4c 8b a3 88 00 00 00 48]: main+0x150034
             | 0x557059ff92f9 [4c 8b 0c 24 48 8b 4c 24]: main+0xb0339
             | 0x557059ffb9d6 [4c 8b 44 24 18 83 f8 07]: http_process_req_common+0xa6/0x1035
             | 0x557059fe58b8 [85 c0 0f 85 fc f7 ff ff]: process_stream+0x1f68/0x35e9
  Thread 4 : id=0x7fb8d3123700 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=29 rqsz=778
      1/4    stuck=0 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=21138813085 now=95953444002 diff=74814630917
             curr_task=0x7fb86c80ae60 (task) calls=1 last=0
               fct=0x557059fe3950(process_stream) ctx=0x7fb86c80aa70
             strm=0x7fb86c80aa70,c00 src=10.0.1.4 fe=http-in be=http-in dst=unknown
             txn=0x7fb86c80b0c0,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=40d08002 rqa=30 rpf=80000000 rpa=0
             scf=0x7fb86c80aa00,EST,20 scb=0x7fb86c80af80,INI,21
             af=(nil),0 sab=(nil),0
             cof=0x7fb853f53e20,80000300:H1(0x7fb8b12b27f0)/RAW((nil))/tcpv4(6201)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
             Current executing Lua from a stream analyser -- 
 >Thread 5 : id=0x7fb8d2922700 act=1 glob=1 wq=1 rq=1 tl=1 tlsz=10 rqsz=793
      1/5    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=21689679358 now=99475683274 diff=77786003916
             curr_task=0x7fb6d7fe4400 (task) calls=1 last=0
               fct=0x557059fe3950(process_stream) ctx=0x7fb6d7fe4010
             strm=0x7fb6d7fe4010,c00 src=10.0.1.4 fe=http-in be=http-in dst=unknown
             txn=0x7fb6d7fe4660,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=40d08002 rqa=30 rpf=80000000 rpa=0
             scf=0x7fb6d7fe3fa0,EST,20 scb=0x7fb6d7fe4520,INI,21
             af=(nil),0 sab=(nil),0
             cof=0x7fb8730bcd20,80000300:H1(0x7fb8730bcf90)/RAW((nil))/tcpv4(6219)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
             Current executing Lua from a stream analyser -- 
             call trace(19):
             | 0x55705a0aefcb [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x55705a0afa66 [48 8b 05 e3 aa 1d 00 48]: debug_handler+0x66/0x10b
             | 0x7fb8dd601140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x557059f82428 [f3 90 eb d4 0f 1f 40 00]: lua_take_global_lock+0x48/0x4c
             | 0x557059f85d97 [e9 f2 fe ff ff 0f 1f 40]: hlua_ctx_init+0x167/0x1e4
             | 0x557059f8840c [85 c0 0f 84 bc 00 00 00]: main+0x3f44c
             | 0x55705a01e534 [85 c0 74 69 48 8b 5d 20]: sample_process+0x54/0x107
             | 0x55705a01f342 [48 85 c0 74 59 64 48 63]: sample_fetch_as_type+0x42/0xc1
             | 0x55705a03b50b [48 89 c6 48 85 c0 0f 84]: sess_build_logline+0x248b/0x40f3
             | 0x55705a098ff4 [4c 8b a3 88 00 00 00 48]: main+0x150034
             | 0x557059ff92f9 [4c 8b 0c 24 48 8b 4c 24]: main+0xb0339
             | 0x557059ffb9d6 [4c 8b 44 24 18 83 f8 07]: http_process_req_common+0xa6/0x1035
             | 0x557059fe58b8 [85 c0 0f 85 fc f7 ff ff]: process_stream+0x1f68/0x35e9
 >Thread 6 : id=0x7fb8d2121700 act=1 glob=1 wq=1 rq=1 tl=1 tlsz=9 rqsz=706
      1/6    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=20990474533 now=100959756450 diff=79969281917
             curr_task=0x7fb73bea7160 (task) calls=1 last=0
               fct=0x557059fe3950(process_stream) ctx=0x7fb73bea6d70
             strm=0x7fb73bea6d70,c00 src=10.0.1.4 fe=http-in be=http-in dst=unknown
             txn=0x7fb73bea73c0,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=40d08002 rqa=30 rpf=80000000 rpa=0
             scf=0x7fb73bea6d00,EST,20 scb=0x7fb73bea7280,INI,21
             af=(nil),0 sab=(nil),0
             cof=0x7fb7e3f0bf90,80000300:H1(0x7fb89bf955e0)/RAW((nil))/tcpv4(6349)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
             Current executing Lua from a stream analyser -- 
             call trace(19):
             | 0x55705a0aefcb [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x55705a0afa66 [48 8b 05 e3 aa 1d 00 48]: debug_handler+0x66/0x10b
             | 0x7fb8dd601140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x557059f8242a [eb d4 0f 1f 40 00 41 54]: lua_take_global_lock+0x4a/0x4c
             | 0x557059f85d97 [e9 f2 fe ff ff 0f 1f 40]: hlua_ctx_init+0x167/0x1e4
             | 0x557059f8840c [85 c0 0f 84 bc 00 00 00]: main+0x3f44c
             | 0x55705a01e534 [85 c0 74 69 48 8b 5d 20]: sample_process+0x54/0x107
             | 0x55705a01f342 [48 85 c0 74 59 64 48 63]: sample_fetch_as_type+0x42/0xc1
             | 0x55705a03b50b [48 89 c6 48 85 c0 0f 84]: sess_build_logline+0x248b/0x40f3
             | 0x55705a098ff4 [4c 8b a3 88 00 00 00 48]: main+0x150034
             | 0x557059ff92f9 [4c 8b 0c 24 48 8b 4c 24]: main+0xb0339
             | 0x557059ffb9d6 [4c 8b 44 24 18 83 f8 07]: http_process_req_common+0xa6/0x1035
             | 0x557059fe58b8 [85 c0 0f 85 fc f7 ff ff]: process_stream+0x1f68/0x35e9
 >Thread 7 : id=0x7fb8d1920700 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=3 rqsz=503
      1/7    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=20144821466 now=96965659318 diff=76820837852
             curr_task=0x7fb74125de80 (task) calls=1 last=0
               fct=0x557059fe3950(process_stream) ctx=0x7fb74125da90
             strm=0x7fb74125da90,c00 src=10.0.1.4 fe=http-in be=http-in dst=unknown
             txn=0x7fb74125dff0,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=40d08002 rqa=30 rpf=80000000 rpa=0
             scf=0x7fb7692a0e10,EST,20 scb=0x7fb74125df50,INI,21
             af=(nil),0 sab=(nil),0
             cof=0x7fb8cf6f1b10,80000300:H1(0x7fb7692aa2f0)/RAW((nil))/tcpv4(10173)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
             Current executing Lua from a stream analyser -- stack traceback:

             call trace(18):
             | 0x55705a0aefcb [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x55705a0afa66 [48 8b 05 e3 aa 1d 00 48]: debug_handler+0x66/0x10b
             | 0x7fb8dd601140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x557059f8242a [eb d4 0f 1f 40 00 41 54]: lua_take_global_lock+0x4a/0x4c
             | 0x557059f8842f [e9 f3 f9 ff ff 0f 1f 40]: main+0x3f46f
             | 0x55705a01e534 [85 c0 74 69 48 8b 5d 20]: sample_process+0x54/0x107
             | 0x55705a01f342 [48 85 c0 74 59 64 48 63]: sample_fetch_as_type+0x42/0xc1
             | 0x55705a03b50b [48 89 c6 48 85 c0 0f 84]: sess_build_logline+0x248b/0x40f3
             | 0x55705a098ff4 [4c 8b a3 88 00 00 00 48]: main+0x150034
             | 0x557059ff92f9 [4c 8b 0c 24 48 8b 4c 24]: main+0xb0339
             | 0x557059ffb9d6 [4c 8b 44 24 18 83 f8 07]: http_process_req_common+0xa6/0x1035
             | 0x557059fe58b8 [85 c0 0f 85 fc f7 ff ff]: process_stream+0x1f68/0x35e9
  Thread 8 : id=0x7fb8d111f700 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=25 rqsz=504
      1/8    stuck=0 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=20361036268 now=84727927263 diff=64366890995
             curr_task=0x7fb8c0933120 (task) calls=1 last=0
               fct=0x557059fe3950(process_stream) ctx=0x7fb8c0932d30
             strm=0x7fb8c0932d30,c00 src=10.0.1.4 fe=http-in be=http-in dst=unknown
             txn=0x7fb8c3ce9a40,3000 txn.req=MSG_BODY,4c txn.rsp=MSG_RPBEFORE,0
             rqf=40d08002 rqa=30 rpf=80000000 rpa=0
             scf=0x7fb8c3cebee0,EST,20 scb=0x7fb8c3ce9930,INI,21
             af=(nil),0 sab=(nil),0
             cof=0x7fb8abf12e60,80000300:H1(0x7fb8c3ce4af0)/RAW((nil))/tcpv4(8629)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1)
             Current executing Lua from a stream analyser -- stack traceback:
                 /usr/local/etc/haproxy/parse_basic.lua:12: in function line 9
                 [C]: in method 'gsub'
                 /usr/local/etc/haproxy/parse_basic.lua:9: in global 'dec'
                 /usr/local/etc/haproxy/parse_basic.lua:27: in function line 22
[NOTICE]   (1) : haproxy version is 2.6.9-3a3700a
[NOTICE]   (1) : path to executable is /usr/local/sbin/haproxy
[ALERT]    (1) : Current worker (25) exited with code 134 (Aborted)
[ALERT]    (1) : exit-on-failure: killing every processes with SIGTERM

Right before this happened, a snapshot of docker stats on the AWS instance showed haproxy using 282% CPU and 4.349 GiB of memory.

To Reproduce
Steps to reproduce the behavior:

  1. Run the scalability suite with 100 concurrent users.

Expected behavior
Haproxy shouldn't consume this many resources. Also, the container should restart automatically instead of just hanging when this issue happens.
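
For illustration, here is a minimal docker-compose sketch of one way to approach automatic recovery. The service name, port, and probe command are assumptions, not the actual CHT configuration; note also that a plain Docker healthcheck only marks a container unhealthy, so an external supervisor or orchestrator-level restart policy would still be needed to actually restart a hung container:

    haproxy:
      restart: unless-stopped        # restarts on process exit, but not on a hang
      healthcheck:
        # hypothetical probe: fail if haproxy stops answering on its front-end port
        test: ["CMD-SHELL", "nc -z localhost 5984 || exit 1"]
        interval: 30s
        timeout: 5s
        retries: 3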

Environment

garethbowen commented 1 year ago

@dianabarsan I'm having trouble reproducing this locally. This is my equivalent screenshot, running master, also on a Ryzen 9 5900X:

[screenshot of docker stats]

This is similar to when I run 4.1.x.

The differences that stand out comparing mine vs yours:

- CPU: 10% vs 100%
- MEM USAGE: 122 MiB vs 2 MiB
- Net I/O and Block I/O: yours is basically non-existent

Firstly, let's make sure we're doing the same thing. Are you just running npm run wdio-local? Was that screenshot taken during startup, or while running the tests? Can you reproduce it and check the haproxy logs?

Any other ideas?

dianabarsan commented 1 year ago

I've installed the latest master on https://march1-archv3-scalability.dev.medicmobile.org and I'm running the scalability test against it. These are the numbers I'm seeing:

CONTAINER ID   NAME                    CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
8d3c55f16902   compose_nginx_1         0.55%     189.5MiB / 14.64GiB   1.26%     34.6MB / 32.5MB   3.26MB / 16.4kB   9
91cb5aa3c250   compose_api_1           121.96%   550.3MiB / 14.64GiB   3.67%     197MB / 61.7MB    0B / 16.4kB       12
3b154e400283   compose_sentinel_1      0.00%     46.57MiB / 14.64GiB   0.31%     1.12MB / 84.4kB   1.78MB / 0B       12
73cbfbf7a697   compose_haproxy_1       521.87%   5.884GiB / 14.64GiB   40.19%    228MB / 227MB     0B / 24.6kB       13
bf878b915513   compose_healthcheck_1   0.23%     23.91MiB / 14.64GiB   0.16%     1.69MB / 875kB    28.3MB / 12.6MB   3

I added a script that queries docker stats every second and writes the output to a file. The table above was the last entry before the whole AWS instance became unresponsive.
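
For reference, a minimal sketch of that kind of polling loop (the log filename is an assumption; docker stats --no-stream prints one snapshot and exits rather than streaming):

    #!/bin/sh
    # Append a timestamped docker stats snapshot to a log file once per second.
    while true; do
      date >> docker-stats.log
      docker stats --no-stream >> docker-stats.log
      sleep 1
    done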

garethbowen commented 1 year ago

Was there anything in the haproxy logs? 0B written to BLOCK I/O just seems wrong?

garethbowen commented 1 year ago

Theory... lua is locking up on this line:

| 0x557059f82403 [48 85 d0 75 20 48 89 f0]: lua_take_global_lock+0x23/0x4c

In 2.4 haproxy introduced multithreading in lua: https://www.haproxy.com/blog/announcing-haproxy-2-4/

They also introduced a new directive, lua-load-per-thread, which loads the lua script once per thread; that's what we want for parsing each request. Replacing lua-load with lua-load-per-thread seems to work, but because I didn't manage to reproduce this the first time, I'm not sure whether it's better or the same. This thread-lock explanation would also help explain why it shows up so badly in the scalability suite.
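
As a sketch, the change amounts to swapping the load directive in the global section of the haproxy config (the script path is taken from the stack trace above; the actual CHT config may differ):

    global
      # before: one Lua state shared by all threads, guarded by a global lock
      # lua-load /usr/local/etc/haproxy/parse_basic.lua
      # after: one Lua state per thread, avoiding contention on the global lock
      lua-load-per-thread /usr/local/etc/haproxy/parse_basic.lua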

More info in this haproxy issue: https://github.com/haproxy/haproxy/issues/1625

@dianabarsan Can you try this and see if it's any better?

dianabarsan commented 1 year ago

I've made the change in the haproxy config, and I'm running the scalability test locally:

So far, these are the stats while Sentinel is processing all pushed docs: [screenshot of docker stats]

I ran the scalability suite against a local instance and recorded docker stats for the duration. This is the CPU % usage for the haproxy container:

[graph: haproxy container CPU % during the local run]

I'll run the same test against the test instance (which is now unresponsive).

garethbowen commented 1 year ago

@mrjones-plip This change does not affect any released version of CHT, as the regression was introduced with the not-yet-released haproxy upgrade.

mrjones-plip commented 1 year ago

Thanks for catching this @garethbowen !

dianabarsan commented 1 year ago

I ran the scalability test against this branch https://github.com/medic/cht-core/tree/8130-haproxy-lua-check

and this is the graph of haproxy CPU usage with 100 concurrent users: [graph of haproxy container CPU %]

The instance didn't crash and the test finished successfully, though the CPU usage is still very high.

dianabarsan commented 1 year ago

I've run the scalability suite on the same machine as above, but with 4.1.0 deployed, and this is the CPU usage for haproxy: [graph of haproxy container CPU %]

The two graphs are strikingly similar, so I believe the lua script change has resolved the regression.

It's still worrying that haproxy uses so much CPU time.

garethbowen commented 1 year ago

Thanks @dianabarsan - that's great testing.

Let's get this issue resolved before 4.2.0, and re-evaluate our auditing solution later?

garethbowen commented 1 year ago

Ready for AT in 8130-haproxy-lua-lock

This is a performance improvement which should not change functionality at all. The AT steps are the same as discussed in multiple comments: https://github.com/medic/cht-core/issues/8076#issuecomment-1439060210

garethbowen commented 1 year ago

@dianabarsan and I have agreed between us that we've ATed this, so no further testing is needed.