Closed: idl0r closed this issue 2 years ago
If you guys need a coredump, just let me know and I'll mail it.
Log:
n095147 is the host itself.
Thread 1 is about to kill the process.
*>Thread 1 : id=0x7f1f0d95cfc0 act=1 glob=1 wq=1 rq=1 tl=1 tlsz=2 rqsz=142
stuck=1 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=53690908167 now=56299777281 diff=2608869114
curr_task=0x564e124471c0 (task) calls=1406 last=0
fct=0x564e0cb5e830(process_peer_sync) ctx=0x564e0ddeeea0
call trace(16):
| 0x564e0cc123e7 [48 83 c4 10 5b 5d 41 5c]: wdt_handler+0x107/0x114
| 0x7f1f0e323730 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x12730
| 0x564e0cb5ea47 [e9 24 fe ff ff 80 e6 08]: process_peer_sync+0x217/0x8fa
Thread 2 : id=0x7f1f0d939700 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=0 rqsz=7
stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=25392108115 now=27134040820 diff=1741932705
curr_task=0x7f1f0809d9e0 (task) calls=8 last=0
fct=0x564e0cae7660(process_stream) ctx=0x7f1f08099f00
strm=0x7f1f08099f00,25006 src=<PEER> fe=n095147 be=n095147 dst=unknown
txn=(nil),0 txn.req=-,0 txn.rsp=-,0
rqf=80a860 rqa=0 rpf=8040a028 rpa=0 sif=EST,2c0060 sib=CLO,80072
af=0x564e19ef6570,6 csf=(nil),0
ab=(nil),0 csb=0x7f1f08071f10,8200
cof=(nil),0:NONE((nil))/NONE((nil))/NONE(0)
cob=0x7f1edc203a70,105c2300:PASS(0x7f1f080a8080)/RAW((nil))/tcpv4(2562)
Thread 3 : id=0x7f1ef117c700 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=1 rqsz=8
stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=42699764579 now=42752268106 diff=52503527
curr_task=0x7f1ee80788e0 (task) calls=1 last=0
fct=0x564e0cae7660(process_stream) ctx=0x7f1ee80bdf60
strm=0x7f1ee80bdf60,e src=127.0.0.1 fe=genlisten_62041-provid_health_check_cache_tls be=genlisten_62041-provid_health_check_cache_tls dst=n095022
txn=(nil),0 txn.req=-,0 txn.rsp=-,0
rqf=908000 rqa=0 rpf=80000000 rpa=0 sif=EST,200020 sib=ASS,30
af=(nil),0 csf=0x7f1ee806eca0,8200
ab=(nil),0 csb=0x7f1ee80aaa40,0
cof=0x7f1ed8073eb0,80001300:PASS(0x7f1ee808a930)/RAW((nil))/tcpv4(2547)
cob=0x7f1ee007d450,18c00000:NONE(0x7f1ee80aaa40)/SSL(0x7f1edc2200c0)/NONE(2864)
Thread 4 : id=0x7f1ef097b700 act=1 glob=1 wq=1 rq=0 tl=1 tlsz=19 rqsz=24
stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=7565911646 now=7622932969 diff=57021323
curr_task=0x564e148d97c0 (task) calls=2589 last=0
fct=0x564e0cb87d40(process_chk) ctx=0x564e17994be8
Thread 5 : id=0x7f1eef234700 act=1 glob=0 wq=1 rq=1 tl=1 tlsz=2 rqsz=21
stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=21345221572 now=21410249319 diff=65027747
curr_task=0x7f1edc1d1910 (task) calls=1 last=0
fct=0x564e0cae7660(process_stream) ctx=0x7f1edc1b7840
strm=0x7f1edc1b7840,e src=127.0.0.1 fe=genlisten_62041-provid_health_check_cache_tls be=genlisten_62041-provid_health_check_cache_tls dst=n095021
txn=(nil),0 txn.req=-,0 txn.rsp=-,0
rqf=908000 rqa=0 rpf=80000000 rpa=0 sif=EST,200020 sib=ASS,30
af=(nil),0 csf=0x7f1edc17ed90,8200
ab=(nil),0 csb=0x7f1edc115cc0,0
cof=0x7f1edc093c90,80001300:PASS(0x7f1edc094d20)/RAW((nil))/tcpv4(2828)
cob=0x7f1edc0edc50,18c00000:NONE(0x7f1edc115cc0)/SSL(0x7f1edc134af0)/NONE(2863)
Thread 6 : id=0x7f1eeea33700 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=2 rqsz=3
stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=29571746163 now=29655252504 diff=83506341
curr_task=0x7f1ed80d1450 (task) calls=1 last=0
fct=0x564e0cae7660(process_stream) ctx=0x7f1ed80cb4d0
strm=0x7f1ed80cb4d0,e src=127.0.0.1 fe=genlisten_62041-provid_health_check_cache_tls be=genlisten_62041-provid_health_check_cache_tls dst=n095022
txn=(nil),0 txn.req=-,0 txn.rsp=-,0
rqf=908000 rqa=0 rpf=80000000 rpa=0 sif=EST,200020 sib=ASS,30
af=(nil),0 csf=0x7f1ed8056160,8200
ab=(nil),0 csb=0x7f1ed80dfe80,0
cof=0x7f1edc095cc0,80001300:PASS(0x7f1ed805b0f0)/RAW((nil))/tcpv4(2827)
cob=0x7f1ed80e1520,18c00000:NONE(0x7f1ed80dfe80)/SSL(0x7f1edc121260)/NONE(2862)
Thread 7 : id=0x7f1eee232700 act=1 glob=0 wq=1 rq=0 tl=1 tlsz=12 rqsz=15
stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=7607516498 now=7658831733 diff=51315235
curr_task=0x7f1ed409e5e0 (task) calls=1 last=0
fct=0x564e0cae7660(process_stream) ctx=0x7f1ed4099dd0
strm=0x7f1ed4099dd0,e src=127.0.0.1 fe=genlisten_62041-provid_health_check_cache_tls be=genlisten_62041-provid_health_check_cache_tls dst=n095020
txn=(nil),0 txn.req=-,0 txn.rsp=-,0
rqf=908000 rqa=0 rpf=80000000 rpa=0 sif=EST,200020 sib=ASS,30
af=(nil),0 csf=0x7f1ed40e9830,8200
ab=(nil),0 csb=0x7f1ed4096a80,0
cof=0x7f1edc062620,80001300:PASS(0x7f1ed408aa50)/RAW((nil))/tcpv4(2551)
cob=0x7f1ed41cb1a0,18c00000:NONE(0x7f1ed4096a80)/SSL(0x564e19eed860)/NONE(2580)
Thread 8 : id=0x7f1eeda31700 act=1 glob=0 wq=0 rq=1 tl=1 tlsz=0 rqsz=4
stuck=0 prof=0 harmless=0 wantrdv=0
cpu_ns: poll=8160125688 now=8222233286 diff=62107598
curr_task=0x7f1ecc08ca70 (task) calls=4 last=0
fct=0x564e0cae7660(process_stream) ctx=0x7f1ecc06f0d0
strm=0x7f1ecc06f0d0,4304e src=127.0.0.1 fe=genlisten_62041-provid_health_check_cache_tls be=genlisten_62041-provid_health_check_cache_tls dst=n095022
txn=(nil),0 txn.req=-,0 txn.rsp=-,0
rqf=84a029 rqa=0 rpf=8000a800 rpa=0 sif=CLO,280022 sib=EST,2000b0
af=(nil),0 csf=0x7f1ecc0a8330,91a0
ab=(nil),0 csb=0x7f1ecc0a88a0,8080
cof=0x7f1ed41c9030,801c1300:PASS(0x7f1ecc06c9d0)/RAW((nil))/tcpv4(2619)
cob=0x7f1edc092bf0,10082000:PASS(0x7f1ecc08e410)/NONE((nil))/NONE(2528)
[NOTICE] (777) : haproxy version is 2.4.12-4b7772e
[NOTICE] (777) : path to executable is /usr/sbin/haproxy
[ALERT] (777) : Current worker #1 (780) exited with code 134 (Aborted)
[ALERT] (777) : exit-on-failure: killing every processes with SIGTERM
[WARNING] (777) : All workers exited. Exiting... (134)
Thank you Christian. If you could go up into process_peer_sync() and issue "p/x ps->lock", that would be cool. The function has only one lock and one unlock, and it cannot bypass them, so it seriously looks like the lock itself got corrupted, likely the same thing we're looking for in other places. And since these locks usually have only a few bits set, we could possibly tell whether what you find there looks like a pointer or like an extra flag, which would allow us to narrow the issue down.
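For reference, a rough sketch of that inspection (the core path, PID and frame number below are placeholders, not taken from this report): open the core against the matching binary, select the stuck thread, walk up to the process_peer_sync() frame, and print the lock word in hex.

```
$ gdb /usr/sbin/haproxy /tmp/core.<pid>
(gdb) thread 1
(gdb) bt                  # locate the frame number of process_peer_sync
(gdb) frame 2             # placeholder: whichever frame that turns out to be
(gdb) p/x ps->lock
```

A healthy lock word would show only the few expected bits set; a pointer-looking value there would support the corruption theory described above.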
I'm afraid we have to wait until it happens again. The dump was in /tmp/ and my colleague rebooted that host, so /tmp/ got wiped. I hadn't saved it to a different location :(
No problem Christian, don't feel sad for this :-) I'm totally convinced it's the same issue as the other one anyway.
So in the end there was one bug in the scheduler affecting peers and a few others, and more importantly a use-after-free issue in SPOE causing random memory corruption.
Oh and the cause was finally the same as #1494, so marking duplicate and closing.
Detailed Description of the Problem
Details are not clear yet. It looks like it is peers-related, though.
Expected Behavior
Shouldn't crash
Steps to Reproduce the Behavior
As I said, not clear yet
Do you have any idea what may have caused this?
No response
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of haproxy -vv
Last Outputs and Backtraces
Additional Information
No response