Hi Christian,
would you by chance have the rest of the stderr output of this dump? It contains precious flags that will help figure out what happened. It's still short (in 2.9 we have significantly improved it), but it will already be better than nothing. Thanks!
Oh, sure:
Dec 2 07:10:35.625 n095138 haproxy[27188]: [NOTICE] (27188) : New worker (27279) forked
Dec 2 07:10:35.626 n095138 haproxy[27188]: [NOTICE] (27188) : Loading success.
...
Dec 2 07:10:39.000 n095138 haproxy[27279]: A bogus APPCTX [0x7f53dc05c5b0] is spinning at 3620630 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7f53dc05bf70,43000 src=192.168.255.27 fe=n095138 be=n095138 dst=<PEER> txn=(nil),0 txn.req=-,0 txn.rsp=-,0 rqf=848000 rqa=0 rpf=80048000 rpa=0 scf=0x7f53dc05bf00,CLO,30444 scb=0x7f53dc05c4f0,EST,1c041 af=(nil),0 sab=0x7f53dc05c5b0,7 cof=0x7f549005bba0,1c0000:PASS(0x7f53dc05be10)/NONE((nil))/NONE(-1) cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1) filters={} applet=0x558f154120a0(main+0x2c7870) handler=0x558f1525ffb0(main+0x115780)]
Dec 2 07:10:39.086 n095138 haproxy[27279]: [ALERT] (27279) : A bogus APPCTX [0x7f53dc05c5b0] is spinning at 3620630 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7f53dc05bf70,43000 src=192.168.255.27 fe=n095138 be=n095138 dst=<PEER> txn=(nil),0 txn.req=-,0 txn.rsp=-,0 rqf=848000 rqa=0 rpf=80048000 rpa=0 scf=0x7f53dc05bf00,CLO,30444 scb=0x7f53dc05c4f0,EST,1c041 af=(nil),0 sab=0x7f53dc05c5b0,7 cof=0x7f549005bba0,1c0000:PASS(0x7f53dc05be10)/NONE((nil))/NONE(-1) cob=(nil),0:NONE((nil))/NONE((nil))/NONE(-1) filters={} applet=0x558f154120a0(main+0x2c7870) handler=0x558f1525ffb0(main+0x115780)]
Dec 2 07:10:39.086 n095138 haproxy[27279]: call trace(10):
Dec 2 07:10:39.086 n095138 haproxy[27279]: | 0x558f151f5298 [0f 0b 66 0f 1f 44 00 00]: stream_dump_and_crash+0x228/0x37a
Dec 2 07:10:39.086 n095138 haproxy[27279]: | 0x558f1532278a [66 0f 1f 44 00 00 48 8b]: task_run_applet+0x29a/0xc18
Dec 2 07:10:39.086 n095138 haproxy[27279]: | 0x558f152d65cb [48 89 c3 eb 0e 4c 89 ce]: run_tasks_from_lists+0x33b/0x8eb
Dec 2 07:10:39.086 n095138 haproxy[27279]: | 0x558f152d6f06 [29 44 24 18 8b 54 24 18]: process_runnable_tasks+0x386/0x6f9
Dec 2 07:10:39.086 n095138 haproxy[27279]: | 0x558f152a315a [83 3d eb ac 1c 00 01 0f]: run_poll_loop+0x13a/0x55d
Dec 2 07:10:39.086 n095138 haproxy[27279]: | 0x558f152a3789 [48 8b 1d f0 97 17 00 4c]: main+0x158f59
Dec 2 07:10:39.086 n095138 haproxy[27279]: | 0x7f56beab0ea7 [64 48 89 04 25 30 06 00]: libpthread:+0x7ea7
Dec 2 07:10:39.086 n095138 haproxy[27279]: | 0x7f56be9d0a2f [48 89 c7 b8 3c 00 00 00]: libc:clone+0x3f/0x5a
Dec 2 07:10:39.728 n095138 haproxy[27188]: [NOTICE] (27188) : haproxy version is 2.8.4-a4ebf9d
Dec 2 07:10:39.728 n095138 haproxy[27188]: [NOTICE] (27188) : path to executable is /usr/sbin/haproxy
Dec 2 07:10:39.728 n095138 haproxy[27188]: [ALERT] (27188) : Current worker (27279) exited with code 132 (Illegal instruction)
Dec 2 07:10:39.728 n095138 haproxy[27188]: [ALERT] (27188) : exit-on-failure: killing every processes with SIGTERM
Dec 2 07:10:39.728 n095138 haproxy[27188]: [WARNING] (27188) : All workers exited. Exiting... (132)
Dec 2 07:10:43.909 n095138 haproxy[27808]: [NOTICE] (27808) : New worker (27885) forked
Dec 2 07:10:43.909 n095138 haproxy[27808]: [NOTICE] (27808) : Loading success.
Thank you! So what we can see is that the peers connection was aborted and closed. Maybe there's a race somewhere between error handling and message creation in peers; we need to investigate.
Well, reading the code, I'm able to reproduce something similar. I don't know if it is exactly the same bug because my crash happens when the PEER applet is the client. But peer_recv_msg() has a flaw: when a message is truncated on the message length, the data are not consumed and the applet waits for more data regardless of whether there is a pending shutdown or not.
@idl0r, if it is possible, you can try the following patch on top of 2.8.4:
diff --git a/src/peers.c b/src/peers.c
index c4760e68e..306e237ba 100644
--- a/src/peers.c
+++ b/src/peers.c
@@ -2403,7 +2403,7 @@ static inline int peer_recv_msg(struct appctx *appctx, char *msg_head, size_t ms
return 1;
incomplete:
- if (reql < 0) {
+ if (reql < 0 || (sc->flags & (SC_FL_SHUT_DONE|SC_FL_SHUT_WANTED))) {
/* there was an error or the message was truncated */
appctx->st0 = PEER_SESS_ST_END;
return -1;
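For readers not familiar with the peers applet internals, here is a minimal standalone sketch of the pattern the patch applies. The names below are hypothetical stand-ins (SHUT_WANTED/SHUT_DONE for the SC_FL_SHUT_* flags, conn_state for the stream connector and applet state), not the actual HAProxy API:

#include <sys/types.h>   /* ssize_t */
#include <stddef.h>      /* size_t */

#define SHUT_WANTED  0x01   /* hypothetical flag: shutdown requested on the connection */
#define SHUT_DONE    0x02   /* hypothetical flag: shutdown already performed */

enum sess_state { SESS_RUNNING, SESS_END };

struct conn_state {
    unsigned int flags;      /* shutdown flags, stand-in for sc->flags */
    enum sess_state st0;     /* applet session state, stand-in for appctx->st0 */
};

/* Returns 1 for a complete message, 0 to wait for more data, -1 to abort. */
static int recv_msg_sketch(struct conn_state *cs, ssize_t reql, size_t expected)
{
    if (reql >= 0 && (size_t)reql >= expected)
        return 1;                              /* full message received */

    /* Incomplete message: abort on a read error *or* when a shutdown is
     * pending or already done, which is the condition the patch adds.
     * Without the flag check the applet keeps polling a closed connection
     * for bytes that will never arrive, and spins, as seen in the
     * "bogus APPCTX" watchdog report above. */
    if (reql < 0 || (cs->flags & (SHUT_WANTED | SHUT_DONE))) {
        cs->st0 = SESS_END;
        return -1;
    }
    return 0;                                  /* genuinely wait for more data */
}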
BTW, I will push this fix because it remains valid. It is most probably your bug. @wtarreau, still good for 2.9.0?
Yep, hurry up :-)
Too much pressure... Pushed!
Sure, can do. I just doubt I can provide good feedback: it crashed on one of ~54 LBs, and only once.
Honestly, I'm pretty sure it is your bug. That is why I marked it as fixed.
Alright. Thanks! :)
Then if it's so rare, don't worry, chances are low that you'll see it again before the next update and the fix will be there. Thanks a lot for your report as usual!
Very welcome and thank you guys! :)
Detailed Description of the Problem
Hi,
it looks like peers is having some issues in 2.8(.4). We were running 2.7.x for quite some time without issues. Recently I upgraded everything to 2.8.4, and we got our first segfault since then.
Expected Behavior
No segfault
Steps to Reproduce the Behavior
N/A
Do you have any idea what may have caused this?
Peers
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of haproxy -vv
Last Outputs and Backtraces
Additional Information
Please let me know whom I may send the complete coredump, if required :)