tacho / conman

Automatically exported from code.google.com/p/conman
GNU General Public License v3.0
1 stars 0 forks source link

conmand segfaults when using libipmiconsole connections #7

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

First with conman 0.2.5 and then with 0.2.6 and the latest freeipmi, I get a 
segfault when I start conmand - whether through an init script or straight on 
the command line. /var/log/messages shows the following:
kernel: conmand[487]: segfault at 00007fff567b3ff0 rip 00002b365c0b75f4 rsp 
00007fff567b3fe0 error 6

Can you point me to how I might be able to get around this? This is on RHEL 5.6 
x86_64. I've attached an strace.
Thanks,
Cameron

Original issue reported on code.google.com by cameron....@gmail.com on 24 Feb 2011 at 5:46

Attachments:

GoogleCodeExporter commented 9 years ago
Sorry - this is probably the wrong place for this and I can't see how to delete 
the ticket. I saw the users group link after submitting this.

Original comment by cameron....@gmail.com on 24 Feb 2011 at 5:54

GoogleCodeExporter commented 9 years ago
No problem.  I'll leave this open until the root cause has been identified.

Discussion is at 
<http://groups.google.com/group/conman-users/t/e20dba51e31c57e5>.

Original comment by chris.m.dunlap on 24 Feb 2011 at 9:25

GoogleCodeExporter commented 9 years ago
Once the ipmi.ctx has been submitted via ipmiconsole_engine_submit() from 
within initiate_ipmi_connect(), there is a race condition where the 
connect_ipmi_obj() callback can be invoked by the ipmiconsole_engine thread 
before ipmi.state has been updated in the next line.  This can cause a pending 
connection to be re-initiated ad nauseam.

Also, the connect_ipmi_obj() callback can be invoked from either the timer 
thread or the ipmiconsole_engine thread.  If fail_ipmi_connect() is called from 
within the ipmiconsole_engine thread, then the ipmi.ctx memory can be 
de-allocated while it is still in use by the ipmiconsole_engine thereby making 
the engine very unhappy.

Original comment by chris.m.dunlap on 4 Mar 2011 at 1:38

GoogleCodeExporter commented 9 years ago
A while back, Matthieu Pérotin encountered the same problem and sent me a most 
thorough analysis.  My bad for not getting this fixed sooner.

Matthieu writes:

The problem here resides in the fact that the patch I submitted destroys
the contexts in fail_ipmi_connect, in order to maintain a consitency
between the state of the IPMI connection and the need of a context. The
source of the problem here was that it was impossible to run this part
of conman in a debugger: an assert in create_ipmi_ctx would fail, the
function being called on a pre existing context.

Anyway, what I did not forsee, and did not witness due to a lack of
sufficient testing, was that fail_ipmi_connect could be called from the
libfreeipmi, hence from the context. Hence, destroying the context in
here was a very bad idea... => SEGFAULT.

The whole backtrace is as follow:
#0  0xb7fe2424 in __kernel_vsyscall ()
#1  0xb7e49751 in *__GI_raise (sig=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0xb7e4cb82 in *__GI_abort () at abort.c:92
#3  0xb7e428b8 in *__GI___assert_fail (assertion=0x8069738 "ipmi->aux.ipmi.ctx 
== ((void *)0)", 
    file=0x80693b8 "server-ipmi.c", line=451, function=0x8069937 "create_ipmi_ctx") at assert.c:81
#4  0x0805c427 in create_ipmi_ctx (ipmi=0x806fdf8) at server-ipmi.c:451
#5  0x0805c2bd in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:396
#6  0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#7  0xb7f7bd64 in __ipmiconsole_ctx_connection_cleanup (c=0x806e5c0, 
    session_submitted=<value optimized out>) at ipmiconsole_ctx.c:837
#8  0xb7f714d1 in ipmiconsole_engine_submit (c=0x806e5c0, callback=0x805c196 
<connect_ipmi_obj>, 
    callback_arg=0x806fdf8) at ipmiconsole.c:229
#9  0x0805c316 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:402
#10 0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#11 0x0805bfd6 in open_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:309
#12 0x080538ae in reopen_obj (obj=0x806fdf8) at server-obj.c:266
#13 0x0804b95b in open_objs (conf=0x806d028) at server.c:699
#14 0x0804a745 in main (argc=2, argv=0xbffff794) at server.c:152

In this backtrace the same problem than with issue 1 occurs, the assert
fails, and prevents debugging... lets comment out the assert to see what
is happening, like in the compiled daemon.

To me, there is a first problem, connect_ipmi_obj (the callback
function) calls initiate_ipmi_connect a second time because the state
was left to DOWN before submitting it the ipmi engine, but that is
another story, that causes another bug that we are just about to
witness:

Now, I get Segfaults happening with too many socket pairs:
Program received signal SIGSEGV, Segmentation fault.
0xb7f7da0c in _debug (
    fmt=0xbf801fe8 "(ipmiconsole_ctx.c, ipmiconsole_ctx_connection_setup, 442): socketpair: Too many open files", ap=0xbf801fb4 "") at ipmiconsole_debug.c:108
108 {
and backtraces with thousands of calls...

#0  0xb7f7da0c in _debug (
    fmt=0xbf801fe8 "(ipmiconsole_ctx.c, ipmiconsole_ctx_connection_setup, 442): socketpair: Too many open files", ap=0xbf801fb4 "") at ipmiconsole_debug.c:108
#1  0xb7f7e1a1 in ipmiconsole_debug (
    fmt=0xbf801fe8 "(ipmiconsole_ctx.c, ipmiconsole_ctx_connection_setup, 442): socketpair: Too many open files") at ipmiconsole_debug.c:198
#2  0xb7f78ba3 in ipmiconsole_ctx_connection_setup (c=0x1730e048) at 
ipmiconsole_ctx.c:529
#3  0xb7f71450 in ipmiconsole_engine_submit (c=0x1730e048, callback=0x805c196 
<connect_ipmi_obj>, 
    callback_arg=0x806fdf8) at ipmiconsole.c:191
#4  0x0805c316 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:403
#5  0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#6  0xb7f7bd64 in __ipmiconsole_ctx_connection_cleanup (c=0x1730db90, 
    session_submitted=<value optimized out>) at ipmiconsole_ctx.c:837
#7  0xb7f714d1 in ipmiconsole_engine_submit (c=0x1730db90, callback=0x805c196 
<connect_ipmi_obj>, 
    callback_arg=0x806fdf8) at ipmiconsole.c:229
#8  0x0805c316 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:403
#9  0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#10 0xb7f7bd64 in __ipmiconsole_ctx_connection_cleanup (c=0x1730d6d8, 
    session_submitted=<value optimized out>) at ipmiconsole_ctx.c:837
#11 0xb7f714d1 in ipmiconsole_engine_submit (c=0x1730d6d8, callback=0x805c196 
<connect_ipmi_obj>, 
    callback_arg=0x806fdf8) at ipmiconsole.c:229
...

As a quick fix, let's pass the state to CONMAN_IPMI_PENDING before
submitting to the engine, and now I do find the "correct" bug, which
raises an assert in libfreeipmi:

conmand: ipmiconsole_ctx.c:415: _ipmiconsole_ctx_connection_init:
Assertion `c->magic == 0x74AB8831' failed.

Program received signal SIGABRT, Aborted.
0xb7fe2424 in __kernel_vsyscall ()
(gdb) backtrace
#0  0xb7fe2424 in __kernel_vsyscall ()
#1  0xb7e49751 in *__GI_raise (sig=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0xb7e4cb82 in *__GI_abort () at abort.c:92
#3  0xb7e428b8 in *__GI___assert_fail (assertion=0xb7fa4025 "c->magic == 
0x74AB8831", 
    file=0xb7fa5fdc "ipmiconsole_ctx.c", line=415, function=0xb7fa63e0 "_ipmiconsole_ctx_connection_init")
    at assert.c:81
#4  0xb7f773cc in _ipmiconsole_ctx_connection_init (c=0x6) at 
ipmiconsole_ctx.c:415
#5  0xb7f7c0a2 in __ipmiconsole_ctx_connection_cleanup (c=0x806e5c0, 
    session_submitted=<value optimized out>) at ipmiconsole_ctx.c:948
#6  0xb7f714d1 in ipmiconsole_engine_submit (c=0x806e5c0, callback=0x805c196 
<connect_ipmi_obj>, 
    callback_arg=0x806fdf8) at ipmiconsole.c:229
#7  0x0805c323 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:403
#8  0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#9  0x0805bfd6 in open_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:309
#10 0x080538ae in reopen_obj (obj=0x806fdf8) at server-obj.c:266
#11 0x0804b95b in open_objs (conf=0x806d028) at server.c:699
#12 0x0804a745 in main (argc=2, argv=0xbffff3e4) at server.c:152

Indeed, the context was destroyed a few moments before by a call to
fail_ipmi_connect:

#0  fail_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:509
#1  0x0805c276 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:381
#2  0xb7f7bd64 in __ipmiconsole_ctx_connection_cleanup (c=0x806e5c0, 
    session_submitted=<value optimized out>) at ipmiconsole_ctx.c:837
#3  0xb7f714d1 in ipmiconsole_engine_submit (c=0x806e5c0, callback=0x805c196 
<connect_ipmi_obj>, 
    callback_arg=0x806fdf8) at ipmiconsole.c:229
#4  0x0805c323 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:403
#5  0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#6  0x0805bfd6 in open_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:309
#7  0x080538ae in reopen_obj (obj=0x806fdf8) at server-obj.c:266
#8  0x0804b95b in open_objs (conf=0x806d028) at server.c:699
#9  0x0804a745 in main (argc=2, argv=0xbffff3e4) at server.c:152

Without this assert, the program, without surprise, segfaults a few
instructions later.

Original comment by chris.m.dunlap on 4 Mar 2011 at 1:51

GoogleCodeExporter commented 9 years ago
Hi,
I'm sorry to have left this open for so long :( I have continued to work on 
this bug on my side and came up with something that seems to fix the problem, 
although I'm not 100% satisfied with the solution.

I attach the patch I have been using for the past few months, with a minor 
change (I really scratched my head in order to come up with clear explanations 
and re-did quite a few gdb sessions in order to reproduce the bug...). As I 
explained to Chris at the time, I had this segfault when using conman + 
freeipmi and non resolving hosts.

So, here comes the fun:

1. The first chunk is about a trivial error in text formatting, that is not 
related to the issue at hand, but could still disrupt conmand in some (very 
unlikely) cases.

2. The second chunk fixes the loop between initiate_ipmi_connect and 
connect_ipmi_obj. By changing the state to PENDING, before submitting to the 
engine, we ensure that the second call to connect_impi_obj will not be the same.

3. Now, to fix the issue raised in the analysis I submitted in the post quoted 
by Chris just above, I had to:
  a. remove the context destruction in fail_ipmi_connect (chunk #5)
  b. remove the assert in create_ipmi_ctx and replace it by a test and a context creation if the context exists (chunk #3)

With this modification, the context still exists when returning to 
_ipmiconsole_ctx_connection_init and everything goes fine.

4. However, one last thing remains to be done, and this is the thing that does 
not satisfy me completely. In chunk #4, I add a test at the beginning of 
fail_ipmi_connect, to return without doing anything if ipmi->aux.ipmi.timer is 
not equal to -1 and if the state is DOWN. In fact, when analyzing the stack 
trace (see above), one can see that connect_ipmi_obj gets called twice:
once by open_ipmi_obj, and once, upon callback by the engine, by 
__ipmiconsole_ctx_connection_cleanup.

Upon failing, fail_ipmi_connect is called. The first time (exiting from the 
callback) everything normal is happening, and a call to connect_ipmi_obj is 
rescheduled. And just after that, fail_ipmi_connect is called again, from the 
very first connect_ipmi_obj. This second time, we get a failed assert, as 
ipmi->aux.ipmi.timer is not equal to -1. Moreover, we really do not have 
anything to do as we've just done everything the moment before. Hence the test 
and the return, at the beginning of this function.

If we do nothing, we get log messages twice, two calls to 
tpoll_timeout_relative and a double fold increase of the delay.

To me, the root of the problem here is that connect_ipmi_obj is called as a 
callback to itself, and maybe the lack of states in the automaton makes it 
difficult to differentiate between the causes of the errors.

The way I implemented a solution to point 4 is working, although, to say the 
least, I am not sure it is the best way it could be implemented. I would be 
more than happy to get your opinion on that point, and I'm really open to 
discussion. 

Please find the patch attached, and do not hesitate to comment !

Matthieu.

Original comment by matthieu...@bull.net on 8 Mar 2011 at 4:17

Attachments:

GoogleCodeExporter commented 9 years ago
Matthieu,
I retried conman using your patch and it appears to be working fine. 
Thanks!
Cameron

Original comment by cameron....@gmail.com on 9 Mar 2011 at 12:34

GoogleCodeExporter commented 9 years ago
Hmm, perhaps I spoke too soon. I just noticed that after I leave the console, I 
can't seem to get back in. I'm using agetty in the inittab file and once I've 
left the console, the agetty process no longer shows up and an 'init q' doesn't 
spawn it.

Original comment by cameron....@gmail.com on 9 Mar 2011 at 12:42

GoogleCodeExporter commented 9 years ago
Hey Matthieu,

I think most of this problem is attributable to concurrent access to a given 
ipmi_obj struct from the main thread and the ipmiconsole engine thread via 
connect_ipmi_obj().  Once ipmiconsole_engine_submit() is called, the 
ipmiconsole engine will start running the connect_ipmi_obj() callback before 
control is returned to initiate_ipmi_connect() in the main thread.  I've added 
a mutex to serialize access to key variables (ctx, state, timer, & delay) 
within the ipmi_obj struct (r1019).

That mutex will protect the transition to the PENDING state (in #2).  I've kept 
the state transition until after ipmiconsole_engine_submit() in case the 
submission fails (at which point it remains in the DOWN state).

Re: #4, the failed assert on ipmi->aux.ipmi.timer should no longer be possible 
with the mutex.  And I don't think multiple calls to fail_ipmi_connect() for 
the same connection can still occur.

Re: #3, Al says invoking ipmiconsole_ctx_destroy() from within the callback 
should be ok.  But destroying the ctx immediately before recreating it seems 
safer, so I've made those changes (r1017 & r1018).

Re: #1, yeah, oops.  Also, I've changed it to a non-fatal error which will 
retry the connection (r1013 & r1014) -- although it should be impossible to 
trigger now unless new states are added.

I've committed these changes to 
<https://conman.googlecode.com/svn/branches/issue-7>.  I'm hoping to do some 
testing at scale tomorrow.  Let me know if you have a chance to test this.

Original comment by chris.m.dunlap on 9 Mar 2011 at 4:07

GoogleCodeExporter commented 9 years ago
Hi Chris,

I had a quick test run on the issue 7 branch this morning, and witnessed a 
double lock of the same mutex.

You will find attached a debugging output.

I've set breakpoints on every functions of server-ipmi.c and every call to a 
x_pthread_mutex* function.

conmand is configured here with a single console, with an invalid bmc, which 
does not resolve.

As noted before, connect_ipmi_obj is called twice, there is two calls to 
x_pthread_mutex_lock, and the program remains locked.

In fact, in this testcase of a failure, a single thread is spawned and the two 
calls to connect_ipmi_obj are done within the same thread, hence the deadlock 
when a connection fails.

The solution you implemented with a mutex would work if all the calls happened 
in different threads, but unfortunately, it does not seem to always be the case.

Original comment by matthieu...@bull.net on 9 Mar 2011 at 4:00

Attachments:

GoogleCodeExporter commented 9 years ago
The test for a non-resolving host here makes for a simple reproducer.  Thanks!  
ipmiconsole_engine_submit() is behaving in a way I didn't expect.

Original comment by chris.m.dunlap on 9 Mar 2011 at 9:18

GoogleCodeExporter commented 9 years ago
Al made a fix to libipmiconsole yesterday that addresses the conmand deadlock 
with the non-resolving host:

  * libipmiconsole/src/ipmiconsole_ctx.c
  (__ipmiconsole_ctx_connection_cleanup): Only call the callback if
  an error occurred during a session submission, not if an error
  occurred in API land.

This change is available on the freeipmi trunk:

  svn://svn.savannah.gnu.org/freeipmi/trunk

I'm currently testing these changes on a configuration with 1303 consoles, 116 
of which are using IPMI.  No problems so far.

Original comment by chris.m.dunlap on 10 Mar 2011 at 10:41

GoogleCodeExporter commented 9 years ago
So far the latest fixes seem to work flawlessly for me. My only problem now is 
a large number of flaky Dell iDRACs.
Thanks all!

Original comment by cameron....@gmail.com on 15 Mar 2011 at 12:18

GoogleCodeExporter commented 9 years ago
Good to hear things are working.  I was talking to Al earlier today and our 
current plan is to try and have new releases out by the end of the month.

Original comment by chris.m.dunlap on 15 Mar 2011 at 1:00

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r1022.

Requires freeipmi-1.0.3 or later.

Original comment by chris.m.dunlap on 4 Apr 2011 at 11:32