Closed GoogleCodeExporter closed 9 years ago
Sorry - this is probably the wrong place for this and I can't see how to delete
the ticket. I saw the users group link after submitting this.
Original comment by cameron....@gmail.com
on 24 Feb 2011 at 5:54
No problem. I'll leave this open until the root cause has been identified.
Discussion is at
<http://groups.google.com/group/conman-users/t/e20dba51e31c57e5>.
Original comment by chris.m.dunlap
on 24 Feb 2011 at 9:25
Once the ipmi.ctx has been submitted via ipmiconsole_engine_submit() from
within initiate_ipmi_connect(), there is a race condition where the
connect_ipmi_obj() callback can be invoked by the ipmiconsole_engine thread
before ipmi.state has been updated in the next line. This can cause a pending
connection to be re-initiated ad nauseam.
Also, the connect_ipmi_obj() callback can be invoked from either the timer
thread or the ipmiconsole_engine thread. If fail_ipmi_connect() is called from
within the ipmiconsole_engine thread, then the ipmi.ctx memory can be
de-allocated while it is still in use by the ipmiconsole_engine thereby making
the engine very unhappy.
Original comment by chris.m.dunlap
on 4 Mar 2011 at 1:38
A while back, Matthieu Pérotin encountered the same problem and sent me a most
thorough analysis. My bad for not getting this fixed sooner.
Matthieu writes:
The problem here resides in the fact that the patch I submitted destroys
the contexts in fail_ipmi_connect, in order to maintain a consitency
between the state of the IPMI connection and the need of a context. The
source of the problem here was that it was impossible to run this part
of conman in a debugger: an assert in create_ipmi_ctx would fail, the
function being called on a pre existing context.
Anyway, what I did not forsee, and did not witness due to a lack of
sufficient testing, was that fail_ipmi_connect could be called from the
libfreeipmi, hence from the context. Hence, destroying the context in
here was a very bad idea... => SEGFAULT.
The whole backtrace is as follow:
#0 0xb7fe2424 in __kernel_vsyscall ()
#1 0xb7e49751 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0xb7e4cb82 in *__GI_abort () at abort.c:92
#3 0xb7e428b8 in *__GI___assert_fail (assertion=0x8069738 "ipmi->aux.ipmi.ctx
== ((void *)0)",
file=0x80693b8 "server-ipmi.c", line=451, function=0x8069937 "create_ipmi_ctx") at assert.c:81
#4 0x0805c427 in create_ipmi_ctx (ipmi=0x806fdf8) at server-ipmi.c:451
#5 0x0805c2bd in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:396
#6 0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#7 0xb7f7bd64 in __ipmiconsole_ctx_connection_cleanup (c=0x806e5c0,
session_submitted=<value optimized out>) at ipmiconsole_ctx.c:837
#8 0xb7f714d1 in ipmiconsole_engine_submit (c=0x806e5c0, callback=0x805c196
<connect_ipmi_obj>,
callback_arg=0x806fdf8) at ipmiconsole.c:229
#9 0x0805c316 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:402
#10 0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#11 0x0805bfd6 in open_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:309
#12 0x080538ae in reopen_obj (obj=0x806fdf8) at server-obj.c:266
#13 0x0804b95b in open_objs (conf=0x806d028) at server.c:699
#14 0x0804a745 in main (argc=2, argv=0xbffff794) at server.c:152
In this backtrace the same problem than with issue 1 occurs, the assert
fails, and prevents debugging... lets comment out the assert to see what
is happening, like in the compiled daemon.
To me, there is a first problem, connect_ipmi_obj (the callback
function) calls initiate_ipmi_connect a second time because the state
was left to DOWN before submitting it the ipmi engine, but that is
another story, that causes another bug that we are just about to
witness:
Now, I get Segfaults happening with too many socket pairs:
Program received signal SIGSEGV, Segmentation fault.
0xb7f7da0c in _debug (
fmt=0xbf801fe8 "(ipmiconsole_ctx.c, ipmiconsole_ctx_connection_setup, 442): socketpair: Too many open files", ap=0xbf801fb4 "") at ipmiconsole_debug.c:108
108 {
and backtraces with thousands of calls...
#0 0xb7f7da0c in _debug (
fmt=0xbf801fe8 "(ipmiconsole_ctx.c, ipmiconsole_ctx_connection_setup, 442): socketpair: Too many open files", ap=0xbf801fb4 "") at ipmiconsole_debug.c:108
#1 0xb7f7e1a1 in ipmiconsole_debug (
fmt=0xbf801fe8 "(ipmiconsole_ctx.c, ipmiconsole_ctx_connection_setup, 442): socketpair: Too many open files") at ipmiconsole_debug.c:198
#2 0xb7f78ba3 in ipmiconsole_ctx_connection_setup (c=0x1730e048) at
ipmiconsole_ctx.c:529
#3 0xb7f71450 in ipmiconsole_engine_submit (c=0x1730e048, callback=0x805c196
<connect_ipmi_obj>,
callback_arg=0x806fdf8) at ipmiconsole.c:191
#4 0x0805c316 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:403
#5 0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#6 0xb7f7bd64 in __ipmiconsole_ctx_connection_cleanup (c=0x1730db90,
session_submitted=<value optimized out>) at ipmiconsole_ctx.c:837
#7 0xb7f714d1 in ipmiconsole_engine_submit (c=0x1730db90, callback=0x805c196
<connect_ipmi_obj>,
callback_arg=0x806fdf8) at ipmiconsole.c:229
#8 0x0805c316 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:403
#9 0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#10 0xb7f7bd64 in __ipmiconsole_ctx_connection_cleanup (c=0x1730d6d8,
session_submitted=<value optimized out>) at ipmiconsole_ctx.c:837
#11 0xb7f714d1 in ipmiconsole_engine_submit (c=0x1730d6d8, callback=0x805c196
<connect_ipmi_obj>,
callback_arg=0x806fdf8) at ipmiconsole.c:229
...
As a quick fix, let's pass the state to CONMAN_IPMI_PENDING before
submitting to the engine, and now I do find the "correct" bug, which
raises an assert in libfreeipmi:
conmand: ipmiconsole_ctx.c:415: _ipmiconsole_ctx_connection_init:
Assertion `c->magic == 0x74AB8831' failed.
Program received signal SIGABRT, Aborted.
0xb7fe2424 in __kernel_vsyscall ()
(gdb) backtrace
#0 0xb7fe2424 in __kernel_vsyscall ()
#1 0xb7e49751 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0xb7e4cb82 in *__GI_abort () at abort.c:92
#3 0xb7e428b8 in *__GI___assert_fail (assertion=0xb7fa4025 "c->magic ==
0x74AB8831",
file=0xb7fa5fdc "ipmiconsole_ctx.c", line=415, function=0xb7fa63e0 "_ipmiconsole_ctx_connection_init")
at assert.c:81
#4 0xb7f773cc in _ipmiconsole_ctx_connection_init (c=0x6) at
ipmiconsole_ctx.c:415
#5 0xb7f7c0a2 in __ipmiconsole_ctx_connection_cleanup (c=0x806e5c0,
session_submitted=<value optimized out>) at ipmiconsole_ctx.c:948
#6 0xb7f714d1 in ipmiconsole_engine_submit (c=0x806e5c0, callback=0x805c196
<connect_ipmi_obj>,
callback_arg=0x806fdf8) at ipmiconsole.c:229
#7 0x0805c323 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:403
#8 0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#9 0x0805bfd6 in open_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:309
#10 0x080538ae in reopen_obj (obj=0x806fdf8) at server-obj.c:266
#11 0x0804b95b in open_objs (conf=0x806d028) at server.c:699
#12 0x0804a745 in main (argc=2, argv=0xbffff3e4) at server.c:152
Indeed, the context was destroyed a few moments before by a call to
fail_ipmi_connect:
#0 fail_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:509
#1 0x0805c276 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:381
#2 0xb7f7bd64 in __ipmiconsole_ctx_connection_cleanup (c=0x806e5c0,
session_submitted=<value optimized out>) at ipmiconsole_ctx.c:837
#3 0xb7f714d1 in ipmiconsole_engine_submit (c=0x806e5c0, callback=0x805c196
<connect_ipmi_obj>,
callback_arg=0x806fdf8) at ipmiconsole.c:229
#4 0x0805c323 in initiate_ipmi_connect (ipmi=0x806fdf8) at server-ipmi.c:403
#5 0x0805c221 in connect_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:370
#6 0x0805bfd6 in open_ipmi_obj (ipmi=0x806fdf8) at server-ipmi.c:309
#7 0x080538ae in reopen_obj (obj=0x806fdf8) at server-obj.c:266
#8 0x0804b95b in open_objs (conf=0x806d028) at server.c:699
#9 0x0804a745 in main (argc=2, argv=0xbffff3e4) at server.c:152
Without this assert, the program, without surprise, segfaults a few
instructions later.
Original comment by chris.m.dunlap
on 4 Mar 2011 at 1:51
Hi,
I'm sorry to have left this open for so long :( I have continued to work on
this bug on my side and came up with something that seems to fix the problem,
although I'm not 100% satisfied with the solution.
I attach the patch I have been using for the past few months, with a minor
change (I really scratched my head in order to come up with clear explanations
and re-did quite a few gdb sessions in order to reproduce the bug...). As I
explained to Chris at the time, I had this segfault when using conman +
freeipmi and non resolving hosts.
So, here comes the fun:
1. The first chunk is about a trivial error in text formatting, that is not
related to the issue at hand, but could still disrupt conmand in some (very
unlikely) cases.
2. The second chunk fixes the loop between initiate_ipmi_connect and
connect_ipmi_obj. By changing the state to PENDING, before submitting to the
engine, we ensure that the second call to connect_impi_obj will not be the same.
3. Now, to fix the issue raised in the analysis I submitted in the post quoted
by Chris just above, I had to:
a. remove the context destruction in fail_ipmi_connect (chunk #5)
b. remove the assert in create_ipmi_ctx and replace it by a test and a context creation if the context exists (chunk #3)
With this modification, the context still exists when returning to
_ipmiconsole_ctx_connection_init and everything goes fine.
4. However, one last thing remains to be done, and this is the thing that does
not satisfy me completely. In chunk #4, I add a test at the beginning of
fail_ipmi_connect, to return without doing anything if ipmi->aux.ipmi.timer is
not equal to -1 and if the state is DOWN. In fact, when analyzing the stack
trace (see above), one can see that connect_ipmi_obj gets called twice:
once by open_ipmi_obj, and once, upon callback by the engine, by
__ipmiconsole_ctx_connection_cleanup.
Upon failing, fail_ipmi_connect is called. The first time (exiting from the
callback) everything normal is happening, and a call to connect_ipmi_obj is
rescheduled. And just after that, fail_ipmi_connect is called again, from the
very first connect_ipmi_obj. This second time, we get a failed assert, as
ipmi->aux.ipmi.timer is not equal to -1. Moreover, we really do not have
anything to do as we've just done everything the moment before. Hence the test
and the return, at the beginning of this function.
If we do nothing, we get log messages twice, two calls to
tpoll_timeout_relative and a double fold increase of the delay.
To me, the root of the problem here is that connect_ipmi_obj is called as a
callback to itself, and maybe the lack of states in the automaton makes it
difficult to differentiate between the causes of the errors.
The way I implemented a solution to point 4 is working, although, to say the
least, I am not sure it is the best way it could be implemented. I would be
more than happy to get your opinion on that point, and I'm really open to
discussion.
Please find the patch attached, and do not hesitate to comment !
Matthieu.
Original comment by matthieu...@bull.net
on 8 Mar 2011 at 4:17
Attachments:
Matthieu,
I retried conman using your patch and it appears to be working fine.
Thanks!
Cameron
Original comment by cameron....@gmail.com
on 9 Mar 2011 at 12:34
Hmm, perhaps I spoke too soon. I just noticed that after I leave the console, I
can't seem to get back in. I'm using agetty in the inittab file and once I've
left the console, the agetty process no longer shows up and an 'init q' doesn't
spawn it.
Original comment by cameron....@gmail.com
on 9 Mar 2011 at 12:42
Hey Matthieu,
I think most of this problem is attributable to concurrent access to a given
ipmi_obj struct from the main thread and the ipmiconsole engine thread via
connect_ipmi_obj(). Once ipmiconsole_engine_submit() is called, the
ipmiconsole engine will start running the connect_ipmi_obj() callback before
control is returned to initiate_ipmi_connect() in the main thread. I've added
a mutex to serialize access to key variables (ctx, state, timer, & delay)
within the ipmi_obj struct (r1019).
That mutex will protect the transition to the PENDING state (in #2). I've kept
the state transition until after ipmiconsole_engine_submit() in case the
submission fails (at which point it remains in the DOWN state).
Re: #4, the failed assert on ipmi->aux.ipmi.timer should no longer be possible
with the mutex. And I don't think multiple calls to fail_ipmi_connect() for
the same connection can still occur.
Re: #3, Al says invoking ipmiconsole_ctx_destroy() from within the callback
should be ok. But destroying the ctx immediately before recreating it seems
safer, so I've made those changes (r1017 & r1018).
Re: #1, yeah, oops. Also, I've changed it to a non-fatal error which will
retry the connection (r1013 & r1014) -- although it should be impossible to
trigger now unless new states are added.
I've committed these changes to
<https://conman.googlecode.com/svn/branches/issue-7>. I'm hoping to do some
testing at scale tomorrow. Let me know if you have a chance to test this.
Original comment by chris.m.dunlap
on 9 Mar 2011 at 4:07
Hi Chris,
I had a quick test run on the issue 7 branch this morning, and witnessed a
double lock of the same mutex.
You will find attached a debugging output.
I've set breakpoints on every functions of server-ipmi.c and every call to a
x_pthread_mutex* function.
conmand is configured here with a single console, with an invalid bmc, which
does not resolve.
As noted before, connect_ipmi_obj is called twice, there is two calls to
x_pthread_mutex_lock, and the program remains locked.
In fact, in this testcase of a failure, a single thread is spawned and the two
calls to connect_ipmi_obj are done within the same thread, hence the deadlock
when a connection fails.
The solution you implemented with a mutex would work if all the calls happened
in different threads, but unfortunately, it does not seem to always be the case.
Original comment by matthieu...@bull.net
on 9 Mar 2011 at 4:00
Attachments:
The test for a non-resolving host here makes for a simple reproducer. Thanks!
ipmiconsole_engine_submit() is behaving in a way I didn't expect.
Original comment by chris.m.dunlap
on 9 Mar 2011 at 9:18
Al made a fix to libipmiconsole yesterday that addresses the conmand deadlock
with the non-resolving host:
* libipmiconsole/src/ipmiconsole_ctx.c
(__ipmiconsole_ctx_connection_cleanup): Only call the callback if
an error occurred during a session submission, not if an error
occurred in API land.
This change is available on the freeipmi trunk:
svn://svn.savannah.gnu.org/freeipmi/trunk
I'm currently testing these changes on a configuration with 1303 consoles, 116
of which are using IPMI. No problems so far.
Original comment by chris.m.dunlap
on 10 Mar 2011 at 10:41
So far the latest fixes seem to work flawlessly for me. My only problem now is
a large number of flaky Dell iDRACs.
Thanks all!
Original comment by cameron....@gmail.com
on 15 Mar 2011 at 12:18
Good to hear things are working. I was talking to Al earlier today and our
current plan is to try and have new releases out by the end of the month.
Original comment by chris.m.dunlap
on 15 Mar 2011 at 1:00
This issue was updated by revision r1022.
Requires freeipmi-1.0.3 or later.
Original comment by chris.m.dunlap
on 4 Apr 2011 at 11:32
Original issue reported on code.google.com by
cameron....@gmail.com
on 24 Feb 2011 at 5:46Attachments: