stwirth closed this issue 8 years ago.
+1 on this issue. I'm investigating a core with a very similar stack trace. It's incredibly difficult to reproduce, but I've managed to get two instances of the dump now:
Program terminated with signal 11, Segmentation fault.
(gdb) bt full
#0 atomic_conditional_increment (pw=0x3241) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:92
rv = <optimized out>
#1 add_ref_lock (this=0x3239) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:139
No locals.
#2 shared_count (r=..., this=0x7f37ab7fdb18) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:589
No locals.
#3 boost::shared_ptr<ros::Subscription>::shared_ptr<ros::Subscription> (this=0x7f37ab7fdb10, r=...) at /usr/include/boost/smart_ptr/shared_ptr.hpp:414
No locals.
#4 0x00007f37b900b7b0 in lock (this=0x7f376c0275b8) at /usr/include/boost/smart_ptr/weak_ptr.hpp:178
No locals.
#5 ros::TransportPublisherLink::onRetryTimer (this=0x7f376c0275a0) at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/transport_publisher_link.cpp:204
parent = {
px = 0x0,
pn = {
pi_ = 0x3239
}
}
__PRETTY_FUNCTION__ = "void ros::TransportPublisherLink::onRetryTimer(const ros::WallTimerEvent&)"
#6 0x00007f37b8fc1e56 in operator() (a0=..., this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
No locals.
#7 ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call (this=0x7f371c000eb0)
at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/include/ros/timer_manager.h:184
tracked = {
px = 0x0,
pn = {
pi_ = 0x0
}
}
event = {
last_expected = {
<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
sec = 1430265753,
nsec = 727748844
}, <No data fields>},
last_real = {
<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
sec = 1430265753,
nsec = 727984536
}, <No data fields>},
current_expected = {
<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
sec = 1430265753,
nsec = 827748844
}, <No data fields>},
current_real = {
<ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
sec = 1430265753,
nsec = 828038627
}, <No data fields>},
profile = {
last_duration = {
<ros::DurationBase<ros::WallDuration>> = {
sec = 0,
nsec = 424
}, <No data fields>}
}
}
#8 0x00007f37b8fea4ef in ros::CallbackQueue::callOneCB (this=this@entry=0x24eb900, tls=tls@entry=0x7f37a40008c0)
at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/callback_queue.cpp:382
rw_lock = {
m = 0x7f371c0014f8,
is_locked = true
}
last_calling = 18446744073709551615
result = ros::CallbackInterface::Invalid
info = {
callback = {
px = 0x7f371c000eb0,
pn = {
pi_ = 0x7f371c000910
}
},
removal_id = 139877086151072,
marked_for_removal = false
}
cb = @0x7f37ab7fdd00: {
px = 0x7f371c000eb0,
pn = {
pi_ = 0x7f371c000910
}
}
id_info = {
px = 0x7f371c0014f0,
pn = {
pi_ = 0x7f371c000a80
}
}
#9 0x00007f37b8fea843 in ros::CallbackQueue::callAvailable (this=0x24eb900, timeout=...)
at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/callback_queue.cpp:333
tls = <optimized out>
called = 0
#10 0x00007f37b90162cc in ros::internalCallbackQueueThreadFunc () at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/init.cpp:282
queue = {
px = 0x24eb900,
pn = {
pi_ = 0x24e7cd0
}
}
#11 0x00007f37bb4b594d in boost::(anonymous namespace)::thread_proxy (param=<optimized out>) at libs/thread/src/pthread/thread.cpp:164
thread_info = {
px = 0x24eeac0,
pn = {
pi_ = 0x24eb860
}
}
#12 0x00007f37b8ce41da in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x00007f37b6e11d7d in clone () from /lib64/libc.so.6
No symbol table info available.
The only useful piece of information that I can provide in reproducing the dump is that it happens (sometimes) as I kill a node that published / subscribed to the same topics that the dumping node subscribed / published to respectively.
Poking around in frame 8 (died @ line 382 of /src/libros/callback_queue.cpp), I see that the callback in the TLS is not being `marked_for_removal`, which causes the `cb->call()` to get executed on a subscription that has already been destroyed:
│362 CallbackInfo info = *tls->cb_it;
...
│375 if (info.marked_for_removal)
│376 {
│377 tls->cb_it = tls->callbacks.erase(tls->cb_it);
│378 }
│379 else
│380 {
│381 tls->cb_it = tls->callbacks.erase(tls->cb_it);
>│382 result = cb->call();
│383 }
Should / how would this callback be marked for removal?
(gdb) print info
$50 = {
callback = {
px = 0x7f371c000eb0,
pn = {
pi_ = 0x7f371c000910
}
},
removal_id = 139877086151072,
marked_for_removal = false
}
@dirk-thomas & @wjwwood, have you seen any instances of this segfault at OSRF? I now have several instances from various ROS nodes, mostly on disconnect, but the core dumps are frustratingly hard to debug past a certain point (all with traces similar to the one shown above).
I'm trying to distill this crash down into a simple, reproducible example, so any thoughts or tips would be greatly appreciated.
@rethink-imcmahon I have not. I'll try to spend some time today or tomorrow looking at it.
Ok, I still don't have a nice reproducible example, but I think I have uncovered the issue. The dumps always occur in this block of offending code: https://github.com/ros/ros_comm/blob/indigo-devel/clients/roscpp/src/libros/callback_queue.cpp#L362-L383
Inside `CallbackQueue::callOneCB()` there is a very important `info` variable:
362 | CallbackInfo info = *tls->cb_it;
...
368 | boost::shared_lock<boost::shared_mutex> rw_lock(id_info->calling_rw_mutex);
...
375 | if (info.marked_for_removal)
376 | {
377 | tls->cb_it = tls->callbacks.erase(tls->cb_it);
378 | }
379 | else
380 | {
381 | tls->cb_it = tls->callbacks.erase(tls->cb_it);
382 | result = cb->call();
383 | }
What seems odd here is that we are not referencing the original `*tls->cb_it` object, but making a copy of it! To make a proper reference, we need to add an `&`:
362 | CallbackInfo& info = *tls->cb_it;
Furthermore, if you look at another very similar block of code in the same file, you can see the reference is done properly for a `CallbackInfo` object:
https://github.com/ros/ros_comm/blob/indigo-devel/clients/roscpp/src/libros/callback_queue.cpp#L248-L261
248 | CallbackInfo& info = *it;
249 |
250 | if (info.marked_for_removal)
251 | {
252 | it = callbacks_.erase(it);
253 | continue;
254 | }
Copy vs. reference? It really shouldn't matter, should it? Well, in the L362-383 block it does, since there is a `shared_mutex` lock in between the copy and the `if (info.marked_for_removal)` check. If the conditions are just right, by the time the mutex unlocks, the `info` copy is stale and still `false`, despite the fact that the original object's `marked_for_removal` field has been set to `true`. That would cause the `result = cb->call()` to execute on an invalid callback and coredump.
I'm wide open for critiques of this theory & ideas on a reliable method for reproducing this crash...
`ros_comm` built fine with the changes that I proposed. Perhaps I'm just lucky, but I managed to get two coredumps in a row with modified talker / subscriber `roscpp_tutorial` nodes. Trying to reproduce the issue (and gain more insight into the original `*tls->cb_it` object), I added a couple of copies to bring them into the local scope, then rebuilt. Now the nodes are spinning up and down continuously every few seconds, so I'll post if I get another useful coredump.
I have created PR #615 to add the missing reference. Even if it is unclear whether it helps solve this problem, not copying the callback info at that point is the "right" thing to do. (Update: as trivial as the patch looks, it doesn't seem to pass all unit tests.)
Can you please share some information about the scenario in which you get the crashes with the roscpp_tutorial talker / listener? Then others might be able to reproduce it the same way.
I want to summarize my thoughts from reading through the code - maybe it helps others to trace the flow faster.
The callback iterators in the thread local storage (TLS) are the ones that have already been popped off the queue, to be called within a specific thread without requiring locks on the queue itself.
There are four lines which use `marked_for_removal` besides the constructor of the info object:
The second one checks the flag for callbacks in the queue and therefore does so with a locked `mutex_`.
The third and fourth check the flag for a callback in the TLS and therefore do so while holding a lock on the `calling_rw_mutex`. The invocations of `callOneCB` happen from `callAvailable`, which picks callbacks from the queue and puts them into the thread local storage (while holding `mutex_`). The invocation of each callback in the TLS is then done mutex-free (which is the idea of the TLS).
The first one actually marks all pending callbacks in the TLS for removal. The surrounding `removeByID` method holds no locks while setting these flags. This method is invoked from several places whenever callbacks are removed / deregistered. This could happen asynchronously from a different thread as well as from within another callback.
But since the TLS contains only thread-local callbacks, the block changing the flags (https://github.com/ros/ros_comm/blob/ea9ca68379175d977f8f72639cf9f067953b9e6d/clients/roscpp/src/libros/callback_queue.cpp#L193-L204) can only be reached in case of an invocation from within another callback. Otherwise the TLS should not contain anything. Therefore I think it is perfectly safe to do this without additional locking.
Ok, thank you for the clarifications Dirk. I'm still trying to wrap my head around some of it, but the explanation is certainly helpful. Just brainstorming ways to debug this, I was thinking of adding an artificial wait around the lock to simulate a starved thread. I'll let you know if this line of thought turns up anything.
As for the talker / listener, I'll commit them to my fork and put a link here. They did segfault overnight but in another piece of initialization code. I'll look into that later, but for now this is still my focus.
What are the unit tests checking for exactly?
Are you referring to the unit tests failing for #615? I don't know what they are checking exactly. Will have to look closer into them but it seems they are failing due to the added reference.
@dirk-thomas or @stwirth, do you have any updates on this bug? I'd be curious to know why #615 unit tests were failing, and how we can fix up this patch
@rethink-imcmahon I don't have updates. The bug never appeared again so I don't spend time on it.
Also no updates from me. I haven't spent any more time on this since the last comment. I don't expect to have time for looking further into it and therefore tried to summarize my investigation as good as possible to enable others to continue looking into it.
I've observed this crash but I don't have anything meaningful to add:
Core was generated by `/home/administrator/robot_ws/devel/lib/move_base/move_base map:=slam/map cmd_ve'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007ff4450347a3 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
(gdb) bt
#0 0x00007ff4450347a3 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
#1 0x00007ff444fe7b30 in ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call() () from /opt/ros/indigo/lib/libroscpp.so
#2 0x00007ff445010777 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/indigo/lib/libroscpp.so
#3 0x00007ff445011583 in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/indigo/lib/libroscpp.so
#4 0x00007ff44503d8f4 in ros::internalCallbackQueueThreadFunc() () from /opt/ros/indigo/lib/libroscpp.so
#5 0x00007ff442acfa4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#6 0x00007ff4428ae182 in start_thread (arg=0x7ff43b37e700) at pthread_create.c:312
#7 0x00007ff44433a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
We have many virtual machines running a wide variety of ROS tasks nightly, so this issue is causing a handful of coredumps every night. In trying to see why the fix I proposed in May didn't work, I added the "reference-instead-of-copy" patch locally. Eventually, I managed to make it core dump:
(gdb) bt full
#0 0x00007f0e3c5afebc in ros::CallbackQueue::callOneCB (this=0x2129460, tls=0x7f0e280008c0)
at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:382
rw_lock = {m = 0x7f0ddc000938, is_locked = true}
last_calling = 18446744073709551615
result = ros::CallbackInterface::Invalid
__PRETTY_FUNCTION__ = "ros::CallbackQueue::CallOneResult ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*)"
info = @0x7f0e28001910: {callback = {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}},
removal_id = 139698911842016, marked_for_removal = false}
cb = @0x7f0e28001910: {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}
id_info = {px = 0x7f0ddc000930, pn = {pi_ = 0x7f0ddc000a80}}
#1 0x00007f0e3c5af921 in ros::CallbackQueue::callAvailable (this=0x2129460, timeout=...)
at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:333
tls = 0x7f0e280008c0
called = 0
#2 0x00007f0e3c5f5160 in ros::internalCallbackQueueThreadFunc ()
at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/init.cpp:282
queue = {px = 0x2129460, pn = {pi_ = 0x21289c0}}
#3 0x00007f0e3c6006c9 in boost::detail::thread_data<void (*)()>::run (this=0x2129cb0)
at /usr/include/boost/thread/detail/thread.hpp:117
No locals.
#4 0x00007f0e3ab79a4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
No symbol table info available.
#5 0x00007f0e3a958182 in start_thread (arg=0x7f0e34b86700) at pthread_create.c:312
__res = <optimized out>
pd = 0x7f0e34b86700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139698990769920, -5890305797539832903, 0, 0, 139698990770624,
139698990769920, 5790792828027591609, 5790814652555123641}, mask_was_saved = 0}}, priv = {pad = {0x0,
0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#6 0x00007f0e3b6f347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) frame 0
#0 0x00007f0e3c5afebc in ros::CallbackQueue::callOneCB (this=0x2129460, tls=0x7f0e280008c0)
at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:382
382 result = cb->call();
(gdb) print cb
$1 = (ros::CallbackInterfacePtr &) @0x7f0e28001910: {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}
This coredump was quite hard to produce, but after hundreds of subscribers were spun up / SIGINT'd, I managed to get what you see here. So it seems that before, when we were copying `CallbackInfo` rather than grabbing a reference to it, we were seeing partial execution of the callback (`onRetryTimer`). Now, it appears the underlying `result = cb->call()` itself is no longer valid. I think this has to do with the pointer to the subscriber being destroyed, causing the dereference to fail.
Before this patch, when `callback_queue` held a copy of `CallbackInfo info`, we could last a little longer and end up in more dangerous territory by trying to utilize a destroyed `weak_ptr`. According to the documentation, weak_ptr operations never throw exceptions, so `roscpp` must have been doing something wrong to cause the weak_ptr to fail down in the assembly code, as we see in the original stack trace. I looked at some plausible causes for this: threads being destroyed while references to their callbacks still exist, or possibly this - the vector of subscriptions being swapped out during connection dropping, leaving a null pointer to the subscription vector in its place:
https://github.com/ros/ros_comm/blob/ace5b9c11af766564441bad2cb08f5fca5b73784/clients/roscpp/src/libros/subscription.cpp#L158-L176
Despite the new core dump, I think I still prefer having the reference for `CallbackInfo` rather than the direct copy, since the copy led us into a weird assembly-code state, and the reference gives us a null `cb->call()` faster (which can be handled appropriately). I do feel like I am dancing around the underlying issue, so if anyone can give a more satisfying explanation of these errors, I'd be happy to hear it.
The head of our QA department sent me some stats on this bug today. We run an extensive suite of tests on a nightly basis with many virtual robots in the cloud. In just the virtual robots (not counting the instances of this bug that we find on our real hardware) he finds in the past month:
Total count of core dumps since 7/19/15: 56
Total count of test runs: 59
Total count of test suites per run: 40
On average, each suite takes 36 min.
So we saw 56 instances of the roscpp bug in a total of 2360 freshly provisioned robots, each of them running heavily for 36 min.
It seems every ROS subscriber is susceptible to the crash, and the nodes that create-destroy subscriptions regularly are the worst offenders. I am still actively hunting this bug down, but I would welcome any help from others in the community. @wjwwood @dirk-thomas @jacquelinekay any thoughts on how to raise awareness on this issue in the community?
@rethink-imcmahon I'd like to have a crack at it, but realistically I'm going to be too busy for a few weeks. Maybe you can just send an email to ros-users@ explaining that several people have taken some time to look at the problem but it eludes us. Someone might take it as a challenge and find a problem that we have overlooked.
@rethink-imcmahon here are a few ideas for creating a test to reliably reproduce this crash (since it sounds like it happens randomly):
The fact that it's difficult to reproduce but happens more frequently when there are many subscribers makes me wonder if it happens when there's a scheduling anomaly. Can you try adding a sleep right before line 382 of callback_queue.cpp?
Or, to confirm or deny your theory about the subscriptions dropping, you could make a test where one thread spins doing some pub/sub work, and another thread causes all subscriptions to drop.
I made a little progress on this today. @jacquelinekay, thanks for the tip. I added the line "usleep(100);" before line 382 of callback_queue.cpp and that did help to reproduce the bug more often. I also modified a simple test by @rethink-imcmahon as follows and I can reproduce the bug about every minute:
#include "ros/ros.h"
#include "std_msgs/String.h"
#include <sstream>
#include <iostream>   // std::cout
#include <vector>     // std::vector
#include <unistd.h>   // usleep
#include <boost/thread.hpp>

void chatterCallback(const std_msgs::String::ConstPtr& msg)
{
  ROS_INFO("I heard: [%s]", msg->data.c_str());
}

void subscribeDestroy()
{
  ros::Time start_time;
  double rollover_time = 5.0; // seconds (unused in this variant)
  size_t num_subs = 100;
  std_msgs::String msg;
  msg.data = std::string("hello world");
  ros::NodeHandle n;
  while (ros::ok())
  {
    start_time = ros::Time::now();
    std::vector<ros::Subscriber> subs;
    for (size_t i = 0; i < num_subs; i++)
    {
      subs.push_back(n.subscribe("chatter", 100, chatterCallback));
    }
    usleep(500000);
    // subscribers' destructors are called here, as subs goes out of scope
    std::cout << "Rolling over! Calling dtors now. "
              << "Total time: " << (ros::Time::now() - start_time).toSec() << std::endl;
  }
}

int main(int argc, char **argv)
{
  ros::init(argc, argv, "listener");
  boost::thread first(subscribeDestroy);
  ros::spin();
  return 0;
}
I've started working on solutions to the problem, but no big revelations yet.
@dirk-thomas I think it's safe to close this issue now that #677 is merged. Thanks!
Awesome @rethink-imcmahon! Thanks for all the effort!
Is there an ETA on when this fix will transition to the public debs? We're having lots of problems with this on the VMs here at ClearPath.
The changes in shadow-fixed have been announced six days ago (http://lists.ros.org/pipermail/ros-release/2015-October/004708.html) and we are waiting for people to test them thoroughly during that soak period. Probably within a week from now it will be synced to public.
We've put the shadow build of ros_comm 1.11.15-0 on our internal apt servers here at ClearPath and are running it without incident.
Thanks for reporting back.
We will sync it once the shadow-fixed repo has no regressions. Currently 1.11.16 is being built for a different reason.
We witnessed a crash of diagnostic_agregator/aggregator_node on one of our robots; that is what I could find:
Unfortunately that happened only once, on one robot, and there is no way to reproduce it. Has anybody seen something like that, or can anyone give me a hint on how to debug this further?