ros / ros_comm

ROS communications-related packages, including core client libraries (roscpp, rospy, roslisp) and graph introspection tools (rostopic, rosnode, rosservice, rosparam).
http://wiki.ros.org/ros_comm
762 stars 911 forks source link

boost::recursive_mutex assert fails in ros::Connection::drop() when Connection drops #1989

Open meng-atomic opened 4 years ago

meng-atomic commented 4 years ago

ROS Version

kinetic-devel tag: 1.12.13

Core Stack

0 0x00007f2d7c657428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54

1 0x00007f2d7c65902a in __GI_abort () at abort.c:89

2 0x00007f2d7c64fbd7 in __assert_fail_base (fmt=, assertion=assertion@entry=0x7c8095 "!pthread_mutex_lock(&m)",

file=file@entry=0x7c60d0 "/usr/include/boost/thread/pthread/recursive_mutex.hpp", line=line@entry=113,
function=function@entry=0x7cbdc0 <boost::recursive_mutex::lock()::__PRETTY_FUNCTION__> "void boost::recursive_mutex::lock()")
at assert.c:92

3 0x00007f2d7c64fc82 in __GI___assert_fail (assertion=assertion@entry=0x7c8095 "!pthread_mutex_lock(&m)",

file=file@entry=0x7c60d0 "/usr/include/boost/thread/pthread/recursive_mutex.hpp", line=line@entry=113,
function=function@entry=0x7cbdc0 <boost::recursive_mutex::lock()::__PRETTY_FUNCTION__> "void boost::recursive_mutex::lock()")
at assert.c:101

4 0x00000000005e1b7e in boost::recursive_mutex::lock (this=)

at /usr/include/boost/thread/pthread/recursive_mutex.hpp:113

5 boost::unique_lock::lock (this=0x7f2d74663270) at /usr/include/boost/thread/lock_types.hpp:346

6 0x00007f2d80e93085 in ros::Connection::drop(ros::Connection::DropReason) () from /opt/ros/kinetic/lib/libroscpp.so

7 0x00007f2d80f0e9b5 in ros::TransportPublisherLink::drop() () from /opt/ros/kinetic/lib/libroscpp.so

8 0x00007f2d80f365f1 in ros::Subscription::pubUpdate(std::vector<std::cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) ()

from /opt/ros/kinetic/lib/libroscpp.so

9 0x00007f2d80ec0193 in ros::TopicManager::pubUpdate(std::cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::vector<std::cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) () from /opt/ros/kinetic/lib/libroscpp.so

10 0x00007f2d80ec0676 in ros::TopicManager::pubUpdateCallback(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) ()

from /opt/ros/kinetic/lib/libroscpp.so

11 0x00007f2d80eb38ec in ros::XMLRPCCallWrapper::execute(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) ()

from /opt/ros/kinetic/lib/libroscpp.so

12 0x00007f2d7ad3ae4f in XmlRpc::XmlRpcServerConnection::executeMethod(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) () from /opt/ros/kinetic/lib/libxmlrpcpp.so

13 0x00007f2d7ad3d5e4 in XmlRpc::XmlRpcServerConnection::executeRequest() () from /opt/ros/kinetic/lib/libxmlrpcpp.so

14 0x00007f2d7ad3aaa8 in XmlRpc::XmlRpcServerConnection::writeResponse() () from /opt/ros/kinetic/lib/libxmlrpcpp.so

15 0x00007f2d7ad3ac60 in XmlRpc::XmlRpcServerConnection::handleEvent(unsigned int) () from /opt/ros/kinetic/lib/libxmlrpcpp.so

16 0x00007f2d7ad38a6f in XmlRpc::XmlRpcDispatch::work(double) () from /opt/ros/kinetic/lib/libxmlrpcpp.so

17 0x00007f2d80eafe3e in ros::XMLRPCManager::serverThreadFunc() () from /opt/ros/kinetic/lib/libroscpp.so

Crash Site

https://github.com/ros/ros_comm/blob/53ab1921217b9963b45cf64979054780028cac86/clients/roscpp/src/libros/connection.cpp#L330

Code Analysis

When the scoped_lock tries to lock the dropmutex, the assertion happened. Something must be wrong with the mutex. Since the dropmutex is a member of class Connection, it makes sense to infer that the dropmutex has already been destructed with the destruction of an object of class Connection.

Three threads relate to this problem:

  1. A thread provided by PollManager(Thread P)
  2. A thread provided by XMLRpcManager(Thread X)
  3. A thread provided by TransportPublisherLink(Thread T, the timer thread)

Follow the call stack:
https://github.com/ros/ros_comm/blob/53ab1921217b9963b45cf64979054780028cac86/clients/roscpp/src/libros/transport_publisher_link.cpp#L109-L118 In Thread X, the address of connection_ was passed to Connection::drop(). During the call to Connection::drop(), Thread X was suspended. Thread P removed the dropped connection, Thread T created a new Connection object and replace the old connection with the newly created one:
https://github.com/ros/ros_comm/blob/53ab1921217b9963b45cf64979054780028cac86/clients/roscpp/src/libros/transport_publisher_link.cpp#L229-L236 When Thread X resumed, the object that Connection::drop() used as this pointer had already been destructed. So dropmutex no longer existed. The program crased.

Fix Discussion

Option 1: Data racing exists in TransportPublisherLink::connection_(a shared_pointer to Connection instance), so it should be protected by a lock. Option 2: In TransportPublisherLink::drop(), create a tmp variable on stack to hold the connection instance so that it will not be destructed before Connection::drop() returns.

fujitatomoya commented 4 years ago

@meng-atomic

thanks for the insights!

one question which version of ROS did you meet this core stack? like noetic or melodic? and can you reproduce this issue with SSCCE?

this could be related to the followings, these have been addressed with Noetic release.

and in your theory, the problem could happen at any stack with using Connection object, right?

meng-atomic commented 4 years ago

@fujitatomoya I've reproduced the issue to make it short and clear. #1949 #1939 and this issue are all caused by the drop of a connection, but crased in different sites for different reasons. TransportPublisherLink::connection_ is not thread safe, so, yes , the problem could happen with using Connection object.

fujitatomoya commented 4 years ago

@meng-atomic

I've reproduced the issue to make it short and clear.

is that possible if you could share that test sample with us?