Open meng-atomic opened 4 years ago
@meng-atomic
thanks for the insights!
one question which version of ROS did you meet this core stack? like noetic or melodic? and can you reproduce this issue with SSCCE?
this could be related to the followings, these have been addressed with Noetic release.
and in your theory, the problem could happen at any stack with using Connection object, right?
@fujitatomoya I've reproduced the issue to make it short and clear. #1949 #1939 and this issue are all caused by the drop of a connection, but crased in different sites for different reasons. TransportPublisherLink::connection_ is not thread safe, so, yes , the problem could happen with using Connection object.
@meng-atomic
I've reproduced the issue to make it short and clear.
is that possible if you could share that test sample with us?
ROS Version
kinetic-devel tag: 1.12.13
Core Stack
0 0x00007f2d7c657428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
1 0x00007f2d7c65902a in __GI_abort () at abort.c:89
2 0x00007f2d7c64fbd7 in __assert_fail_base (fmt=, assertion=assertion@entry=0x7c8095 "!pthread_mutex_lock(&m)",
3 0x00007f2d7c64fc82 in __GI___assert_fail (assertion=assertion@entry=0x7c8095 "!pthread_mutex_lock(&m)",
4 0x00000000005e1b7e in boost::recursive_mutex::lock (this=)
5 boost::unique_lock::lock (this=0x7f2d74663270) at /usr/include/boost/thread/lock_types.hpp:346
6 0x00007f2d80e93085 in ros::Connection::drop(ros::Connection::DropReason) () from /opt/ros/kinetic/lib/libroscpp.so
7 0x00007f2d80f0e9b5 in ros::TransportPublisherLink::drop() () from /opt/ros/kinetic/lib/libroscpp.so
8 0x00007f2d80f365f1 in ros::Subscription::pubUpdate(std::vector<std::cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std:: cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) ()
from /opt/ros/kinetic/lib/libroscpp.so
9 0x00007f2d80ec0193 in ros::TopicManager::pubUpdate(std::cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::vector<std:: cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) () from /opt/ros/kinetic/lib/libroscpp.so
10 0x00007f2d80ec0676 in ros::TopicManager::pubUpdateCallback(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) ()
from /opt/ros/kinetic/lib/libroscpp.so
11 0x00007f2d80eb38ec in ros::XMLRPCCallWrapper::execute(XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) ()
from /opt/ros/kinetic/lib/libroscpp.so
12 0x00007f2d7ad3ae4f in XmlRpc::XmlRpcServerConnection::executeMethod(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, XmlRpc::XmlRpcValue&, XmlRpc::XmlRpcValue&) () from /opt/ros/kinetic/lib/libxmlrpcpp.so
13 0x00007f2d7ad3d5e4 in XmlRpc::XmlRpcServerConnection::executeRequest() () from /opt/ros/kinetic/lib/libxmlrpcpp.so
14 0x00007f2d7ad3aaa8 in XmlRpc::XmlRpcServerConnection::writeResponse() () from /opt/ros/kinetic/lib/libxmlrpcpp.so
15 0x00007f2d7ad3ac60 in XmlRpc::XmlRpcServerConnection::handleEvent(unsigned int) () from /opt/ros/kinetic/lib/libxmlrpcpp.so
16 0x00007f2d7ad38a6f in XmlRpc::XmlRpcDispatch::work(double) () from /opt/ros/kinetic/lib/libxmlrpcpp.so
17 0x00007f2d80eafe3e in ros::XMLRPCManager::serverThreadFunc() () from /opt/ros/kinetic/lib/libroscpp.so
Crash Site
https://github.com/ros/ros_comm/blob/53ab1921217b9963b45cf64979054780028cac86/clients/roscpp/src/libros/connection.cpp#L330
Code Analysis
When the scoped_lock tries to lock the dropmutex, the assertion happened. Something must be wrong with the mutex. Since the dropmutex is a member of class Connection, it makes sense to infer that the dropmutex has already been destructed with the destruction of an object of class Connection.
Three threads relate to this problem:
Follow the call stack:
https://github.com/ros/ros_comm/blob/53ab1921217b9963b45cf64979054780028cac86/clients/roscpp/src/libros/transport_publisher_link.cpp#L109-L118 In Thread X, the address of connection_ was passed to Connection::drop(). During the call to Connection::drop(), Thread X was suspended. Thread P removed the dropped connection, Thread T created a new Connection object and replace the old connection with the newly created one:
https://github.com/ros/ros_comm/blob/53ab1921217b9963b45cf64979054780028cac86/clients/roscpp/src/libros/transport_publisher_link.cpp#L229-L236 When Thread X resumed, the object that Connection::drop() used as this pointer had already been destructed. So dropmutex no longer existed. The program crased.
Fix Discussion
Option 1: Data racing exists in TransportPublisherLink::connection_(a shared_pointer to Connection instance), so it should be protected by a lock. Option 2: In TransportPublisherLink::drop(), create a tmp variable on stack to hold the connection instance so that it will not be destructed before Connection::drop() returns.