ros / ros_comm

ROS communications-related packages, including core client libraries (roscpp, rospy, roslisp) and graph introspection tools (rostopic, rosnode, rosservice, rosparam).
http://wiki.ros.org/ros_comm
752 stars 911 forks source link

Statistics deadlock fix #2246

Open JesseAtTorc opened 2 years ago

JesseAtTorc commented 2 years ago

Credit to Jacob White @ Torc Robotics for the thorough research on development of this fix.

ROS_COMM Problem and Solution

There are two deadlocks that can occur when the rosparam enable_statistics is set to true. Both stem from this block of code from statistics.cpp in the ros_comm package (line 230):

    if (!pub_.getTopic().length())
    {
      ros::NodeHandle n("~");
      // creating the publisher in the constructor results in a deadlock. so do it here.
      pub_ = n.advertise<rosgraph_msgs::TopicStatistics>("/statistics", 1);
    }    

    pub_.publish(msg)

Note: The comment describing the publisher deadlock is original ros content.

Deadlock 1: During the advertisement of the /statistics msg, a deadlock can occur if the same topic is adveritised twice. The StatisticsLogger calls the advertise function after a second of statistics data is accumulated on the topic of interest. Thus in order to produce the deadlock, we must be in the middle of subscribing to something exactly 1 second after the 1st subscriber was created. A deadlock occurs between callbacksmutex in Subscription class and the subsmutex in the TopicManager class. Stack trace with mutex ownership described below. This will occur only on startup during the second subscription to the same topic. In our case this was the tf topic. Because the statistics logger requests 5 ros parameters, and init is called every time there is a new subscription, the network delay becomes a factor and is required to demonstrate this.

Thread 1

Thread 2

Deadlock 2: The second deadlock occurs because there is a deadlock between 4 separate threads running that become locked due to a direct call to publish within subscription.cpp via StatisticsLogger::publish. This happens on the thread within PollManager which is listening for incoming subscribed topics. When Subscription::handleMessage is called, the callbacksmutex is locked which prevents call to StatisticsLogger::publish from creating the publisher for /statistics. This also occurs on startup. The topic that appears to drive this is the bond topic that occurs between nodelet components.

A summary of the lock is provided below.

Threads are waiting on these mutexes: Thread 1: Bond::isBroken()

Thread 2: TopicManager::publish

Thread 3: TransportTCP::close()

Thread 7: Subscription::getPublishTypes((bool&, bool&, std::type_info const&)

Solution:

Based on Jacob White's suggestion, another class was created to handle the accumulation of /statistics msg data with it's own thread to handle publishing of queued statistics data. It was constructed using a similar pattern used in the other classes called in init.cpp and is initialized in the same location.

  TopicManager::instance()->start();
  ServiceManager::instance()->start();
  ConnectionManager::instance()->start();
  PollManager::instance()->start();
  XMLRPCManager::instance()->start();
  StatisticsManager::instance()->start(); <------  new class 

Instead of publishing directly, now the StatisticsLogger class calls StatisticsManager::addToQueue to add the msg to a mutex protected deque and the threadFunc handed off to the new thread consumes and publishes the msgs. This change moves the initialization of the publisher and publishing of the messages off of the PollManager thread.

Attached in a subsequent comment is a small set of code that can be used to recreate the deadlock. You'll need to clone ros_comm where is says to to provide an overlay and after building the package, these steps are required to recreate it locally:

  1. Start roscore and set enable_statistics to true. This can be done via the launch file in the zip file.

  2. Introduce a network delay to simulate a highly loaded network. This can be done via the instructions in this link. The same instructions exist inside the package inside minimal_deadlock_node.cpp. Target your loopback connection if testing on a local machine. https://www.pico.net/kb/how-can-i-simulate-delayed-and-dropped-packets-in-linux/

  3. Playback a rosbag with tf topics or have some node consistently broadcast tf data.

  4. Run the node via 'rosrun minimal_deadlock minimal_deadlock_node` The terminal will output the timing of specific locks and unlocks. Depending on the computer performance adjusting the timing of the delay and the timing of the sleep call in minimal_deadlock.cpp.

After switching to the updated branch and rebuilding, the deadlock is no longer possible to recreate with the same steps.

JesseAtTorc commented 2 years ago

Attached is the isolated example:

deadlock.zip

Additionally, our legal team has two questions for the ROS maintainers on making this contribution. Some of these items aren't clear from a review of the repository and wiki and we're hoping that we might get some clarity as we move forward.

  1. Can we confirm that ROS Comm is licensed under the BSD license? Is there a page on GitHub or somewhere they can point us to where we can learn more about and confirm our understanding of the licensing terms for ROS?

  2. What are the licensing terms from us to the community for the code we want to contribute? Is there a form we would need to complete and sign?

JesseAtTorc commented 2 years ago

This PR resolves #1329