moveit / moveit

:robot: The MoveIt motion planning framework
http://moveit.ros.org
BSD 3-Clause "New" or "Revised" License

Apparent race condition causes trajectory_execution_manager to crash #1481

Open henryponti opened 5 years ago

henryponti commented 5 years ago

Description

When a trajectory is preempted, trajectory_execution_manager sometimes crashes with a boost mutex error.

Environment

Steps to reproduce

The script below should cause the crash within a few minutes.

https://gist.github.com/henryponti/21d14e59804d294fc3617e21ccdbd91e

Expected behaviour

move_group should allow the trajectory to be preempted without crashing.

Actual behaviour

move_group crashes.

Backtrace

https://gist.github.com/henryponti/5ed3bafd6f66fb4aedbadea2df82f336

welcome[bot] commented 5 years ago

Thanks for reporting an issue. We will have a look asap. If you can think of a fix, please consider providing it as a pull request.

BryceStevenWilley commented 5 years ago

Ran this with debug symbols; it seems that the execution_thread_ that the TrajectoryExecutionManager is holding gets cleared at some point, causing the segfault. Crashing line here.
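
To make the failure mode concrete, here is a minimal model of the pattern (placeholder names only, not the actual MoveIt classes): an unsynchronized thread pointer that one thread joins while another resets it.

```cpp
// Minimal model of the suspected race -- placeholder types, NOT the MoveIt code.
// Two threads call stopExecution() on the same object without synchronization;
// one can reset the thread pointer while the other is still dereferencing it.
#include <chrono>
#include <memory>
#include <thread>

struct ExecutionManagerModel
{
  std::unique_ptr<std::thread> execution_thread_;  // stands in for execution_thread_

  void execute()
  {
    execution_thread_ = std::make_unique<std::thread>(
        [] { std::this_thread::sleep_for(std::chrono::milliseconds(1)); });
  }

  void stopExecution()
  {
    if (execution_thread_)        // another thread may call reset() right after this check...
    {
      execution_thread_->join();  // ...or even while join() is blocking here
      execution_thread_.reset();
    }
  }
};

int main()
{
  ExecutionManagerModel m;
  for (int i = 0; i < 100000; ++i)  // the race is intermittent; it may take many iterations
  {
    m.execute();
    std::thread a([&m] { m.stopExecution(); });
    std::thread b([&m] { m.stopExecution(); });
    a.join();
    b.join();
  }
}
```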

henryponti commented 5 years ago

Thanks for the additional info @BryceStevenWilley. I tried to check if the thread is joinable, but that didn't help:

https://gist.github.com/henryponti/cbef093c1c00f635bdb0f5a2c9204c56

Could the problem be related to this?

https://stackoverflow.com/questions/51438287/boostthreadjoin-crash-trying-to-reference-destroyed-thread-info

Calling detach() instead of join() does seem to help:

https://gist.github.com/henryponti/c70dc65f78a31b82be9a6ef041cf827d

But it also crashes eventually:

https://gist.github.com/henryponti/c5fb862a8ff6ae7863fcc65d16d3f17a

BryceStevenWilley commented 5 years ago

From what I could tell (at least with how it was manifesting on my machine), the issue isn't whether the thread is joinable; it's the fact that the thread pointer is null. Calling any member function through a null pointer will cause the segfault. I think that execution_thread_.reset(); is being called from another thread somehow, likely due to some buggy handling of the mutex/boolean execution_complete_ variable.
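
That would also explain why the joinable() check and detach() didn't help: the check and the call are two separate steps on the same unsynchronized pointer, so the other thread can reset it in between. A guarded variant of stopExecution() from the model in my earlier comment (again placeholder names, not the actual MoveIt code) only narrows the window:

```cpp
// Guarded variant of stopExecution() for the model above -- the extra checks
// narrow the race window but cannot close it without a lock.
void stopExecutionGuarded(ExecutionManagerModel& m)
{
  if (m.execution_thread_ && m.execution_thread_->joinable())
  {
    // <-- nothing stops the other thread from calling execution_thread_.reset() here
    m.execution_thread_->join();  // the same applies to detach(): the call still goes
                                  // through a pointer that may already have been reset
    m.execution_thread_.reset();
  }
}
```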

I just read through the Stack Overflow question you posted; it seems very much related, but I'm out of my depth with the exact situation that's happening here.

henryponti commented 5 years ago

I agree. I tried checking whether the pointer is null as well, but that didn't help either. It looks like some deeper digging will be necessary. I'm not terribly familiar with the architecture, though, so I'm not sure I'll be able to get to the bottom of the problem without some extra help.

https://gist.github.com/henryponti/6d557b87ca42e26f40abb5b6826317dc

https://gist.github.com/henryponti/819d8e043d5a4497d5bb258161766bf6

rhaschke commented 5 years ago

@BryceStevenWilley, you are exactly right: Essentially we have two non-synchronized threads competing for access to execution_thread_.

  1. The ActionServer thread, calling stopExecution() before executing a new path and after waiting for its completion:
    #11 trajectory_execution_manager::TrajectoryExecutionManager::stopExecution(bool) () from /home/henry/catkin_ws/devel/lib/libmoveit_trajectory_execution_manager.so.1.0.1
    #12 trajectory_execution_manager::TrajectoryExecutionManager::waitForExecution() () from /home/henry/catkin_ws/devel/lib/libmoveit_trajectory_execution_manager.so.1.0.1
    #13 move_group::MoveGroupExecuteTrajectoryAction::executePath(boost::shared_ptr<moveit_msgs::ExecuteTrajectoryGoal_<std::allocator<void> > const> const&, moveit_msgs::ExecuteTrajectoryResult_<std::allocator<void> >&) () from /home/henry/catkin_ws/devel/lib/libmoveit_move_group_default_capabilities.so
    #14 move_group::MoveGroupExecuteTrajectoryAction::executePathCallback(boost::shared_ptr<moveit_msgs::ExecuteTrajectoryGoal_<std::allocator<void> > const> const&) () from /home/henry/catkin_ws/devel/lib/libmoveit_move_group_default_capabilities.so
    #15 actionlib::SimpleActionServer<moveit_msgs::ExecuteTrajectoryAction_<std::allocator<void> > >::executeLoop() () from /home/henry/catkin_ws/devel/lib/libmoveit_move_group_default_capabilities.so
  2. The ROS event loop, calling stopExecution() when receiving stop on the trajectory_execution_event topic:
    #0 trajectory_execution_manager::TrajectoryExecutionManager::stopExecution(bool)@plt () from /vol/sandbox/ros/devel/lib/libmoveit_trajectory_execution_manager.so.1.0.1
    #1 trajectory_execution_manager::TrajectoryExecutionManager::processEvent (this=0x555556092340, event="stop") at /vol/sandbox/ros/src/moveit/moveit_ros/planning/trajectory_execution_manager/src/trajectory_execution_manager.cpp:225
    #2 trajectory_execution_manager::TrajectoryExecutionManager::receiveEvent (this=0x555556092340, event=...) at /vol/sandbox/ros/src/moveit/moveit_ros/planning/trajectory_execution_manager/src/trajectory_execution_manager.cpp:233
  3. Finally there is the ActionServer thread, calling stopExecution() when preempting:
    #4 trajectory_execution_manager::TrajectoryExecutionManager::stopExecution (this=0x5555560921a0, auto_clear=true) at /vol/sandbox/ros/src/moveit/moveit_ros/planning/trajectory_execution_manager/src/trajectory_execution_manager.cpp:1173
    #5 move_group::MoveGroupExecuteTrajectoryAction::preemptExecuteTrajectoryCallback (this=0x5555560e4510) at /vol/sandbox/ros/src/moveit/moveit_ros/move_group/src/default_capabilities/execute_trajectory_action_capability.cpp:128

This raises several questions:

  1. Why do we need two different mechanisms to interrupt a trajectory? Probably, preempting through the ActionServer is the cleaner way, but having a direct topic allows for an emergency stop from other nodes as well...
  2. Why are they both used from the MoveGroupInterface?
  3. How can we mutex-protect the access? In particular, which variables, apart from execution_thread_, need to be protected?
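
Regarding 3, one possible direction, shown here only as an illustrative sketch on the placeholder model from above and not as a tested patch against TrajectoryExecutionManager: guard every access to execution_thread_ with a dedicated mutex, take ownership of the pointer under the lock, and do the join() outside the lock.

```cpp
// Illustrative sketch only (placeholder model, not a tested MoveIt patch):
// a dedicated mutex guards every read/write of execution_thread_; stopExecution()
// takes ownership of the thread under the lock and joins it outside the lock, so
// concurrent callers can no longer race on the pointer and the mutex is never
// held across a potentially long join().
#include <memory>
#include <mutex>
#include <thread>

class LockedExecutionManagerModel
{
public:
  void execute()
  {
    // assumes any previous execution was stopped first
    std::lock_guard<std::mutex> lock(thread_mutex_);
    execution_thread_ = std::make_unique<std::thread>([] { /* run the trajectory */ });
  }

  void stopExecution()
  {
    std::unique_ptr<std::thread> thread;
    {
      std::lock_guard<std::mutex> lock(thread_mutex_);
      thread = std::move(execution_thread_);  // pointer is now null for every other caller
    }
    if (thread && thread->joinable())
      thread->join();  // outside the lock, so other callers are not blocked meanwhile
  }

private:
  std::mutex thread_mutex_;  // guards execution_thread_ only
  std::unique_ptr<std::thread> execution_thread_;
};
```

Note the changed semantics: with this scheme a second concurrent stopExecution() returns immediately instead of waiting for the execution to actually finish, and execution_complete_ (plus whatever waitForExecution() touches) would need the same treatment, which is exactly what question 3 is asking.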

henryponti commented 5 years ago

I haven't had a chance to look into that over the past few days. I am hoping to take another look early next week. The information from @BryceStevenWilley will help a lot.