ros / ros_comm

ROS communications-related packages, including core client libraries (roscpp, rospy, roslisp) and graph introspection tools (rostopic, rosnode, rosservice, rosparam).
http://wiki.ros.org/ros_comm
747 stars 914 forks source link

ros::Timer & ros::ServiceServer finalization failure #2283

Open aurzenligl opened 1 year ago

aurzenligl commented 1 year ago

Delivering #2121 broke existing ros::Timer and ros::ServiceServer usages, introducing possibility of using destroyed objects by last callback call. Affected usages in public projects look like this: https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/include/moveit_servo/collision_check.h https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/src/collision_check.cpp One production system I'm aware of was affected as well.

Before #2121 stop/shutdown post-condition ensured callback is not being executed, now it doesn't, allowing destruction of referred-to data and undefined behavior when last callback call continues. Following example illustrates such crash scenario: https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp

When ros::Timer or ros::ServiceServer is created with a callback capturing data by reference, there is no easy way to guarantee that lifetime of callback execution doesn't exceed the lifetime of data it uses. Problem occurs in code bases where objects composing timers or srv-servers (capturing this) are created/destroyed repeatedly while spinners are running.

Phrasing that documentation uses to describe stop/shutdown/dtor behavior is as such:

It's imprecise wrt waiting for currently executing callback, but since all above classes did block in stop/shutdown before #2121, and ros::Subscriber still blocks, I assume proper/expected/consistent behavior is to block.

iwanders commented 1 year ago

Following example illustrates such crash scenario: https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp

Yikes, that's definitely not good. Thanks for a minimal example to reproduce this failure.

I assume proper/expected/consistent behavior is to block. & doesn't deadlock as long as timer stop is called w/o external mtx locked (from #2284)

I'm not sure I agree with this behaviour, nothing in the documentation states that cancelling the timer will only happen after any current callbacks complete their execution. In my opinion cancelling a timer should be non blocking in all situations. But I think this is a matter of opinion, and different people may have different assumptions around this.

If it is blocking until any callbacks are completed (like in #2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play. Requiring the user to unlock its mutex to allow any callback to complete before being able to cancel the timer (as proposed there) may not be possible to do without introducing any thread safety issues on the user side. :thinking:

I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (#2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved.

fyi @guillaumeautran , @mikepurvis , @jasonimercer for awareness. The timer deadlock fix from #2121 (CORE-18232) introduces memory unsafety in some situations.

aurzenligl commented 1 year ago

If it is blocking until any callbacks are completed (like in https://github.com/ros/ros_comm/pull/2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play.

True, external mutex use-case can coexist with the old blocking stop behavior, assuming that:

I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (https://github.com/ros/ros_comm/pull/2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved.

It would be great to have the old blocking behavior back and defined in documentation. :+1: Still, self-removal might be a worthwhile addition, previously self-removal would deadlock, so couldn't have been relied on.