Open aurzenligl opened 1 year ago
Following example illustrates such crash scenario: https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp
Yikes, that's definitely not good. Thanks for a minimal example to reproduce this failure.
I assume proper/expected/consistent behavior is to block. & doesn't deadlock as long as timer stop is called w/o external mtx locked (from #2284)
I'm not sure I agree with this behaviour, nothing in the documentation states that cancelling the timer will only happen after any current callbacks complete their execution. In my opinion cancelling a timer should be non blocking in all situations. But I think this is a matter of opinion, and different people may have different assumptions around this.
If it is blocking until any callbacks are completed (like in #2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play. Requiring the user to unlock its mutex to allow any callback to complete before being able to cancel the timer (as proposed there) may not be possible to do without introducing any thread safety issues on the user side. :thinking:
I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (#2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved.
fyi @guillaumeautran , @mikepurvis , @jasonimercer for awareness. The timer deadlock fix from #2121 (CORE-18232) introduces memory unsafety in some situations.
If it is blocking until any callbacks are completed (like in https://github.com/ros/ros_comm/pull/2284) we go back to the situation where it's very easy to accidentally make the system deadlock with an external mutex in play.
True, external mutex use-case can coexist with the old blocking stop behavior, assuming that:
timer.stop
calltimer.stop
, B: below stop), AASBB must be legal
I'm not too sure yet how we can solve this in a way that works for all use cases, especially if the historic behaviour has been to block and there's code out there that relies on that. Maybe we revert the fix for the deadlock (https://github.com/ros/ros_comm/pull/2121) and update the documentation? Stating that all running callbacks must be able to finish while blocked on the timer stop? Basically reverting back to the old behaviour and possibly deadlocking when there's a user mutex involved.
It would be great to have the old blocking behavior back and defined in documentation. :+1: Still, self-removal might be a worthwhile addition, previously self-removal would deadlock, so couldn't have been relied on.
Delivering #2121 broke existing
ros::Timer
andros::ServiceServer
usages, introducing possibility of using destroyed objects by last callback call. Affected usages in public projects look like this: https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/include/moveit_servo/collision_check.h https://github.com/ros-planning/moveit/blob/1.1.10/moveit_ros/moveit_servo/src/collision_check.cpp One production system I'm aware of was affected as well.Before #2121 stop/shutdown post-condition ensured callback is not being executed, now it doesn't, allowing destruction of referred-to data and undefined behavior when last callback call continues. Following example illustrates such crash scenario: https://github.com/aurzenligl/study/blob/master/ros-timer/src/race.cpp
When
ros::Timer
orros::ServiceServer
is created with a callback capturing data by reference, there is no easy way to guarantee that lifetime of callback execution doesn't exceed the lifetime of data it uses. Problem occurs in code bases where objects composing timers or srv-servers (capturingthis
) are created/destroyed repeatedly while spinners are running.Phrasing that documentation uses to describe stop/shutdown/dtor behavior is as such:
It's imprecise wrt waiting for currently executing callback, but since all above classes did block in stop/shutdown before #2121, and
ros::Subscriber
still blocks, I assume proper/expected/consistent behavior is to block.