ros / ros_comm

ROS communications-related packages, including core client libraries (roscpp, rospy, roslisp) and graph introspection tools (rostopic, rosnode, rosservice, rosparam).
http://wiki.ros.org/ros_comm
761 stars 912 forks source link

roscpp multithreaded spinners eat up CPU when callbacks take too long #2341

Open Zob314 opened 1 year ago

Zob314 commented 1 year ago

Issue https://github.com/ros/ros_comm/issues/1545 was fixed in ROS melodic, but not noetic.

The fix for the issue took a rather meandering rout to getting in, but PR https://github.com/ros/ros_comm/pull/1684 and PR https://github.com/ros/ros_comm/pull/2014 both seem related.

As far as I can tell, the offending functions in noetic are the same as in melodic pre fix, so this might be as easy as cherry picking those changes over.

meyerj commented 4 weeks ago

:+1:

I was the original author of https://github.com/cwecht/ros_comm/pull/1, a follow-up on @cwecht's https://github.com/ros/ros_comm/pull/1684, that got cerry-picked and merged into melodic-devel in https://github.com/ros/ros_comm/pull/2014.

The main difference between the two versions is that if the callback queue is not empty, but none of the callbacks is ready, the current version on noetic-devel without that patch returns TryAgain immediately without waiting, even if the timeout argument is non-zero. And that ultimately leads to the original issue #1545. The updated version also waits in case of a non-empty queue, and only returns TryAgain afterwards if and only if the wait call timed out.

The problem occurs in the following situation:

As a result, the callback queue (callbacks_) is not empty and hence the call to callOne() does not block, but none of the calls is actually ready and the attempt to actually call it results in a TryAgain response, such that it is enqueued again etc., and all normally idle threads will do that concurrently without sleeping in between.

There is another minor difference between melodic-devel and noetic-devel introduced in https://github.com/ros/ros_comm/pull/1684, namely that SubscriptionQueue::ready() does not return true unconditionally in melodic-devel, but only if SubscriptionQueue::call() would not return CallbackInterface::TryAgain later. So no need to take the callback from the queue, just to then learn that it cannot be executed yet and then be pushed to the end of the queue again. That also may cause message reordering if multiple subscribers share the same CallbackQueue instance.

Are you interested in yet another PR to forward-port the relevant commits to noetic-devel?

meyerj commented 4 weeks ago

Are you interested in yet another PR to forward-port the relevant commits to noetic-devel?

Done: https://github.com/ros/ros_comm/pull/2377