ros / actionlib

Provides a standardized interface for interfacing with preemptable tasks. Examples of this include moving the base to a target location, performing a laser scan and returning the resulting point cloud, detecting the handle of a door, etc.
http://www.ros.org/wiki/actionlib
96 stars 154 forks source link

Some ActionClients being left waiting forever when server sends results to two or more clients in a very short span of time #144

Open glpuga opened 5 years ago

glpuga commented 5 years ago

I found an issue when the cpp Action Server sends results to two or more clients during a very short span of time, which causes all but one of the clients to be left hanging in most cases.

I tracked the problem to being very likely related to the default value of the queue length (only one message long) used for the feedback and result subscribers in the ActionClient, see action_client.h:224.

This is related to the changes made in PR https://github.com/ros/actionlib/pull/55

Since all clients share the same feedback and result topics, when the server sends multiple messages addressed to different clients in quick succession, all of the clients subscriber queues become full after the first message and drop the following messages.

As a result, in most cases the clients that were meant to receive the second, third or later messages become hanged waiting for results that were lost and will not be resent.

A similar problem may cause feedback messages to be silently dropped.

While the queue length can be configured using the parameter server, a higher default value may reduce the chances of this problem arising. This would be specially useful since it's hard to track the symptoms to the cause when clients are left waiting indefinitely.

This issue was noted in ROS Kinetic, but it's probably present in other versions as well.

glpuga commented 5 years ago

Poking @mikaelarguedas who knows this code.

fujitatomoya commented 4 years ago

@glpuga

i do agree with you, it is so unlikely to happen but eventually there is a possibility. since it is configurable using parameter actionlib_client_pub_queue_size/actionlib_client_sub_queue_size, it is user responsibility what to set.

StefanFabian commented 4 years ago

@fujitatomoya I have to strongly disagree. You're right that most users may not ever encounter this limitation but it's not as unlikely as you think it is.

As far as I understand it, the idea of the ActionClient is explicitly to support multiple goals whereas the SimpleActionClient simplifies that to tracking a single goal. As it is, the ActionClient does not support multiple goals fully. A simple example would be sending 2 or more goals to an ActionServer and canceling those goals using cancelAllGoals(). Depending on the implementation of the server, this will result in the cancel messages from the server to be sent faster than the client-side ROS queue will be handled and all except for one goal will be in a broken state stuck forever in waiting for a cancel ACK.

This is a critical bug and no sane user would expect the queue size to be set to such a low value by default as there is no reason to do so. Also, the python implementation of the action client does not contain this bug (I assume setting the pub_queue_size to None does the same as setting it to 0, i.e., disables the queue size limit).

I would suggest to at least change this: https://github.com/ros/actionlib/blob/6673539a82956f71fe06cfc625a1d3c5db841211/actionlib/include/actionlib/client/action_client.h#L223-L226

to a sub_queue_size of 100 or better yet 0. Not as critical but still noteworthy, the pub_queue_size could also be increased significantly. That limitation may not usually be hit but there's also not really reason I could think of to set it that low. Especially, given that most uncritical queue sizes in other software stacks are usually set to sizes in the hundreds or thousands.

I can make a PR if you want, though it may be a bit overkill for at most 10 line changes.

fujitatomoya commented 4 years ago

@StefanFabian

SimpleActionClient simplifies that to tracking a single goal. As it is, the ActionClient does not support multiple goals fully.

correct,

A simple example would be sending 2 or more goals to an ActionServer and canceling those goals using cancelAllGoals(). Depending on the implementation of the server, this will result in the cancel messages from the server to be sent faster than the client-side ROS queue will be handled and all except for one goal will be in a broken state stuck forever in waiting for a cancel ACK.

i do understand, and yes logically that is possible or probable.

Also, the python implementation of the action client does not contain this bug (I assume setting the pub_queue_size to None does the same as setting it to 0, i.e., disables the queue size limit).

None means infinite(default), agreed on this. at least this should be consistent as ActionClient python. (i cannot explain why there is a difference between cpp and python.)

fujitatomoya commented 4 years ago

@StefanFabian @glpuga

https://github.com/ros/actionlib/pull/162, that is an easy fix, but could you check just in case?