ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0
112 stars 91 forks source link

Hidden limitation in action server names? #384

Closed pauldinh closed 2 years ago

pauldinh commented 2 years ago

Bug report

Required Info:

Steps to reproduce issue

- Create rclpy.action.ActionServer with action_name > 98 characters
- Create rclpy.action.ActionClient connecting to this server

- Call client.wait_for_server()              # to witness it blocking indefinitely
- Call client.wait_for_server(timeout_sec=2) # to witness it times out
- Call client.send_goal()                    # to witness server accepts the goal

I have also pushed a repo to make it simpler to test:

- mkdir -p ~/action_ws/src
- cd ~/action_ws/src
- git clone https://github.com/pauldinh/action_name_test.git
- cd ~/action_ws
- rosdep update && rosdep install --from-paths src -iyr
- colcon build --symlink-install
- source ~/action_ws/install/setup.bash
- ros2 run action_name_test server

# in second terminal
- source ~/action_ws/install/setup.bash
- ros2 run action_name_test client

Expected behavior

Clients can connect to the action servers with no issue, or servers throw an exception if the length of the action name is too long (similar to how an exception is thrown if topics are ~255 characters).

Actual behavior

Action servers don't display any warning/exception unless a string of ~255 is used (whatever the actual limit is).

Action clients either block indefinitely when calling wait_for_server() or they always timeout if a timeout is specified with the call. However, if you pass in a timeout so it doesn't block, you can send the goal anyway and everything seems to work.

Snippet of output from the example repo:

[INFO] [1648651460.647444469] [fibonacci_action_client_97]: client.wait_for_server(timeout_sec=1) returned `True` for server length `97`
[INFO] [1648651460.647722267] [fibonacci_action_client_98]: client.wait_for_server(timeout_sec=1) returned `True` for server length `98`
[WARN] [1648651461.650723237] [fibonacci_action_client_99]: client.wait_for_server(timeout_sec=1) returned `False` for server length `99`
[WARN] [1648651462.653469814] [fibonacci_action_client_100]: client.wait_for_server(timeout_sec=1) returned `False` for server length `100`

You can check the README in the repo for sample output that shows the clients sending the goal anyway. The servers accept, process, and return a result for each client, as if nothing is wrong.

I do not see this behavior with FastRTPS.

eboasson commented 2 years ago

Curious 🙂

I suspect it is this: https://github.com/ros2/rmw_cyclonedds/blob/f9f211f34151bb4bbaf4fae17072537559b4dab7/rmw_cyclonedds_cpp/src/rmw_node.cpp#L5009 and that the condition for retrying is simply wrong. Blame it on my relationship to C++.

Some "TLC" did happen and dds_get_name these days returns the actual length so if at first the buffer is too small, one can simply retry with a buffer that you know to be large enough instead of this kludge.

I don't currently have a build of ROS 2 at hand and won't be able to do a pull request today (the direct result of macOS no longer being tier 1 ...) but if you happen to have everything set up for a quick experiment, you might want to try increasing the 128 in the referenced line.

yamokosk commented 2 years ago

Curious 🙂

I suspect it is this:

https://github.com/ros2/rmw_cyclonedds/blob/f9f211f34151bb4bbaf4fae17072537559b4dab7/rmw_cyclonedds_cpp/src/rmw_node.cpp#L5009

and that the condition for retrying is simply wrong. Blame it on my relationship to C++. Some "TLC" did happen and dds_get_name these days returns the actual length so if at first the buffer is too small, one can simply retry with a buffer that you know to be large enough instead of this kludge.

I don't currently have a build of ROS 2 at hand and won't be able to do a pull request today (the direct result of macOS no longer being tier 1 ...) but if you happen to have everything set up for a quick experiment, you might want to try increasing the 128 in the referenced line.

I don't think its this. 128 characters doesn't line up with the 99 character limit reported.

Grep'ed the cyclone code base (https://github.com/eclipse-cyclonedds/cyclonedds) and that shows char topic_name[100] is all over the place. Based on the reported character count, this seems a more likely culprit. Yay, magic numbers!

pauldinh commented 2 years ago

but if you happen to have everything set up for a quick experiment, you might want to try increasing the 128 in the referenced line.

Yeah this looks like the culprit. I doubled the vector to 255 and I haven't been able to reproduce the issue. I also tested bumping it up by 1 (to 129) which resulted in 99 no longer timing out like it used to:

[INFO] [1648687538.394413767] [fibonacci_action_client_97]: client.wait_for_server() timed out? False
[INFO] [1648687538.394787646] [fibonacci_action_client_98]: client.wait_for_server() timed out? False
[INFO] [1648687538.395153215] [fibonacci_action_client_99]: client.wait_for_server() timed out? False
[WARN] [1648687538.395611024] [fibonacci_action_client_100]: client.wait_for_server() timed out? True

Earlier today, after I filed the ticket, I observed this happening with services too:

And after doubling that line to 255 this also went away.

pauldinh commented 2 years ago

Grep'ed the cyclone code base (https://github.com/eclipse-cyclonedds/cyclonedds) and that shows char topic_name[100] is all over the place. Based on the reported character count, this seems a more likely culprit. Yay, magic numbers!

All of the char topic_name[100] hits are unit test related so it's possible that's just an arbitrary number chosen for the tests. An unlucky coincidence if so.

eboasson commented 2 years ago

And after doubling that line to 255 this also went away.

Thanks for confirming!

All of the char topic_name[100] hits are unit test related so it's possible that's just an arbitrary number chosen for the tests. An unlucky coincidence if so

Quite so.

The discrepancy between the service name length and the buffer size is caused by the name mangling in ROS. Probably if you'd shorten your name space with a few characters the limit would have gone up to just above 100. It is just that that experiment would at best increase confusion.

So just a simple bug in the RMW layer. (To be honest, this upsets me a lot less than one in Cyclone proper, even though I feel a bit stupid for getting this loop wrong ... 🙂)

pauldinh commented 2 years ago

Thanks for the quick turnaround!