ros2 / rmw_dps

Implementation of the ROS Middleware (rmw) Interface using Intel's Distributed Publish & Subscribe.
Apache License 2.0

Nodes discovery taking too long #38

Open mauropasse opened 5 years ago

mauropasse commented 5 years ago

Hello, I'm testing the introspection branch by running one of the iRobot benchmark tests, and I noticed that the list of discovered nodes is not updated in time; most of the time the nodes are not fully discovered. The issue seems to be in the function get_discovered_nodes, defined in rmw_dps_cpp/include/rmw_dps_cpp/custom_node_info.hpp. The map discovered_nodes_ is not populated with all the nodes present in the system.

If you'd like to reproduce the issue:

git clone https://github.com/irobot-ros/ros2-performance.git
mkdir -p ~/performance_test_ws/src
cp -r ros2-performance/performances/ ~/performance_test_ws/src
cd ~/performance_test_ws
colcon build
source install/local_setup.bash
cd install/benchmark/lib/benchmark/
./benchmark topology/sierra_nevada.json
# Here it hangs forever in the discovery process

The problem starts to appear once there are about 10 nodes in the ROS2 system.

Any ideas about what could be wrong? Thanks

malsbat commented 5 years ago

@mauropasse, I've been unable to reproduce the issue after running it locally all morning. My best guess is that either an advertisement publication was dropped or that the notification condition failed to trigger.

The first seems unlikely, as that would imply we are overflowing the socket buffers.

I have been debugging some intermittent test failures with test_rclcpp tests that may be related to the second guess. I'll keep you updated on my progress.
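
To make the second guess concrete, here is a minimal C++ sketch of how a waiter can tolerate a notification that fires before it starts waiting. It is only an illustration: DiscoveryCache, on_advertisement and wait_for_nodes are hypothetical names and are not the actual rmw_dps_cpp code; only the idea of a discovered_nodes_ map comes from this thread. The waiter checks a predicate on the map itself, so an early notification is not lost, and the timeout turns an endless hang into a failure that can be reported.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a discovery cache; NOT the rmw_dps_cpp class.
class DiscoveryCache
{
public:
  // Assumed to be called from the advertisement-receive path.
  void on_advertisement(const std::string & node_name, const std::string & node_namespace)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      discovered_nodes_[node_name] = node_namespace;
    }
    // Notify after the map is updated so a woken waiter sees the new entry.
    cv_.notify_all();
  }

  // Blocks until at least `expected` nodes are known or `timeout` elapses.
  // Returns true if enough nodes were discovered, false on timeout.
  bool wait_for_nodes(std::size_t expected, std::chrono::milliseconds timeout)
  {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> lock(mutex_);
    // The predicate is evaluated before sleeping, so a notification sent
    // before this call cannot be missed.
    return cv_.wait_for(lock, timeout, [&] {return discovered_nodes_.size() >= expected;});
  }

  // Number of nodes currently in the map.
  std::size_t size() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return discovered_nodes_.size();
  }

private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  std::map<std::string, std::string> discovered_nodes_;
};
```

With something like this, a benchmark could call wait_for_nodes with the expected node count and log the value of size() on timeout, which would show how many nodes were missing instead of hanging in discovery.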

mauropasse commented 5 years ago

@malsbat, it's strange that you can't reproduce the issue. I ran the same test on 4 laptops and on the RPi2, and I get the same problem on all of them. We had this problem when testing FastRTPS, where the cause was network overflow due to the high frequency of discovery messages from the built-in endpoints. Anyway, keep me posted if you find anything strange. Thanks!

malsbat commented 5 years ago

@mauropasse, I wanted to give a short update on this one: the test failures I see related to the guard conditions appear to be in the rcl multithreaded executor code. I'm still investigating, but the error reproduces with rmw_fastrtps as well, although less frequently.