ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0

Cyclone DDS hangs in Galactic within a VM with all defaults using ros2cli #383

Closed: AlexisTM closed this issue 2 years ago

AlexisTM commented 2 years ago

Bug report

Required Info:

- Operating system: Apertis 2022 SDK, running in a VM
- Installation type: built from source
- ROS 2 version: Galactic
- DDS implementation: Cyclone DDS (rmw_cyclonedds_cpp)

Steps to reproduce issue

- Build ROS 2 from source
- ros2 run examples_rclcpp_minimal_composition composition_composed # The node doesn't matter
- ros2 node list # in another terminal

Expected behavior

ros2 node list returns the list of running nodes.

Actual behavior

ros2 node list hangs and never exits (even with CTRL+C).

Additional information

Running RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 node list shows the correct behaviour.

Starting the daemon with debug prevents the hang, but doesn't produce the expected output (below, the /topic topic exists and is published by the previously started composed node, which is still running fine).

user@apertis:~/ros2/bosch$ ros2 daemon stop 
The daemon has been stopped

user@apertis:~/ros2/bosch$ ros2 daemon start --debug
Interface kind: 2, info: [('10.0.2.2', 'enp0s3', True)]
Addresses by interfaces: {2: {'enp0s3': '10.0.2.15'}}
Serving XML-RPC on localhost:11511/ros2cli/
The daemon has been started

user@apertis:~/ros2/bosch$ ros2 topic echo /topic
get_topic_names_and_types()
Interface kind: 2, info: [('10.0.2.2', 'enp0s3', True)]
Addresses by interfaces: {2: {'enp0s3': '10.0.2.15'}}
get_name()
Interface kind: 2, info: [('10.0.2.2', 'enp0s3', True)]
Addresses by interfaces: {2: {'enp0s3': '10.0.2.15'}}
get_namespace()
Interface kind: 2, info: [('10.0.2.2', 'enp0s3', True)]
Addresses by interfaces: {2: {'enp0s3': '10.0.2.15'}}
WARNING: topic [/topic] does not appear to be published yet
Could not determine the type for the passed topic

user@apertis:~/ros2/bosch$ ros2 node list
get_node_names_and_namespaces()
Interface kind: 2, info: [('10.0.2.2', 'enp0s3', True)]
Addresses by interfaces: {2: {'enp0s3': '10.0.2.15'}}
eboasson commented 2 years ago

I am afraid I don't (won't) have an Apertis 2022 SDK at hand. It certainly sounds like it is exhibiting some behaviour that interacts badly with Cyclone DDS. Unusual, but not impossible.

What I usually suggest is to gather some tracing information from Cyclone DDS, because that usually gives some insight into what is going on when everything works but there is no communication. Enabling it is as simple as putting

CYCLONEDDS_URI="<Tr><V>finest</><Out>cdds.log.\${CYCLONEDDS_PID}</>"

in the environment. That log starts with the configuration options, then you get some information on network selection and so on.
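For example, a minimal shell sketch (the node command is taken from the reproduction steps above; single quotes stop the shell from expanding ${CYCLONEDDS_PID} so Cyclone DDS can substitute the process ID itself):

# enable verbose Cyclone DDS tracing for every process started from this shell
export CYCLONEDDS_URI='<Tr><V>finest</><Out>cdds.log.${CYCLONEDDS_PID}</>'

# each process then writes its own cdds.log.<pid>
ros2 run examples_rclcpp_minimal_composition composition_composed
ros2 node list    # in a second terminal with the same environment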

It is not so likely to give a clue as to why it hangs. For that, generally the best thing to do is to attach gdb and get stack traces for all threads (thread apply all bt) and look for something suspicious.
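A sketch of that step, assuming the hanging process is the ros2 node list command (the pgrep pattern and the output file name are just placeholders):

# attach to the hung process, dump stack traces for all threads, then exit
# (may require sudo depending on the system's ptrace settings)
gdb -p "$(pgrep -f 'ros2 node list')" -batch -ex 'thread apply all bt' > ros2cli-backtraces.txt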

AlexisTM commented 2 years ago

For the following logs, no other node was running.

cdds.log.1189.txt

cdds.log.1190.txt

eboasson commented 2 years ago

Looks to me like you have a firewall blocking all multicast traffic. This causes two problems: discovery never completes, so the ros2cli tools find no nodes or topics; and, it seems, it also triggers the hang.

I suspect the hanging is caused by ROS 2's signal handler being "too nice" when stopping cleanly turns out to be impossible.

I'd suggest allowing multicast, but you can also disable multicast altogether (it is just that you lose out on a lot of niceties and add a significant amount of overhead):

<General>
  <AllowMulticast>false</AllowMulticast>
</General>
<Discovery>
  <ParticipantIndex>auto</ParticipantIndex>
  <Peers>
    <Peer address="localhost"/>
  </Peers>
</Discovery>
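A minimal sketch of how that snippet could be applied, assuming it is wrapped in a complete configuration file following the Cyclone DDS configuration format (the file path and wrapper elements are illustrative):

cat > /tmp/cyclonedds.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <General>
      <AllowMulticast>false</AllowMulticast>
    </General>
    <Discovery>
      <ParticipantIndex>auto</ParticipantIndex>
      <Peers>
        <Peer address="localhost"/>
      </Peers>
    </Discovery>
  </Domain>
</CycloneDDS>
EOF

# point Cyclone DDS at the file and retry the CLI
export CYCLONEDDS_URI=file:///tmp/cyclonedds.xml
ros2 node list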

The hanging you could fix by not having threads dedicated to receiving data on a specific socket (Internal/MultipleReceiveThreads = false), but that alone won't make discovery work. Both options above that make discovery work should also solve the hanging.
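A sketch of setting just that option via an inline configuration fragment (it only addresses the hang, not discovery):

export CYCLONEDDS_URI='<Internal><MultipleReceiveThreads>false</MultipleReceiveThreads></Internal>'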

AlexisTM commented 2 years ago

That seems to be the problem indeed. Thank you!