Open · ksuszka opened this issue 1 year ago
Sorry, there is not much I can help with without a reproducible environment...
> I'm not 100% sure that it is caused by the FastDDS middleware or its SHM transport

To be sure about this, how about disabling the SHM transport via a configuration file? If everything works without any problem, the SHM transport could be what is producing the issue.
See https://fast-dds.docs.eprosima.com/en/latest/fastdds/transport/shared_memory/shared_memory.html and https://fast-dds.docs.eprosima.com/en/latest/fastdds/xml_configuration/transports.html; I do not think you need to change any application code.
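A minimal sketch of what that could look like, assuming the standard Fast DDS XML profile mechanism (the file path and the profile/transport names below are placeholders; please verify the element names against the docs linked above):

```bash
# Hypothetical sketch: declare a UDPv4-only transport so the builtin SHM transport is never used.
cat > /tmp/fastdds_no_shm.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <transport_descriptors>
        <transport_descriptor>
            <transport_id>udp_only_transport</transport_id>
            <type>UDPv4</type>
        </transport_descriptor>
    </transport_descriptors>
    <participant profile_name="udp_only_participant" is_default_profile="true">
        <rtps>
            <userTransports>
                <transport_id>udp_only_transport</transport_id>
            </userTransports>
            <!-- do not create the builtin transports (which include SHM) -->
            <useBuiltinTransports>false</useBuiltinTransports>
        </rtps>
    </participant>
</profiles>
EOF

# Point the nodes at the profile before launching; no application code changes needed.
export FASTRTPS_DEFAULT_PROFILES_FILE=/tmp/fastdds_no_shm.xml
```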
FYI: we had numerous issues with FastDDS, so we switched back to CycloneDDS. No more hangs, silent errors, or messages that stop being delivered.
CycloneDDS+Iceoryx is far from perfect, but in our case, after learning a few of its quirks, it seems to be much more reliable.
I leave this issue open as I think it's still a problem with FastDDS, but at the moment we don't plan to investigate it any further.
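In case it helps anyone hitting the same problem, a sketch of how the middleware can be switched (assuming Humble and that the CycloneDDS RMW package is installed):

```bash
# Assumes ros-humble-rmw-cyclonedds-cpp is installed (e.g. via apt).
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
# Optional sanity check: the doctor report lists the active RMW middleware.
ros2 doctor --report
```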
@ksuszka thanks for sharing your experience.
Just FYI, several patches related to the SHM transport and data sharing are staged for the next Humble patch release. (https://github.com/ros2/ros2/pull/1484/files#diff-0b86fcc230a228fb210653f2069d07ee0ab117da02c6471640ae12327835ff4fL37-R37)
Bug report
Required Info:
Steps to reproduce issue
Unfortunately, the issue occurs randomly; it has happened three times so far. We have a system with a few dozen nodes. During testing on a vehicle, the ekf_node from the robot_localization package stopped responding. The process was still running, but it was not discoverable by ros2 tooling and no messages were sent from it. We had already had some issues with messages not being delivered after we switched to FastDDS, but so far we had blamed the SHM transport. This complete hang was something new.
We used strace to check what was going on, but it didn't show any activity in the ekf_node process. So we connected to it with gdb and checked the stack traces of all threads; it looked like all of them were waiting somewhere inside the FastDDS middleware. Here is the output from the gdb session:
I'm not 100% sure that it is caused by the FastDDS middleware or its SHM transport, but we had been using the same ekf_node for the past two years with CycloneDDS without issues and only hit this after switching to FastDDS, and all threads hang inside the DDS middleware, so it is our best guess at the moment.
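For reference, a sketch of the kind of commands we ran against the stuck process (exact invocations may have differed; the PID is a placeholder):

```bash
# strace showed no syscall activity in any thread of the hung ekf_node
strace -f -p <ekf_node_pid>

# attach gdb and dump a backtrace of every thread, then detach
gdb -p <ekf_node_pid> -batch -ex "set pagination off" -ex "thread apply all bt"
```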
Expected behavior
ROS node should work without hangs.
Actual behavior
ROS node stopped responding.
Additional information
Unfortunately, we don't have an easy way to reliably reproduce this issue. We have run the same environment multiple times, and it has only happened three times so far. This gdb analysis is from the third occurrence.
Additionally, it would be great to know what actions/diagnostic tools we should use if we encounter this issue again, to make it easier to diagnose and fix.