ros2 / rmw_connextdds

ROS 2 RMW layer for RTI Connext DDS Professional and RTI Connext DDS Micro.
Apache License 2.0

Numerous test failures for Nav2 when using rmw_connextdds #21

Open ruffsl opened 3 years ago

ruffsl commented 3 years ago

System Info

Bug Description

Numerous failures in Nav2 when testing with rmw_connextdds that don't occur with other RMWs. To list a few:

```
[controller_server-7] [WARN] [1617199619.684671125] [rmw_dds_common]: nanoseconds value too large for 32-bits, saturated at UINT_MAX
[controller_server-7] [WARN] [1617199619.684736997] [rmw_dds_common]: nanoseconds value too large for 32-bits, saturated at UINT_MAX
[controller_server-7] [WARN] [1617199619.684755779] [rmw_dds_common]: nanoseconds value too large for 32-bits, saturated at UINT_MAX
...
```

```
[planner_server-8] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[planner_server-8] This error state is being overwritten:
[planner_server-8]
[planner_server-8]   'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rosidl_typesupport/rosidl_typesupport_cpp/src/type_support_dispatch.hpp:111'
[planner_server-8]
[planner_server-8] with this new error message:
[planner_server-8]
[planner_server-8]   'Expected lexeme type (19) not found, search ended at index 21, at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/lexer_lookahead.c:233'
[planner_server-8]
[planner_server-8] rcutils_reset_error() should be called after error handling to avoid this.
[planner_server-8] <<<
```

```
[gzserver-1] [ERROR] [1617200184.378385234] [rmw_connextdds]: cannot delete and wait on the same object
[gzserver-1] [ERROR] [1617200184.381227447] [rmw_connextdds]: cannot delete and wait on the same object
[gzserver-1] [ERROR] [1617200184.381472426] [rmw_connextdds]: cannot delete and wait on the same object
```

See test failure logs for complete error messages:

https://app.circleci.com/pipelines/github/ros-planning/navigation2/5045/workflows/062fb8a4-bf43-4bbe-8dbe-4cf54a698abc/jobs/18610

Expected Behavior

No test failures, as is the case with the other alternative RMWs.

How to Reproduce

```
export IMAGE_NAME=rosplanning/navigation2:main.release
export FROM_IMAGE=osrf/ros2:nightly-rmw-nonfree

docker pull $IMAGE_NAME

# OR

git clone https://github.com/ros-planning/navigation2.git
cd navigation2
docker build \
    --tag ${IMAGE_NAME} \
    --build-arg FROM_IMAGE \
    --file Dockerfile .
mkdir -p /tmp/overlay_ws/log
docker run -it --rm \
    -v /tmp/overlay_ws/log:/opt/overlay_ws/log \
    rosplanning/navigation2:main.release \
    bash
export RMW_IMPLEMENTATION=rmw_connextdds
colcon test
colcon test-result --verbose
cd /tmp/overlay_ws/log/latest_test/
```

Workarounds

Switching to alternative RMWs such as rmw_fastrtps_cpp or rmw_cyclonedds_cpp.

Additional context

With the recent release announcement, the nav2 project has re-added rmw_connextdds to its RMW matrix of nightly tests.

asorbini commented 3 years ago

Hi @ruffsl, thank you for this report, I will try to reproduce locally and start investigating.

Meanwhile, a couple of comments:

* An abnormal number of rmw_dds_common warnings are encountered during testing of nav2:

These warnings are printed by this utility function in rmw_dds_common, which was recently introduced and may not be used by other RMWs yet (perhaps because they don't need it or implement the conversion themselves).

One solution would be to downgrade the warning in clamp_rmw_time_to_dds_time(), but I also wonder why these overflows are happening in the first place. In DDS, the nanosec field of a Duration is never expected to be >= 1s, so overflows are never expected on a "normalized" value. See DDS v1.4 section 2.3.2:

The two types used to represent time: Duration_t and Time_t are mapped into structures that contain fields for the second and the nanosecond parts. These types are further constrained to always use a ‘normalized’ representation for the time, that is, the nanosec field must verify 0 <= nanosec < 1000000000.

I wonder if this is not the case in the ROS 2 API, in which case there is probably more work to be done to properly convert rmw_time_t into DDS_Duration_t than just "clamping" the value...
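For reference, a minimal sketch of the kind of normalization-based conversion suggested above. The struct names mirror rmw_time_t and DDS_Duration_t but are stand-ins, and the handling of the "infinite" sentinel is only indicated in a comment rather than tied to the actual RMW_DURATION_INFINITE value:

```cpp
#include <cstdint>
#include <limits>

struct RmwTime { uint64_t sec; uint64_t nsec; };        // stand-in for rmw_time_t
struct DdsDuration { int32_t sec; uint32_t nanosec; };  // stand-in for DDS_Duration_t

constexpr uint64_t NSEC_PER_SEC = 1000000000;

DdsDuration to_dds_duration(const RmwTime & t)
{
  // A real implementation would first check for the RMW "infinite" sentinel
  // (RMW_DURATION_INFINITE) and map it to DDS' infinite duration.

  // Normalize: fold whole seconds out of the nanoseconds field so that
  // 0 <= nanosec < 10^9, as the DDS specification requires.
  const uint64_t sec = t.sec + t.nsec / NSEC_PER_SEC;
  const uint32_t nsec = static_cast<uint32_t>(t.nsec % NSEC_PER_SEC);

  // Saturate only if the normalized seconds no longer fit in the 32-bit field.
  if (sec > static_cast<uint64_t>(std::numeric_limits<int32_t>::max())) {
    return {std::numeric_limits<int32_t>::max(),
            static_cast<uint32_t>(NSEC_PER_SEC - 1)};
  }
  return {static_cast<int32_t>(sec), nsec};
}
```

With a conversion along these lines, the warning would only be needed when the seconds themselves overflow, which in practice is almost certainly an "infinite" request.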

* Expected lexeme type not found as handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library:

This is something that I've seen before show up in some unit test logs, but I haven't been able to understand/resolve yet. I think it has to be related to this call in rmw_graph.cpp (because there is no other mention of rosidl_typesupport_cpp anywhere else in the repository) but I'm not sure if that call has any problems, because this error doesn't show up under "normal conditions".

* Cannot delete and wait on the same object

I personally think rmw_connextdds is (at least partially) right here in forbidding this operation, unless of course there is a bug in the implementation, but let me try to explain why.

The error comes from here, and it's triggered because an object being deleted (a "publisher" or a "subscriber", maybe owned by a "client" or "service", or maybe a "guard condition") is still marked as attached to a waitset that is currently in the midst of beginning a wait(). This could be (what I would consider) a user error, because they are trying to delete something while, e.g., an executor is "spinning on it", but it may also happen because the waitset implementation based on DDS WaitSets tries to compensate for the lack of attach/detach in the ROS 2 API by "caching" the attached references, and only detaching them upon the next wait() (or upon deletion of one of the attached objects).

Now, if you think the deletion is legitimate (i.e. whatever object is being deleted is not currently associated with a waitset), then there might be a problem in the current implementation of RMW_Connext_WaitSet::invalidate(). Instead of immediately returning an error like it does now, invalidate() should wait for the concurrent wait() call to transition to the next state (from ACQUIRING to BLOCKED), and only return an error if the condition that is invalidating the WaitSet is still attached after that transition (if the object was not part of the current set of "waitables", the concurrent wait() will have cleared it during that transition). I'll investigate this "race condition" while reproducing locally, but it would be great if you could confirm that the object deletion doesn't have any of the possible problems that I mentioned.
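To make the intended behavior a bit more concrete, here is a minimal sketch of an invalidate() that waits out the ACQUIRING phase before deciding whether the deletion is safe. All names are hypothetical; this is not the actual rmw_connextdds code:

```cpp
#include <condition_variable>
#include <mutex>
#include <set>

enum class WaitSetState { IDLE, ACQUIRING, BLOCKED };

class SketchWaitSet
{
public:
  // Called by a concurrent wait(): mark the waitset as collecting conditions.
  void begin_acquire()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    state_ = WaitSetState::ACQUIRING;
  }

  // Called by the same wait() once it knows the set of objects it will block
  // on; objects previously "cached" but no longer waited on are dropped here.
  void finish_acquire(const std::set<const void *> & conditions)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    attached_ = conditions;
    state_ = WaitSetState::BLOCKED;
    state_cv_.notify_all();
  }

  // Called when an attached object is about to be deleted. Returns true if
  // deletion can proceed, false if the object is still being waited on
  // ("cannot delete and wait on the same object").
  bool invalidate(const void * condition)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    // Wait out the ACQUIRING phase: the concurrent wait() may still drop the
    // condition from its set while transitioning to BLOCKED.
    state_cv_.wait(lock, [this] { return state_ != WaitSetState::ACQUIRING; });
    return attached_.count(condition) == 0;
  }

private:
  std::mutex mutex_;
  std::condition_variable state_cv_;
  WaitSetState state_ = WaitSetState::IDLE;
  std::set<const void *> attached_;
};
```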

asorbini commented 3 years ago

This is something that I've seen before show up in some unit test logs, but I haven't been able to understand/resolve yet. I think it has to be related to this call in rmw_graph.cpp (because there is no other mention of rosidl_typesupport_cpp anywhere else in the repository) but I'm not sure if that call has any problems, because this error doesn't show up under "normal conditions".

I think I may have found the reason for the rosidl_typesupport_cpp error messages and the associated fix.

asorbini commented 3 years ago

I'm still investigating, but one thing I've noticed is that the default implementation used by rmw_connextdds for WaitSets (based on DDS WaitSets) might be generating too much CPU usage, and causing some tests to fail.

I'm currently focusing on nav2_system_tests and, while I still have failures, I get much better results when I compile the RMW with an alternative WaitSet implementation based on the standard C++ library. Still getting to the bottom of it, but I'm starting to suspect the resolution to this issue will include switching that implementation.

asorbini commented 3 years ago

I've been doing some extensive testing locally (by building a "minimal" Rolling tree and the nav2 stack in Debug mode), and created #22 to address some of the issues that I was able to identify.

I have been running tests from nav2_system_tests with them and was able to see all tests reported failing by CI pass at least a few times each. So far, I have also been running tests with some larger default timeouts to help me debug other non-time related issues (server_timeout from 10ms to 10s, bond_timeout from 4s to 10s).

In general, the failures seem to me to be mostly related to:

There is also the possibility for these tests to run into a discovery race condition that exists for all "volatile" ROS 2 clients: there is a fundamental race condition in the client operation is_service_available() which stems from DDS's peer-to-peer nature and the way its discovery phase works. When a client detects a remote service as "alive" because both its request writer and reply reader have been matched by the corresponding entities in a remote application, there is no guarantee that the same is also true for the remote application. For this reason, a client using "volatile" durability (the default in ROS 2) runs the risk of being blocked indefinitely for a reply that will never come, if it sends a request right after a service is locally detected as enabled but the service's request reader has not yet been matched locally. The message will not be repaired by the writer, because of the "volatile" durability.

There is also another potential race condition which may lead to a reply not being sent by a service, because the local writer had not yet matched the client's reader when the reply was written. This is addressed by the DDS-RPC specification by prescribing that the "reply writer" always be matched by an implementation before its associated "request reader" when matching the endpoints of a remote client, but this requires the propagation of additional information via discovery and, as far as I know, no DDS vendor implements this behavior yet.

The solution to the first race condition is, imho, simple: always use at least "transient local" durability for clients. But maybe I'm not considering some adverse effect of that decision.
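For illustration only, a rough sketch of what that could look like from the application side, assuming the rmw_qos_profile_t overload of create_client() (service name and type are placeholders). Note that the service side would also need a compatible durability on its reply writer, otherwise the endpoints would fail to match:

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_srvs/srv/trigger.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("durable_client_example");

  // Start from the default services profile and upgrade durability so that
  // a request written before the remote request reader is matched can still
  // be delivered once the match completes.
  rmw_qos_profile_t qos = rmw_qos_profile_services_default;
  qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;

  auto client = node->create_client<std_srvs::srv::Trigger>("example_service", qos);

  // ... wait_for_service() / async_send_request() as usual ...
  rclcpp::shutdown();
  return 0;
}
```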

Anyway, these race conditions should only be relevant for applications which create and delete clients very often, since that triggers discovery and matching. This shouldn't be the case for a "typical" DDS application (because endpoint creation/deletion is an expensive operation, with system-wide consequences because of discovery, and should not be carried out continuously if possible), but I'm not sure if the same is true for a ROS 2 application making extensive use of actions, like the nav2 components seem to be (I'm not that familiar with the client API exposed by rclcpp and how it's used by a typical ROS 2 application).

I was able to confirm that the issues with "slow discovery" and "slow reliability" can be mitigated by tuning the QoS via XML (since the required parameters are not exposed by ROS). Sending initial participant announcements more frequently makes components able to discover each other more quickly, while shortening the "heartbeat period" helps make sure that writers are quicker to repair missed samples.

I did some experiments running with the XML configuration below saved as NDDS_QOS_PROFILES.xml in the workspace directory where I ran the tests. The configuration might give some better results than the "out of the box" experience, particularly for speeding up discovery. I think the actual "optimal" solution though can only be to craft a configuration file that specifies the correct parameters for each topic (using "topic filters") instead of changing defaults for all reliable topics, which I'm pretty sure causes a lot of unnecessary work to dispatch periodic messages for unused topics.

At this point, I'm validating the changes in #22 and look forward to having them in master ASAP.

After that, I guess we'll run them through another set of your CI and see what comes out of it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://community.rti.com/schema/6.0.1/rti_dds_profiles.xsd" version="6.0.1">
    <!-- Qos Library -->
    <qos_library name="Issue21">
      <qos_profile name="BaseProfile" base_name="BuiltinQosLib::Generic.Common" is_default_qos="true">
        <participant_qos>
          <discovery_config>
            <initial_participant_announcements>10</initial_participant_announcements>
            <min_initial_participant_announcement_period>
              <sec>0</sec>
              <nanosec>200000000</nanosec>
            </min_initial_participant_announcement_period>
            <max_initial_participant_announcement_period>
              <sec>0</sec>
              <nanosec>400000000</nanosec>
            </max_initial_participant_announcement_period>
          </discovery_config>
        </participant_qos>
        <datareader_qos>
          <protocol>
            <rtps_reliable_reader>
              <min_heartbeat_response_delay>
                <sec>0</sec>
                <nanosec>0</nanosec>
              </min_heartbeat_response_delay>
              <max_heartbeat_response_delay>
                <sec>0</sec>
                <nanosec>0</nanosec>
              </max_heartbeat_response_delay>
            </rtps_reliable_reader>
          </protocol>
        </datareader_qos>
        <datawriter_qos>
          <protocol>
            <rtps_reliable_writer>
              <low_watermark>0</low_watermark>
              <high_watermark>2</high_watermark>
              <heartbeats_per_max_samples>100</heartbeats_per_max_samples>
              <min_send_window_size>10</min_send_window_size>
              <max_send_window_size>10</max_send_window_size>
              <max_heartbeat_retries>500</max_heartbeat_retries>
              <heartbeat_period>
                <sec>0</sec>
                <nanosec>10000000</nanosec>
              </heartbeat_period>
              <fast_heartbeat_period>
                <sec>0</sec>
                <nanosec>500000</nanosec>
              </fast_heartbeat_period>
              <late_joiner_heartbeat_period>
                <sec>0</sec>
                <nanosec>500000</nanosec>
              </late_joiner_heartbeat_period>
              <max_nack_response_delay>
                <sec>0</sec>
                <nanosec>0</nanosec>
              </max_nack_response_delay>
            </rtps_reliable_writer>
          </protocol>
        </datawriter_qos>
      </qos_profile>
    </qos_library>
</dds>
```

asorbini commented 3 years ago

I forgot to mention that the issue with the "typesupport" error is caused by an error message in rosidl_typesupport_cpp which should really be a debug statement at best.

I had opened a PR to remove it (rosidl_typesupport#107) but closed it for now, because it seems like @clalancette already has one (draft) open to address the same issue (rosidl_typesupport#102).

The other one about "saturated time" will be removed by the alternative WaitSet implementation. It is nevertheless curious, because I kept seeing it pop up with the other implementation even after re-implementing the logic to convert rmw_time_t to DDS_Duration_t and making it accept RMW_DURATION_INFINITE, which I assumed was the culprit for the "extremely large values". It looked like some of the components still passed these unexpectedly large values, and I'm not sure why (I would think that, 99 times out of 100 at least, an application that needs to wait for something larger than [INT_MAX sec, 10^9-1 nanosec] should actually be waiting for "infinite", but maybe I'm wrong...).

asorbini commented 3 years ago

@ruffsl we merged #22 to address some of the issues that you reported. The commit should have been included in the latest nightly, but I'm not sure if it was picked up by the nav2 CI (https://app.circleci.com/pipelines/github/ros-planning/navigation2/5175/workflows/79e99820-07e8-420c-94f2-bfba2a356c32/jobs/19071) because I would have expected at least the Handle's typesupport identifier (rosidl_typesupport_cpp) messages to have been resolved (but it could also be that those issues haven't been resolved yet...).

I doubt things will ever work "out of the box" with the default reliability protocol settings, so I doubt the CI results will get much greener, but I would appreciate some feedback on the latest modifications.

It would be helpful if you (or anyone with deeper knowledge of the nav2 stack than me) could take a little time to test things out with custom reliability settings. There might be further, more severe, functional issues that are causing the tests to fail, but my limited knowledge of the inner workings of the stack makes it hard to differentiate failures caused by suboptimal reliability settings from failures caused by underlying issues in the RMW. For example, I don't know which topics are created by each application and which might be related to a particular failure. Without that information, it is hard/impossible for me to tune the QoS of a specific topic to see if some particular setting makes a test pass.

Looking forward to any input you may have on the matter. I'd be happy to continue to work towards getting all tests to pass, but at this point I'm not sure how to proceed to squash the remaining issues.

ruffsl commented 3 years ago

The commit should have been included in the latest nightly, but I'm not sure if it was picked up by the nav2 CI

With the cron job cadence between ci.ros2.org, the building of the upstream osrf/ros2:nightly parent image, and the downstream rosplanning/navigation2:master.<mode> image for our CI, it takes about a day for changes to propagate, as the cron job for nav2 is offset to avoid rebuilding CI images in the middle of working hours for most nav2 maintainers. Let's check again tomorrow.

Looking forward to any input you may have on the matter. I'd be happy to continue to work towards getting all tests to pass, but at this point I'm not sure how to proceed to squash the remaining issues.

Thanks for all the contributions thus far. Once the changes have propagated and we have some new nightly job results to inspect, I'll dig through the test logs again to see what may be performing differently between tests within the matrix of RMW jobs.

asorbini commented 3 years ago

@ruffsl looks like there were some clear improvements ✌🏼 https://app.circleci.com/pipelines/github/ros-planning/navigation2/5182/workflows/6c36e959-ed6c-4c22-ae1d-e07ac1418e13/jobs/19097

ruffsl commented 3 years ago

looks like there were some clear improvements ✌🏼

Indeed, from 15 failures down to just 5; a great improvement. Skimming through the error logs, it seems like there may be some liveliness issues with bond, like missed heartbeats, as well as dropped messages, like those for configuring the costmaps.

```
[lifecycle_manager-4] [INFO] [1618496230.369896380] [lifecycle_manager_filters]: Have not received a heartbeat from filter_mask_server.
[lifecycle_manager-4] [ERROR] [1618496230.370132736] [lifecycle_manager_filters]: CRITICAL FAILURE: SERVER filter_mask_server IS DOWN after not receiving a heartbeat for 4000 ms. Shutting down related nodes.
[lifecycle_manager-1] [ERROR] [1618493536.178024540] [lifecycle_manager_test]: Server bond_tester was unable to be reached after 4.00s by bond. This server may be misconfigured.
[lifecycle_manager-1] [ERROR] [1618493536.178455902] [lifecycle_manager_test]: Failed to bring up all requested nodes. Aborting bringup.
[planner_server-11] [WARN] [1618496250.011785754] [global_costmap.global_costmap]: KeepoutFilter: Filter mask was not received
[controller_server-10] [WARN] [1618496251.608839039] [local_costmap.local_costmap]: KeepoutFilter: Filter mask was not received
...
[tester_node-17] [ERROR] [1618496252.661153071] [nav2_tester]: Pose (-1.2776381146555877, -0.5528456233865882) belongs to keepout zone
[tester_node-17] [ERROR] [1618496252.661485470] [nav2_tester]: Path plan intersects with keepout zone
```

@SteveMacenski, are there any current tickets that discuss these symptoms, or do they not show up with other RMWs? Any suggestions on narrowing down the cause?

SteveMacenski commented 3 years ago

I have never seen this with any other middlewares. I think if we had, I'd be flooded with issues from users (as we have many using both Fast-DDS and Cyclone). Over the ~1 year the bond connections have been in Nav2, I have yet to see it ever trigger on me incorrectly.

asorbini commented 3 years ago

```
[lifecycle_manager-4] [INFO] [1618496230.369896380] [lifecycle_manager_filters]: Have not received a heartbeat from filter_mask_server.
[lifecycle_manager-4] [ERROR] [1618496230.370132736] [lifecycle_manager_filters]: CRITICAL FAILURE: SERVER filter_mask_server IS DOWN after not receiving a heartbeat for 4000 ms. Shutting down related nodes.
[lifecycle_manager-1] [ERROR] [1618493536.178024540] [lifecycle_manager_test]: Server bond_tester was unable to be reached after 4.00s by bond. This server may be misconfigured.
[lifecycle_manager-1] [ERROR] [1618493536.178455902] [lifecycle_manager_test]: Failed to bring up all requested nodes. Aborting bringup.
```

These might have something to do with delays in discovery caused by the default participant announcement settings (5 initial announcements, sent once every second, and then once every 30s; see the API docs for the relevant QoS). They may also be related to the slow reliability protocol settings that Connext uses by default, which only send an RTPS Heartbeat every 3s and are definitely suboptimal (see #26).

I suspect the latter to be more relevant to these issues, since it addresses "user endpoints", and because I think (and please correct me if I'm wrong) that these "heartbeat" and "bring up" phases are application-level events (i.e. implemented by nav2), driven by some topics and clients/services (i.e. the components expect to communicate over some ROS 2 entities and replies are not received within timeouts), more than relating to DDS-level discovery issues (which I would expect to occur even with default settings within the 4s timeout).

```
[planner_server-11] [WARN] [1618496250.011785754] [global_costmap.global_costmap]: KeepoutFilter: Filter mask was not received
[controller_server-10] [WARN] [1618496251.608839039] [local_costmap.local_costmap]: KeepoutFilter: Filter mask was not received
...
[tester_node-17] [ERROR] [1618496252.661153071] [nav2_tester]: Pose (-1.2776381146555877, -0.5528456233865882) belongs to keepout zone
[tester_node-17] [ERROR] [1618496252.661485470] [nav2_tester]: Path plan intersects with keepout zone
```

I have seen this error occur even with faster reliability settings, and I haven't been able to determine the exact reason for it. I have also seen it fail with other RMW implementations on my local machine. I'm not sure what the symptom of "plan intersects with keepout zone" says about underlying issues related to middleware entities (e.g. a data-reader might not be receiving data in time, or some reply from a service is lost...). Also, there might be multiple processes responsible for avoiding that error, but I'm not sure which ones these would be among those created by the test, so it would help me focus my investigation if you could tell me a bit more about why this error would show up in general, and if you have any more insight on this specific failure (e.g. if you can pinpoint some clearly misbehaving component from the logs).

ivanpauno commented 3 years ago

I suspect the latter to be more relevant to these issues, since it addresses "user endpoints", and because I think (and please correct me if I'm wrong) that these "heartbeat" and "bring up" phases are application-level events (i.e. implemented by nav2), driven by some topics and clients/services (i.e. the components expect to communicate over some ROS 2 entities and replies are not received within timeouts), more than relating to DDS-level discovery issues (which I would expect to occur even with default settings within the 4s timeout).

Yes!! This doesn't sound like a discovery issue (bond has its own heartbeats), and it's pretty clearly a reliability settings issue (I was pretty surprised when I read "3s").

P.S.: I don't understand why bond was even ported to ROS 2. It made sense in ROS 1, but DDS already has a lot of heartbeating. Maybe we should have some "node" liveliness callbacks that get called when a node gets "disconnected". Though we already have publisher liveliness callbacks, which could be used for a similar purpose.
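As a rough sketch of that last idea (not an endorsement of replacing bond), assuming the rclcpp liveliness QoS and per-entity event callbacks; topic and node names are placeholders:

```cpp
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/empty.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("liveliness_example");

  // Liveliness must be asserted manually within the lease duration; otherwise
  // matched readers are notified that the writer is no longer alive.
  rclcpp::QoS qos(rclcpp::KeepLast(1));
  qos.liveliness(RMW_QOS_POLICY_LIVELINESS_MANUAL_BY_TOPIC);
  qos.liveliness_lease_duration(rmw_time_t{1, 0});  // 1s lease

  rclcpp::PublisherOptions pub_options;
  pub_options.event_callbacks.liveliness_callback =
    [](rclcpp::QOSLivelinessLostInfo & /*info*/) {
      // The publisher failed to assert liveliness within the lease.
    };
  auto pub = node->create_publisher<std_msgs::msg::Empty>("heartbeat", qos, pub_options);

  rclcpp::SubscriptionOptions sub_options;
  sub_options.event_callbacks.liveliness_callback =
    [logger = node->get_logger()](rclcpp::QOSLivelinessChangedInfo & info) {
      RCLCPP_INFO(
        logger, "alive writers: %d (change: %d)",
        info.alive_count, info.alive_count_change);
    };
  auto sub = node->create_subscription<std_msgs::msg::Empty>(
    "heartbeat", qos, [](const std_msgs::msg::Empty::SharedPtr) {}, sub_options);

  // Assert liveliness periodically instead of publishing bond messages.
  auto timer = node->create_wall_timer(
    std::chrono::milliseconds(500), [pub]() { pub->assert_liveliness(); });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```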

SteveMacenski commented 3 years ago

I don't understand why bond was even ported to ROS 2.

Because I ported it, since the DDS equivalents weren't exposed yet.