ros2 / ros2_tracing

Tracing tools for ROS 2.
https://docs.ros.org/en/rolling/
Apache License 2.0
139 stars 42 forks source link

:farmer: Flaky test `test_generic_pub_sub__rmw_connextdds` in linux and linux-rhel nightlies #124

Closed Crola1702 closed 3 months ago

Crola1702 commented 3 months ago

Description

test_generic_pub_sub__rmw_connextdds is failing in linux and linux-rhel nightlies. As it's a flaky test, it happens mostly on repeated jobs.

Reference build:

Test regressions:

Log output (see test report full log output):

AssertionError: 2 != 1 : expected 1 event matching rmw_publisher_handle=24180656 in events: <Long output...>
Flaky ratio: job_name last_fail first_fail build_count failure_count failure_percentage
nightly_linux-rhel_repeated 2024-07-02 2024-06-22 15 11 73.33
nightly_linux-rhel_release 2024-07-02 2024-06-27 15 6 40.0
nightly_linux_repeated 2024-07-01 2024-06-22 14 9 64.29
nightly_linux-rhel_debug 2024-07-01 2024-06-22 15 6 40.0
nightly_linux_release 2024-06-30 2024-06-22 15 2 13.33
nightly_linux_debug 2024-06-30 2024-06-30 15 1 6.67
Rci__nightly-connext_ubuntu_noble_amd64 2024-06-29 2024-06-22 11 6 54.55
christophebedard commented 3 months ago

Thanks for opening this issue. I usually look at the nightly CI results from the bot's comment every day, and I've been noticing that a number of these tracing tests have started failing somewhat consistently.

This test_generic_pub_sub test runs the test_generic_ping & test_generic_pong executables, which are pretty much equivalent to the test_pub_sub test and the test_ping & test_pong executables. Then the following sequence is expected:

  1. The *_ping process has a timer that periodically publishes a message on the /ping topic
  2. The *_pong process receives the message from the /ping topic, immediately publishes a message on the /pong topic, and then shuts down
  3. The *_ping process receives the message from the /pong topic and shuts down

Due to potential timing differences in the launch order and all, an initial /ping message has to be be published periodically until the process/node is shut down, otherwise the *_pong process might not get the first/only /ping message, leading to the test hanging.

The test only expects 1 initial /ping message, which has worked in the past*. However, if the machine is too slow, or if the rmw implementation** takes too long to initialize or something, it might need a second message in order to properly trigger the ping->pong sequence. If we look at the timestamps of the 2 ros2:rmw_publish events, they are ~500 ms apart, which is the period of the initial timer: 1719987494798313704 ns - 1719987494298538374 ns = 499.8 ms.

I just need to relax these kinds of assertions and expect >= 1 /ping message instead. I should also do the same for test_pub_sub.

(*) I realize that this is kind of incoherent given the paragraph above :sweat_smile:

(**) we've only recently started running these tests against all rmw implementations (#116)

clalancette commented 3 months ago

Due to potential timing differences in the launch order and all, an initial /ping message has to be be published periodically until the process/node is shut down, otherwise the *_pong process might not get the first/only /ping message, leading to the test hanging.

I'm not sure if this changes the nature of the test, but would using a transient_local topic help? That would ensure that no matter what order ping and pong comes up, pong will always get one message.

christophebedard commented 3 months ago

Huh, I don't know why I didn't think of that :facepalm: yeah, that sounds like the perfect solution. I'll give that a shot.