Crola1702 commented 3 months ago

Description

test_generic_pub_sub__rmw_connextdds is failing in linux and linux-rhel nightlies. As it's a flaky test, it happens mostly on repeated jobs.

Reference build:

Test regressions:

Log output (see test report full log output):

AssertionError: 2 != 1 : expected 1 event matching rmw_publisher_handle=24180656 in events: <Long output...>

Flaky ratio: job_name	last_fail	first_fail	build_count	failure_count	failure_percentage
nightly_linux-rhel_repeated	2024-07-02	2024-06-22	15	11	73.33
nightly_linux-rhel_release	2024-07-02	2024-06-27	15	6	40.0
nightly_linux_repeated	2024-07-01	2024-06-22	14	9	64.29
nightly_linux-rhel_debug	2024-07-01	2024-06-22	15	6	40.0
nightly_linux_release	2024-06-30	2024-06-22	15	2	13.33
nightly_linux_debug	2024-06-30	2024-06-30	15	1	6.67
Rci__nightly-connext_ubuntu_noble_amd64	2024-06-29	2024-06-22	11	6	54.55

christophebedard commented 3 months ago

Thanks for opening this issue. I usually look at the nightly CI results from the bot's comment every day, and I've been noticing that a number of these tracing tests have started failing somewhat consistently.

This test_generic_pub_sub test runs the test_generic_ping & test_generic_pong executables, which are pretty much equivalent to the test_pub_sub test and the test_ping & test_pong executables. Then the following sequence is expected:

The *_ping process has a timer that periodically publishes a message on the /ping topic
The *_pong process receives the message from the /ping topic, immediately publishes a message on the /pong topic, and then shuts down
The *_ping process receives the message from the /pong topic and shuts down

Due to potential timing differences in the launch order and all, an initial /ping message has to be be published periodically until the process/node is shut down, otherwise the *_pong process might not get the first/only /ping message, leading to the test hanging.

The test only expects 1 initial /ping message, which has worked in the past*. However, if the machine is too slow, or if the rmw implementation** takes too long to initialize or something, it might need a second message in order to properly trigger the ping->pong sequence. If we look at the timestamps of the 2 ros2:rmw_publish events, they are ~500 ms apart, which is the period of the initial timer: 1719987494798313704 ns - 1719987494298538374 ns = 499.8 ms.

I just need to relax these kinds of assertions and expect >= 1 /ping message instead. I should also do the same for test_pub_sub.

(*) I realize that this is kind of incoherent given the paragraph above :sweat_smile:

(**) we've only recently started running these tests against all rmw implementations (#116)

clalancette commented 3 months ago

Due to potential timing differences in the launch order and all, an initial /ping message has to be be published periodically until the process/node is shut down, otherwise the *_pong process might not get the first/only /ping message, leading to the test hanging.

I'm not sure if this changes the nature of the test, but would using a transient_local topic help? That would ensure that no matter what order ping and pong comes up, pong will always get one message.

christophebedard commented 3 months ago

Huh, I don't know why I didn't think of that :facepalm: yeah, that sounds like the perfect solution. I'll give that a shot.

ros2 / ros2_tracing

:farmer: Flaky test `test_generic_pub_sub__rmw_connextdds` in linux and linux-rhel nightlies #124

Description