Closed Crola1702 closed 3 months ago
Thanks for opening this issue. I usually look at the nightly CI results from the bot's comment every day, and I've been noticing that a number of these tracing tests have started failing somewhat consistently.
This `test_generic_pub_sub` test runs the `test_generic_ping` & `test_generic_pong` executables, which are pretty much equivalent to the `test_pub_sub` test and the `test_ping` & `test_pong` executables. Then the following sequence is expected:
1. `*_ping` process has a timer that periodically publishes a message on the `/ping` topic
2. `*_pong` process receives the message from the `/ping` topic, immediately publishes a message on the `/pong` topic, and then shuts down
3. `*_ping` process receives the message from the `/pong` topic and shuts down

Due to potential timing differences in the launch order and all, an initial `/ping` message has to be published periodically until the process/node is shut down, otherwise the `*_pong` process might not get the first/only `/ping` message, leading to the test hanging.
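To make the timing argument concrete, here is a minimal toy model in plain Python (no ROS involved). It assumes a subscription only sees messages published at or after the moment it becomes ready, and that the timer period is 500 ms. The function name `pings_until_received` and its parameters are made up for this illustration.

```python
def pings_until_received(pong_ready_ms, period_ms=500, periodic=True, max_pings=10):
    """Return how many /ping messages are published before pong gets one.

    Toy model: pong only sees messages published at or after the moment
    it is ready. Returns None if pong never receives a message.
    """
    if periodic:
        publish_times = [i * period_ms for i in range(max_pings)]
    else:
        publish_times = [0]  # one-shot publish at startup
    for count, publish_time in enumerate(publish_times, start=1):
        if publish_time >= pong_ready_ms:
            return count
    return None


# pong is ready before the first publish: one message is enough
print(pings_until_received(pong_ready_ms=0))                    # 1
# pong is ready 120 ms late: the second periodic message triggers the sequence
print(pings_until_received(pong_ready_ms=120))                  # 2
# same delay with a one-shot publish: pong never gets it and the test hangs
print(pings_until_received(pong_ready_ms=120, periodic=False))  # None
```

The second case is the one that matters here: the sequence still completes, but only after more than one `/ping` message has been published.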
The test only expects 1 initial `/ping` message, which has worked in the past*. However, if the machine is too slow, or if the `rmw` implementation** takes too long to initialize or something, it might need a second message in order to properly trigger the ping->pong sequence. If we look at the timestamps of the 2 `ros2:rmw_publish` events, they are ~500 ms apart, which is the period of the initial timer: 1719987494798313704 ns - 1719987494298538374 ns = 499.8 ms.
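For anyone who wants to double-check the arithmetic, a quick sketch using the two timestamps quoted above (the variable names are just for illustration):

```python
# Timestamps (ns) of the two ros2:rmw_publish events quoted above
first_publish_ns = 1719987494298538374
second_publish_ns = 1719987494798313704

# Python ints are arbitrary precision, so this subtraction is exact
delta_ns = second_publish_ns - first_publish_ns
delta_ms = delta_ns / 1_000_000

print(delta_ns)            # 499775330
print(round(delta_ms, 1))  # 499.8
```

Note that the subtraction is done on integers before converting to milliseconds. These values are well above 2^53, so converting each timestamp to a float first would lose nanosecond precision.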
I just need to relax these kinds of assertions and expect >= 1 `/ping` message instead. I should also do the same for `test_pub_sub`.
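Roughly, the relaxed check could look like the sketch below. This is not the actual test code: it uses plain `unittest` rather than the real test helpers, and the event dicts and `count_ping_publications` are hypothetical stand-ins for a list of trace events that has already been filtered down to the `/ping` publisher.

```python
import unittest


def count_ping_publications(events):
    """Count ros2:rmw_publish events in a list of hypothetical event dicts."""
    return sum(1 for event in events if event['name'] == 'ros2:rmw_publish')


class TestPingCount(unittest.TestCase):

    def test_at_least_one_ping(self):
        # Hypothetical trace where the timer fired twice before pong replied
        events = [
            {'name': 'ros2:rmw_publish'},
            {'name': 'ros2:rmw_publish'},
        ]
        # The strict form, assertEqual(count, 1), would fail on this trace
        self.assertGreaterEqual(count_ping_publications(events), 1)
```

With `assertGreaterEqual`, the test passes whether the sequence was triggered by the first `/ping` message or by a later one.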
(*) I realize that this is kind of incoherent given the paragraph above :sweat_smile:
(**) we've only recently started running these tests against all `rmw` implementations (#116)
> Due to potential timing differences in the launch order and all, an initial `/ping` message has to be published periodically until the process/node is shut down, otherwise the `*_pong` process might not get the first/only `/ping` message, leading to the test hanging.
I'm not sure if this changes the nature of the test, but would using a `transient_local` topic help? That would ensure that no matter what order `ping` and `pong` come up, `pong` will always get one message.
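To illustrate the idea, here is a toy model in plain Python of how a late-joining subscriber behaves with and without `transient_local` durability. It is only a sketch of the concept, not the real rmw/DDS behavior; `ToyTopic` and `pong_inbox_after_late_start` are made-up names, and the model keeps a history depth of 1.

```python
class ToyTopic:
    """Toy model of a topic; not the real rmw/DDS implementation."""

    def __init__(self, transient_local):
        self.transient_local = transient_local
        self.last_message = None
        self.subscribers = []

    def publish(self, message):
        # Remember the last message and deliver to current subscribers
        self.last_message = message
        for inbox in self.subscribers:
            inbox.append(message)

    def subscribe(self):
        inbox = []
        # With transient_local durability, a late joiner still receives
        # the last message that was published before it subscribed
        if self.transient_local and self.last_message is not None:
            inbox.append(self.last_message)
        self.subscribers.append(inbox)
        return inbox


def pong_inbox_after_late_start(transient_local):
    topic = ToyTopic(transient_local)
    topic.publish('ping')     # ping publishes once, before pong is up
    return topic.subscribe()  # pong subscribes afterwards


print(pong_inbox_after_late_start(transient_local=False))  # []
print(pong_inbox_after_late_start(transient_local=True))   # ['ping']
```

In a real setup, both the publisher and the subscription would need to request transient local durability for the late joiner to get the stored message. In `rclpy` that would be `DurabilityPolicy.TRANSIENT_LOCAL` from `rclpy.qos`; whether and how the ping/pong executables would set it is not shown in this thread.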
Huh, I don't know why I didn't think of that :facepalm: yeah, that sounds like the perfect solution. I'll give that a shot.
Description
test_generic_pub_sub__rmw_connextdds is failing in linux and linux-rhel nightlies. As it's a flaky test, it happens mostly on repeated jobs.
Reference build:
Test regressions:
Log output (see test report full log output):