ros2 / domain_bridge

Bridge communication across different ROS 2 domains.
Apache License 2.0

test_qos_matching flaky #22

Closed: jacobperron closed this 3 years ago

jacobperron commented 3 years ago

I've noticed that sometimes the Jenkins PR jobs result in test failures, but retriggering often resolves the issue.

We should figure out if there's a real bug or if the tests can be improved to reduce flakiness.

Example failure: https://build.ros2.org/job/Gpr__domain_bridge__ubuntu_focal_amd64/9/testReport/

jacobperron commented 3 years ago

Another instance: https://build.ros2.org/job/Rdev__domain_bridge__ubuntu_focal_amd64/20/

rebecca-butler commented 3 years ago

I've done a bit of investigation into this problem. The flake occurs in tests where we have two publishers on the same topic with different QoS settings. The bridged QoS should use the settings that best match all publishers (e.g. the maximum deadline and lifespan values), but sometimes it only uses the values from the first publisher. The affected tests are qos_matches_max_of_duration_policy and qos_matches_topic_exists_multiple_publishers.

The problem seems to be in get_topic_qos(). This function iterates over the endpoint info for the topic to get the max lifespan and deadline values, but sometimes only one endpoint has been discovered by the time the function is called. In that case it can only use the first publisher's values.
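
To illustrate the race described above, here is a minimal Python sketch of the max-of-durations merge. It is not the actual domain_bridge code (which is C++); EndpointQos and merge_max_durations are hypothetical names, and durations are simplified to seconds as floats.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class EndpointQos:
    """Hypothetical stand-in for one publisher's discovered QoS (seconds)."""
    deadline: float
    lifespan: float


def merge_max_durations(endpoints: Iterable[EndpointQos]) -> Optional[EndpointQos]:
    """Return the largest deadline and lifespan over all visible endpoints.

    Returns None when no endpoint has been discovered yet.
    """
    visible = list(endpoints)
    if not visible:
        return None
    return EndpointQos(
        deadline=max(e.deadline for e in visible),
        lifespan=max(e.lifespan for e in visible),
    )


# Two publishers on the same topic with different QoS settings.
pub_a = EndpointQos(deadline=1.0, lifespan=5.0)
pub_b = EndpointQos(deadline=3.0, lifespan=2.0)

# Both endpoints discovered: the merge picks the max of each policy.
full_view = merge_max_durations([pub_a, pub_b])

# Only the first endpoint discovered so far: the merge can only
# reproduce the first publisher's values, which is the flaky outcome.
partial_view = merge_max_durations([pub_a])
```

With both endpoints visible the result combines pub_b's deadline and pub_a's lifespan; with only pub_a visible the result equals pub_a, matching the failure mode seen in the tests.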

I don't see an immediately obvious solution to this problem. For now, I've added a delay in get_topic_qos() to wait for a bit after the first publisher is found. I'm also checking if the number of endpoints remains the same before and after get_topic_qos() runs, and if a new endpoint has become available, the function is called again to update the QoS settings. This isn't exactly a robust solution, but I haven't seen the flake happen since adding it, so it might be good enough. If the issue does come up again, we can just comment out the problematic tests until we have a better solution.
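
The delay-and-recheck workaround could look roughly like the following Python sketch. This is an illustration under assumptions, not the change that was actually made: qos_with_recheck, list_endpoints, compute_qos, settle_delay and max_attempts are all hypothetical names, and endpoints are reduced to plain deadline values.

```python
import time
from typing import Callable, List


def qos_with_recheck(
    list_endpoints: Callable[[], List[float]],
    compute_qos: Callable[[List[float]], float],
    settle_delay: float = 0.1,
    max_attempts: int = 5,
) -> float:
    """Compute QoS, recomputing if the endpoint count changed meanwhile."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Give discovery a moment to find further publishers (assumed delay).
    time.sleep(settle_delay)
    result = compute_qos(list_endpoints())
    for _ in range(max_attempts - 1):
        count_before = len(list_endpoints())
        result = compute_qos(list_endpoints())
        # Stop once no new endpoint appeared during the computation.
        if len(list_endpoints()) == count_before:
            break
    return result


class FakeGraph:
    """Hypothetical discovery source: second publisher appears late."""

    def __init__(self) -> None:
        self.calls = 0

    def endpoints(self) -> List[float]:
        self.calls += 1
        # Deadlines in seconds; the second publisher shows up on call 3.
        return [1.0] if self.calls < 3 else [1.0, 3.0]


graph = FakeGraph()
deadline = qos_with_recheck(graph.endpoints, max, settle_delay=0.0)
```

Bounding the number of attempts keeps the function from looping indefinitely on a graph that keeps changing. As noted above, this only narrows the race window; a publisher that appears after the final check is still missed.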

rebecca-butler commented 3 years ago

I also discovered a different flaky test while working on this (see #46).