ros2 / demos

Apache License 2.0
491 stars 329 forks source link

:farmer: Flaky test `demo_nodes_cpp.TestExecutablesTutorial.test_processes_output` failing variations from connextdds #693

Open Crola1702 opened 3 months ago

Crola1702 commented 3 months ago

Bug report

Required Info:

Steps to reproduce issue

  1. Run a build in a Linux or Windows nightly job
  2. See the test regression fail

Description

There is a parent test regression with different variations on ConnextDDS: demo_nodes_cpp.TestExecutablesTutorial.test_processes_output

Failing test regressions:

Log output (test_tutorial_parameter_events_async__rmw_connextdds): ``` FAIL: test_processes_output (demo_nodes_cpp.TestExecutablesTutorial.test_processes_output) Test all processes output against expectations. ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/build/demo_nodes_cpp/test_parameter_events_async__rmw_connextdds_Debug.py", line 74, in test_processes_output proc_output.assertWaitFor( File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/launch_testing/lib/python3.12/site-packages/launch_testing/io_handler.py", line 146, in assertWaitFor assert success, 'Waiting for output timed out' ^^^^^^^ AssertionError: Waiting for output timed out ```
Log output (test_tutorial_parameter_events__rmw_connextdds): ``` FAIL: test_processes_output (demo_nodes_cpp.TestExecutablesTutorial) Test all processes output against expectations. ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\ci\ws\build\demo_nodes_cpp\test_parameter_events__rmw_connextdds_Release.py", line 74, in test_processes_output proc_output.assertWaitFor( File "C:\ci\ws\install\Lib\site-packages\launch_testing\io_handler.py", line 146, in assertWaitFor assert success, 'Waiting for output timed out' AssertionError: Waiting for output timed out ```

Flakiness report (projectroot.test_tutorial_parameter_events_async__rmw_connextdds):

job_name last_fail first_fail build_count failure_count failure_percentage
nightly_linux_debug 2024-05-22 2024-05-09 15 7 46.67
nightly_linux_repeated 2024-05-19 2024-05-08 16 2 12.5
nightly_win_deb 2024-05-15 2024-05-07 9 5 55.56

Flakiness report (projectroot.test_tutorial_parameter_events__rmw_connextdds):

job_name last_fail first_fail build_count failure_count failure_percentage
nightly_win_rep 2024-05-22 2024-05-07 13 13 100.0
nightly_win_rel 2024-05-22 2024-05-22 14 1 7.14
nightly_linux-rhel_repeated 2024-05-21 2024-05-12 16 3 18.75
nightly_win_deb 2024-05-12 2024-05-09 9 2 22.22

I don't see any specific change that points to a reason why it started failing more in normal jobs, and not only repeated ones (package history).

projectroot.test_tutorial_parameter_events_async__rmw_connextdds

projectroot.test_tutorial_parameter_events__rmw_connextdds:

fujitatomoya commented 3 months ago

I did take a look at this flaky test issue with connextdds.

1st, this cannot be reproducible with my local dev environment...

2nd, 30 seconds should be long enough to receive the parameter events,

https://github.com/ros2/demos/blob/rolling/demo_nodes_cpp/test/test_executables_tutorial.py.in#L74-L78

both cases https://ci.ros2.org/view/nightly/job/nightly_linux_debug/3064/testReport/junit/(root)/projectroot/test_tutorial_parameter_events_async__rmw_connextdds/ and https://ci.ros2.org/view/nightly/job/nightly_win_rep/3362/testReport/junit/(root)/projectroot/test_tutorial_parameter_events__rmw_connextdds/ missed the last 2 events below.

https://github.com/ros2/demos/blob/4a18995bb7ef52ff40c7892005dca3a68d64fff0/demo_nodes_cpp/test/parameter_events.txt#L58-L69

3rd, QoS for parameter events is reliable enough, https://github.com/ros2/rmw/blob/22f59f8931944999864ef3b0d7aa75ab7258f028/rmw/include/rmw/qos_profiles.h#L77-L88

after all, i could not find why last 2 events are missing here only connextdds, probably connextdds misses those messages already.