ros2 / launch_ros

Tools for launching ROS nodes and for writing tests involving ROS nodes.
Apache License 2.0
59 stars 74 forks source link

:man_farmer: `launch_testing_ros` regressions `wait_for_topic_launch_test` on linux #304

Closed Blast545 closed 3 weeks ago

Blast545 commented 2 years ago

Bug report

Required Info:

Steps to reproduce issue

Run a buildfarm job on nightly_linux_debug

Expected behavior

All tests pass.

Actual behavior

Various launch_testing_ros tests fail and throw exceptions. See: https://ci.ros2.org/view/nightly/job/nightly_linux_debug/2251/

Additional information

wait_for_topic_launch_test.TestFixture.test_topics_unsuccessful shows:

Error log ``` Traceback (most recent call last):^M File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 89, in test_topics_unsuccessful^M assert wait_for_node_object.topics_received() == expected_topics^M AssertionError: assert {'chatter_0',..., 'chatter_4'} == {'chatter_0',..., 'chatter_4'}^M Extra items in the right set:^M 'chatter_2'^M 'chatter_1'^M Use -v to get more diff ... [INFO] [python-7]: sending signal 'SIGINT' to process[python-7]^M [INFO] [python-6]: sending signal 'SIGINT' to process[python-6]^M [python-6] Traceback (most recent call last):^M [python-6] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/talker.py", line 51, in ^M [python-6] main()^M [python-6] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/talker.py", line 47, in main^M [python-6] rclpy.shutdown()^M [python-6] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/__init__.py", line 126, in shutdown^M [python-6] _shutdown(context=context)^M [python-6] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/utilities.py", line 58, in shutdown^M [python-6] return context.shutdown()^M [python-6] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/context.py", line 102, in shutdown^M [python-6] self.__context.shutdown()^M [python-6] rclpy._rclpy_pybind11.RCLError: failed to shutdown: rcl_shutdown already called on the given context, at /home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/rcl/rcl/src/rcl/ini\ t.c:241 ```

wait_for_topic_launch_test.TestFixture.test_topics_successful shows:

Error log ``` Traceback (most recent call last):^M File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 61, in test_topics_successful^M with WaitForTopics(topic_list, timeout=2.0) as wait_for_node_object_1:^M File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/launch_testing_ros/wait_for_topics.py", line 96, in __enter__^M raise RuntimeError('Did not receive messages on these topics: ',^M RuntimeError: ('Did not receive messages on these topics: ', {'chatter_4'}) ... [python-12] Traceback (most recent call last):^M [python-12] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/talker.py", line 51, in ^M [python-12] main()^M [python-12] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/talker.py", line 47, in main^M [python-12] rclpy.shutdown()^M [python-12] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/__init__.py", line 126, in shutdown^M [python-12] _shutdown(context=context)^M [python-12] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/utilities.py", line 58, in shutdown^M [python-12] return context.shutdown()^M [python-12] File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/install/rclpy/lib/python3.10/site-packages/rclpy/context.py", line 102, in shutdown^M [python-12] self.__context.shutdown()^M [python-12] rclpy._rclpy_pybind11.RCLError: failed to shutdown: rcl_shutdown already called on the given context, at /home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/rcl/rcl/src/rcl/in\ it.c:241 ... Traceback (most recent call last):^M File "/home/jenkins-agent/workspace/nightly_linux_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 69, in test_topics_successful^M assert wait_for_node_object_2.wait()^M AssertionError: assert False^M + where False = >()^M + where > = .wait^M ^M ```
Blast545 commented 2 years ago

Today it appeared without an exception, just the test failure, see: https://ci.ros2.org/view/nightly/job/nightly_linux_debug/2254/testReport/junit/launch_testing_ros.test.examples/wait_for_topic_launch_test/wait_for_topic_launch_test/

sloretz commented 2 years ago

I'm able to reproduce both failures ( assert wait_for_node_object_2.wait() and AssertionError: assert set() == {'chatter_0',..., 'chatter_4'} , but only when running stress --cpu N. With an idle system I can't reproduce at all.

Blast545 commented 2 years ago

Do you think this test failure might be related to this issue? https://ci.ros2.org/job/nightly_linux_coverage/1584/testReport/junit/launch_testing_ros.test.examples/wait_for_topic_launch_test/wait_for_topic_launch_test/ @sloretz

Blast545 commented 2 years ago

Today it appeared in the humble coverage job: https://ci.ros2.org/job/nightly_linux_humble_coverage/16/testReport/junit/launch_testing_ros.test.examples/wait_for_topic_launch_test/wait_for_topic_launch_test/

Blast545 commented 2 years ago

It also has been appearing often in the linux_coverage jobs: https://ci.ros2.org/job/nightly_linux_coverage/1604/testReport/junit/launch_testing_ros.test.examples/wait_for_topic_launch_test/wait_for_topic_launch_test/

Blast545 commented 2 years ago

:man_farmer: This error fails with a flaky rate of +- 7%.

References from last month: https://build.ros2.org/view/Rci/job/Rci__nightly-debug_ubuntu_jammy_amd64/121/ https://build.ros2.org/view/Rci/job/Rci__nightly-debug_ubuntu_jammy_amd64/136/ https://ci.ros2.org/view/nightly/job/nightly_linux_debug/2333/

Crola1702 commented 2 years ago

Flakiness ratio of the last 2 weeks:

Linux:

Aarch64:

Rhel:

Crola1702 commented 1 year ago

Happened today in aarch64-debug: 2197

FAIL: wait_for_topic_launch_test.TestFixture.test_topics_unsuccessful
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins-agent/workspace/nightly_linux-aarch64_debug/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 90, in test_topics_unsuccessful
    assert wait_for_node_object.topics_received() == expected_topics
AssertionError: assert {'chatter_1',..., 'chatter_4'} == {'chatter_0',..., 'chatter_4'}
  Extra items in the right set:
  'chatter_0'
Crola1702 commented 1 year ago

This issue increased is appearance on ros2 buildfarm during the last week considerably.

Flakiness report 2023-02-21:

Job: Rci__nightly-debug_ubuntu_jammy_amd64 Last time seen: 2023-02-21 Last 30 builds: 3 builds failed, 10% with error Last 7 builds: 3 builds failed, 42.85% with error

Job: nightly_linux_coverage Last time seen: 2023-02-20 Last 30 builds: 7 builds failed, 23.33% with error Last 7 builds: 4 builds failed, 57.14% with error

Job: Rci__nightly-release_ubuntu_jammy_amd64 Last time seen: 2023-02-20 Last 30 builds: 2 builds failed, 6.66% with error Last 7 builds: 2 builds failed, 28.57% with error

Job: nightly_linux-rhel_debug Last time seen: 2023-02-19 Last 30 builds: 3 builds failed, 10% with error Last 7 builds: 1 builds failed, 14.28% with error

Job: nightly_linux_debug Last time seen: 2023-02-18 Last 30 builds: 1 builds failed, 3.33% with error Last 7 builds: 1 builds failed, 14.28% with error

Job: nightly_linux-rhel_repeated Last time seen: 2023-02-17 23:00:01.423000 Last 30 builds: 2 builds failed, 6.66% with error Last 7 builds: 1 builds failed, 14.28% with error

Job: nightly_linux_humble_coverage Last time seen: 2023-02-17 Last 30 builds: 4 builds failed, 13.33% with error Last 7 builds: 2 builds failed, 28.57% with error

Job: nightly_linux_repeated Last time seen: 2023-02-17 Last 30 builds: 1 builds failed, 3.33% with error Last 7 builds: 1 builds failed, 14.28% with error

Job: nightly_linux-aarch64_debug Last time seen: 2023-02-10 Last 30 builds: 1 builds failed, 3.33% with error Last 7 builds: 0 builds failed, 0% with error

clalancette commented 1 year ago

This issue increased is appearance on ros2 buildfarm during the last week considerably.

I think the reason for that is that there were 2 weeks in there where we basically weren't running any of the rclpy-based tests (this was due to a bug in colcon that has since been fixed). So it may not be something that actually increased, we just really weren't testing it much over the last month.

Can you go back and compare with historical data? Like, what was the percentage failure in November and December 2022?

Crola1702 commented 1 year ago

what was the percentage failure in November and December 2022?

Rcinightly-debug_ubuntu_jammy_amd64: 6.56% nightly_linux_coverage: 37.1% nightly_linux-rhel_debug: 12.9% nightly_linux_debug: 11.48% nightly_linux-rhel_repeated: 11.29% nightly_linux_humble_coverage: 12.9% nightly_linux_repeated: 4.92% nightly_linux-aarch64_debug: 9.09% nightly_linux-aarch64_repeated: 1.37% Rcinightly-fastrtps_ubuntu_jammy_amd64: 1.64%

In both November and December

Yes, I think they are like the same as before

Crola1702 commented 1 year ago

Flakiness Report 2023-03-27

job_name last_time first_time build_count failure_count failure_percentage
nightly_linux-rhel_repeated 2023-03-26 2021-12-09 30 14 46.67
nightly_linux_coverage 2023-03-26 2021-10-12 30 14 46.67
Rci__nightly-debug_ubuntu_jammy_amd64 2023-03-27 2022-03-01 31 9 29.03
Rci__nightly-fastrtps_ubuntu_jammy_amd64 2023-03-27 2022-03-01 33 9 27.27
nightly_linux_debug 2023-03-27 2021-10-31 31 7 22.58
nightly_linux-aarch64_debug 2023-03-26 2021-07-23 34 6 17.65
nightly_linux-rhel_debug 2023-03-26 2021-12-10 30 5 16.67
nightly_linux-aarch64_repeated 2023-03-26 2021-07-23 32 3 9.38
nightly_linux-rhel_release 2023-03-26 2021-12-30 30 2 6.67
Rci__nightly-release_ubuntu_jammy_amd64 2023-03-27 2022-03-01 33 2 6.06
nightly_linux_humble_coverage 2023-03-27 2022-08-08 31 1 3.23

I think these percentages are higher than ever. Some jobs almost reached 50% flakiness.

@clalancette Do you think we should disable this test?

clalancette commented 1 year ago

@clalancette Do you think we should disable this test?

No, I don't think so. The ones that we actually mostly care about (nightly_linux_debug and below) are at 22%. Also, we will hopefully have some time to look at this later in April.

clalancette commented 1 year ago

With #360 merged, I'm going to call this "fixed". But please reopen if this starts occuring again.

Crola1702 commented 1 year ago

A similar problem is happening in Humble Coverage at a 10% flaky ratio:

Reference build: https://ci.ros2.org/job/nightly_linux_humble_coverage/594/

wait_for_topic_launch_test.TestFixture.test_topics_successful failed

FAIL: wait_for_topic_launch_test.TestFixture.test_topics_successful
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins-agent/workspace/nightly_linux_humble_coverage/ws/src/ros2/launch_ros/launch_testing_ros/test/examples/wait_for_topic_launch_test.py", line 62, in test_topics_successful
    with WaitForTopics(topic_list, timeout=2.0) as wait_for_node_object_1:
  File "/home/jenkins-agent/workspace/nightly_linux_humble_coverage/ws/src/ros2/launch_ros/launch_testing_ros/launch_testing_ros/wait_for_topics.py", line 96, in __enter__
    raise RuntimeError('Did not receive messages on these topics: ',
RuntimeError: ('Did not receive messages on these topics: ', {'chatter_1'})

I'm not sure if this is the same problem, or if it is a different but similar one

Blast545 commented 2 months ago

This seems to be happening consistently in humble coverage jobs now, see: https://ci.ros2.org/job/nightly_linux_humble_coverage/944/

Crola1702 commented 1 month ago

This is happening as a flaky issue

Flakiness for the last 15 days: job_name last_fail first_fail build_count failure_count failure_percentage
nightly_linux_humble_coverage 2024-10-07 2024-09-23 14 7 50.0
Ici__nightly-debug_ubuntu_jammy_amd64 2024-10-04 2024-10-04 5 1 20.0
nightly_linux_iron_coverage 2024-10-02 2024-09-22 13 3 23.08
nightly_linux-rhel_repeated 2024-09-24 2024-09-23 15 2 13.33

Last failures:

Crola1702 commented 3 weeks ago

This issue didn't happen again