ros2 / rosbag2

:man_farmer: Regression in test_play_{timing,services}__rmw_{rmw_vendor} on the buildfarm jobs #862

Open j-rivero opened 3 years ago

j-rivero commented 3 years ago

Description

The following tests have started to fail consistently (three days in a row) in the CI of https://ci.ros2.org/job/nightly_linux_repeated/:

If I'm not wrong, the build shows that the commit used is 891e08128e2d6ff36452871abda2cd776b8a1566, which corresponds to pull #848.

Expected Behavior

Tests should pass :)

Actual Behavior

Timeout


2_transport/test_results/rosbag2_transport/test_play_timing__rmw_fastrtps_cpp.gtest.xml
30: Running main() from gmock_main.cc
30: [==========] Running 3 tests from 1 test suite.
30: [----------] Global test environment set-up.
30: [----------] 3 tests from PlayerTestFixture
30: [ RUN      ] PlayerTestFixture.playing_respects_relative_timing_of_stored_messages
30: [       OK ] PlayerTestFixture.playing_respects_relative_timing_of_stored_messages (1025 ms)
30: [ RUN      ] PlayerTestFixture.playing_respects_rate
    Test #30: test_play_timing__rmw_fastrtps_cpp .................***Timeout  60.05 sec
test 31
      Start 31: test_play_services__rmw_fastrtps_cpp

31: Test command: /home/jenkins-agent/workspace/nightly_linux_repeated/venv/bin/python3.8 "-u" "/home/jenkins-agent/workspace/nightly_linux_repeated/ws/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_results/rosbag2_transport/test_play_services__rmw_fastrtps_cpp.gtest.xml" "--package-name" "rosbag2_transport" "--output-file" "/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/ament_cmake_gmock/test_play_services__rmw_fastrtps_cpp.txt" "--env" "RMW_IMPLEMENTATION=rmw_fastrtps_cpp" "--command" "/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_play_services__rmw_fastrtps_cpp" "--gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_results/rosbag2_transport/test_play_services__rmw_fastrtps_cpp.gtest.xml"
31: Test timeout computed to be: 60
31: -- run_test.py: extra environment variables:
31:  - RMW_IMPLEMENTATION=rmw_fastrtps_cpp
31: -- run_test.py: invoking following command in '/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport':
31:  - /home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_play_services__rmw_fastrtps_cpp --gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_results/rosbag2_transport/test_play_services__rmw_fastrtps_cpp.gtest.xml
31: Running main() from gmock_main.cc
31: [==========] Running 5 tests from 1 test suite.
31: [----------] Global test environment set-up.
31: [----------] 5 tests from PlaySrvsTest
31: [ RUN      ] PlaySrvsTest.pause_resume
31: [       OK ] PlaySrvsTest.pause_resume (572 ms)
31: [ RUN      ] PlaySrvsTest.toggle_paused
31: [       OK ] PlaySrvsTest.toggle_paused (487 ms)
31: [ RUN      ] PlaySrvsTest.set_rate_good_values
    Test #31: test_play_services__rmw_fastrtps_cpp ...............***Timeout  60.06 sec
test 32

To Reproduce

Check CI job

System (please complete the following information)

Blast545 commented 3 years ago

After #863 was merged, this has been failing in 21 out of 35 (60%) of the latest debug builds. The failing test is playing_respects_delay. I haven't checked the repeated jobs, but right now it's failing frequently in the nightly debug jobs.

Most recent case: https://build.ros2.org/view/Rci/job/Rci__nightly-debug_ubuntu_focal_amd64/472/

Blast545 commented 2 years ago

Not counting the repeated jobs, this one has occurred 13 times in the last 20 days, in nightly_win_rel, nightly_win_deb, nightly_linux_release, nightly_linux_aarch_release and Rci__nightly-release_ubuntu_jammy_amd64.

Here's a recent reference: https://ci.ros2.org/view/nightly/job/nightly_win_rel/2278/testReport/junit/(root)/rosbag2_transport/test_play_services__rmw_fastrtps_cpp_gtest_missing_result/

It affects all the rmw vendors. It is related to https://github.com/ros2/rosbag2/issues/732, but this one happens more often.

MichaelOrlov commented 2 years ago

@Blast545 I've spent some time and tried to analyze the failures in these tests:

Those tests are flaky by design, and I honestly don't know how to rewrite them to be more deterministic.

MichaelOrlov commented 2 years ago

@Blast545 BTW, I see that rosbag2_transport.test_play_services__rmw_fastrtps_cpp periodically fails on CI.
Here is one of the examples: https://build.ros2.org/job/Hpr__rosbag2__ubuntu_jammy_amd64/1/testReport/junit/(root)/projectroot/test_play_services__rmw_fastrtps_cpp/ I am curious whether it fails only with rmw_fastrtps?

I've tried to make a brief analysis of the failure. I see that the failure happens in play_next_response = successful_call(cli_play_next); at iteration 195 out of 200, and the failure repeats twice with the same numbers.
It looks like 195 is some magic number specific to the Fast DDS rmw, maybe a maximum of some internal resource.

play_next_response = successful_call(cli_play_next); consistently fails at line 101: EXPECT_EQ(future.wait_for(service_call_timeout), std::future_status::ready);

It would actually be better to ask someone from the Fast RTPS support team to look at this failure.

Meanwhile, to mitigate this failure, I would suggest increasing const std::chrono::seconds service_call_timeout {1}; up to 3 seconds and decreasing the number of messages to publish, const size_t num_msgs_to_publish = 200;, to 150. A sketch of that change follows.
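
For reference, a minimal sketch of what that mitigation would look like, assuming the two constants are defined in the test fixture as quoted above (the identifier names and the new values come from this comment, not from a verified checkout of the test source):

```cpp
#include <chrono>
#include <cstddef>

// Suggested mitigation (sketch only, not the actual rosbag2 test source):
// give each service call more headroom on a loaded machine, 1 s -> 3 s ...
const std::chrono::seconds service_call_timeout {3};

// ... and publish fewer messages so the play_next loop runs fewer iterations, 200 -> 150.
const size_t num_msgs_to_publish = 150;
```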

Blast545 commented 2 years ago

Thanks for digging into this! @MichaelOrlov

Yeah, rosbag2_transport.test_play_services__rmw_fastrtps_cpp is the remaining flaky test affecting us the most. I have seen it in various rosbag2_transport PRs.

On Linux it happens only for rmw_fastrtps and, in case it rings a bell, it only occurs on release builds: Rci__nightly-release_ubuntu_jammy_amd64#74, nightly_linux_release#2276, nightly_linux-rhel_release#1127

It happens on other distributions for `rmw_cyclonedds`, but it's not a common scenario.
On nightly_win_rep it only fails with `rmw_cyclonedds`: https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/
On nightly_linux_repeated it can fail either with only `rmw_fastrtps` (more common) or with both, like https://ci.ros2.org/view/nightly/job/nightly_linux_repeated/2642
And there's an ultra-rare scenario where it fails for all three rmw vendors on nightly_win_deb and nightly_win_rel: https://ci.ros2.org/view/nightly/job/nightly_win_rel/2278/ and https://ci.ros2.org/view/nightly/job/nightly_win_deb/2334/

I will open the PR with your suggestion tomorrow morning @MichaelOrlov and get more feedback there as needed. @clalancette should we ping someone in particular from the fastrtps team to take a look?

MichaelOrlov commented 2 years ago

@Blast545 I see how the test fails on rmw_cyclonedds from your link https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/. It fails in a different test, toggle_paused, with just a few iterations of the service call. It seems that the 1-second timeout for the service call is not enough in some cases, probably on heavily loaded hardware or network stacks.

Let's try to increase const std::chrono::seconds service_call_timeout {1}; up to 5 seconds and see whether CI then only fails on fastrtps at iteration 195.
If that is the case, then it makes sense to ask someone from the fastrtps team to take a look.

MichaelOrlov commented 2 years ago

@Blast545 @clalancette I have good news about this annoying failing test_play_services test.

First of all, I was able to reproduce it locally by putting some extra load on my machine; I loaded my computer with the stress -m 60 command. Please note that I was able to reproduce the same failure with both rmws, Fast RTPS and CycloneDDS, and not only for the play_next test. This is basically similar to what we have seen on CI.

The second piece of good news: I found the breaking PR and commit. It is "Update client API to be able to remove pending requests" rclcpp#1734, and the relevant commit is https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559

I've tried reverting commit https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559 locally, and the failure doesn't reproduce any more.

@ivanpauno Could you please pick up further analysis of the failing https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/ from this point? You were the author of the breaking commit https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559

ivanpauno commented 2 years ago

@ivanpauno Could you please pick up further analysis of the failing https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/ from this point? You were the author of the breaking commit https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559

Could you summarize the analysis you have done up to now?

The problem seems to be a race. Maybe https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559 introduced a race (though that seems unlikely based on the code changes), but it might be a pre-existing bug that became more likely to happen after the commit.

MichaelOrlov commented 2 years ago

Hi @ivanpauno, sorry for my late response. My preliminary analysis is the following:

  1. We have been observing a subset of tests failing or flaky. Those tests are a sort of "integration test" in the rosbag2_transport package and involve sending requests and receiving responses via service calls. The affected tests are in test_play_services in particular.
  2. I've found that all failing tests share a common design: they expect to receive a response back from the "server" as an answer to the request, for instance the play_next or is_paused requests, which are supposed to return a boolean value.
  3. The failure happens in successful_call(..) when we don't get a value from the future due to the timeout (see the sketch below). I've tried increasing that timeout up to 30 seconds, but it doesn't help; the same failure is observed. It's difficult to analyze this failure since the response simply doesn't arrive, and it doesn't happen very often.
  4. I've tried to optimize the unit tests and add some debug info to rule out possible overlap with other use cases and flows. I also thoroughly reviewed the rosbag2 source code responsible for sending the response and found no issues; everything is clear and well written. I've also run valgrind on the failing tests and haven't found anything that could cause memory corruption.
  5. In most cases the failure happens with rmw_fastrtps, but we have seen cases where the same failure happens on rmw_cyclonedds, for example here https://pipelines.actions.githubusercontent.com/serviceHosts/af0a8eef-e408-4986-ae27-20f00fbcb6f9/_apis/pipelines/1/runs/26054/signedlogcontent/2?urlExpires=2022-06-23T00%3A53%3A29.4599815Z&urlSigningMethod=HMACV1&urlSignature=K%2FQgv3z8x4SAtVeznmXFKGdg05Bur8EcI5VCv4ZaJl0%3D (search for test_play_services__rmw_cyclonedds_cpp).
  6. I was able to reproduce those failures locally by loading my machine with other tasks via the command stress -m 60.
  7. I found that the tests perform much faster under heavy machine load without https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559. It seems that commit also introduces a performance regression for services.

Please let me know if you need more information or details about this issue, or if something is unclear.
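
To make points 2 and 3 concrete, here is a minimal, self-contained sketch of the request/response pattern these tests rely on. It is illustrative only: std_srvs::srv::Trigger stands in for the rosbag2 player services (play_next, is_paused, ...), and the node name, service name, helper shape, and spin setup are assumptions rather than the actual test code.

```cpp
#include <chrono>
#include <future>
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"
#include "std_srvs/srv/trigger.hpp"

using Trigger = std_srvs::srv::Trigger;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("play_services_client");
  auto client = node->create_client<Trigger>("play_next");

  // A dedicated spinner thread resolves the client's futures, similar to the
  // client spin thread used by the real test fixture.
  rclcpp::executors::SingleThreadedExecutor exec;
  exec.add_node(node);
  std::thread spinner([&exec]() {exec.spin();});

  const std::chrono::seconds service_call_timeout {1};

  // "successful_call" pattern: send one request and require the response to
  // arrive before the timeout. This is the wait that times out in the flaky runs.
  bool ok = false;
  if (client->wait_for_service(std::chrono::seconds(5))) {
    auto future = client->async_send_request(std::make_shared<Trigger::Request>());
    ok = (future.wait_for(service_call_timeout) == std::future_status::ready);
  }

  exec.cancel();
  spinner.join();
  rclcpp::shutdown();
  return ok ? 0 : 1;
}
```

The actual fixture in test_play_services additionally drives the player and checks the response contents; this sketch only shows the timing-sensitive wait that is failing.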

fujitatomoya commented 2 years ago

could be related to https://github.com/ros2/rmw_fastrtps/pull/616.

MichaelOrlov commented 2 years ago

could be related to ros2/rmw_fastrtps#616.

@fujitatomoya Unlikely, since in test_play_services we are not sending many requests in a burst. We send service requests one by one, each time verifying that we get the corresponding response from the "server" before sending the next request, and at some iteration we simply lose a response. See the sketch below.
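
For illustration, a rough sketch of that one-by-one pattern (hypothetical function and parameters, with std_srvs::srv::Trigger again standing in for the real play_next service, and assuming another thread spins the client's node):

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_srvs/srv/trigger.hpp"

using Trigger = std_srvs::srv::Trigger;

// Each request must be answered before the next one is sent, so there is never
// a burst of outstanding requests; a single lost response stalls the loop at
// that iteration (e.g. iteration 195 in the CI failures discussed above).
bool play_all_messages_one_by_one(
  rclcpp::Client<Trigger>::SharedPtr client,
  std::size_t num_msgs_to_publish,
  std::chrono::seconds service_call_timeout)
{
  for (std::size_t i = 0; i < num_msgs_to_publish; ++i) {
    auto future = client->async_send_request(std::make_shared<Trigger::Request>());
    if (future.wait_for(service_call_timeout) != std::future_status::ready) {
      return false;  // the response never arrived within the timeout
    }
  }
  return true;
}
```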

fujitatomoya commented 2 years ago

@MichaelOrlov thanks for the comment. I wasn't sure, it just came to mind.

ivanpauno commented 2 years ago

I was able to reproduce the issue, but I don't fully understand why it happens yet (and it's pretty hard to reproduce). I will post here if I have further updates.

MichaelOrlov commented 1 year ago

Could be related to https://github.com/ros2/ros2/issues/922 and https://answers.ros.org/question/411860/can-ros2-services-still-be-expected-to-be-flakey/

MichaelOrlov commented 1 year ago

Could be related to https://github.com/ros2/rclcpp/issues/2039, https://github.com/eProsima/Fast-DDS/pull/3087 and https://github.com/osrf/docker_images/issues/628

MichaelOrlov commented 10 months ago

Could be related to https://github.com/ros2/rmw_cyclonedds/pull/187, https://github.com/ros2/rmw_cyclonedds/issues/74, https://github.com/ros2/rmw_fastrtps/issues/392 and https://github.com/ros2/rmw_cyclonedds/issues/191.