ros2 / rosbag2

Apache License 2.0
272 stars 245 forks source link

:farmer: test_rosbag2_play_end_to_end flaky on Windows #1437

Open Crola1702 opened 1 year ago

Crola1702 commented 1 year ago

Description

Flaky test test_rosbag2_play_end_to_end in windows CI (debug, release and repeated)

Test regressions:

Expected Behavior

Test should pass

Actual Behavior

Test failing because of a timeout

To Reproduce

  1. Run a build in nightly_win_deb, nightly_win_rel or nightly_win_rep
  2. See projectroot.test_rosbag2_play_end_to_end fail

System (please complete the following information)

Additional context

Reference build: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2813/

Test regressions:

Test is failing because of a timeout:

Log output:

[WARN] [1691491268.193064100] [subscription_manager_1691491266065392000]: SubscriptionManager::continue_spinning(..) finished by timeout
C:\ci\ws\src\ros2\rosbag2\rosbag2_tests\test\rosbag2_tests\test_rosbag2_play_end_to_end.cpp(80): error: Expected equality of these values:
  future.wait_for(service_call_timeout_)
    Which is: 4-byte object <01-00 00-00>
  std::future_status::ready
    Which is: 4-byte object <00-00 00-00>
 1/11 Test  #2: test_rosbag2_play_end_to_end .....***Timeout  60.00 sec

Test gets stuck when the error pops up (normally takes 15 seconds to run)


Flakiness ratio (last 15 days)

job_name last_time first_time build_count failure_count failure_percentage
nightly_win_rep 2023-08-16 2023-08-02 15 13 86.67
nightly_win_rel 2023-08-16 2023-08-02 15 14 93.33
nightly_win_deb 2023-08-16 2023-08-02 15 13 86.67

Updated 17-08-2023

Running a diff between ros2 repos in nightly_win_deb#2806 and nightly_win_deb#2807:

```diff @@ -36,7 +36,7 @@ repositories: eProsima\Fast-DDS: type: git url: https://github.com/eProsima/Fast-DDS.git - version: 7cf43a62cabc3124721258f02c9257f451dd1971 + version: 9ae27f174c1a33bf539c589ce2f1d6630052d1b0 eProsima\foonathan_memory_vendor: type: git url: https://github.com/eProsima/foonathan_memory_vendor.git @@ -208,7 +208,7 @@ repositories: ros2\mimick_vendor: type: git url: https://github.com/ros2/mimick_vendor.git - version: 6fcd465251c1e62b1dddabf6607712da8a141a2c + version: 24f0be689e525dbcf18cce910ac12bae26c07e3c ros2\orocos_kdl_vendor: type: git url: https://github.com/ros2/orocos_kdl_vendor.git @@ -244,7 +244,7 @@ repositories: ros2\rclpy: type: git url: https://github.com/ros2/rclpy.git - version: 913afa019b0d60e7255bd84247dbeab4735d96ea + version: 5367703c9812680a563fb904b8c9d187da310bd3 ros2\rcpputils: type: git url: https://github.com/ros2/rcpputils.git @@ -264,7 +264,7 @@ repositories: ros2\rmw_connextdds: type: git url: https://github.com/ros2/rmw_connextdds.git - version: 1206113b6eff6d2a6557911866548ff4415c6852 + version: b57d032cad10a68f0cb0d349a715b9e3c4475cf4 ros2\rmw_cyclonedds: type: git url: https://github.com/ros2/rmw_cyclonedds.git @@ -300,7 +300,7 @@ repositories: ros2\rosbag2: type: git url: https://github.com/ros2/rosbag2.git - version: ba199d05954d6e51975c47acd3725cac6267778f + version: fda5aea264b4183d94352a7c7caf266ae104aa52 ```

First time this issue was seen:

https://github.com/ros2/rmw_connextdds/pull/26#issuecomment-1658439136

Crola1702 commented 1 year ago

@clalancette, this is happening almost consistently in Windows builds. Do you think it's a good time to disable this test while https://github.com/ros2/rosbag2/pull/1342 is resolved? (it seems it would be a long time until we have updates on the fix PR)

MichaelOrlov commented 1 year ago

@Crola1702 @clalancette I guess it hasn't been fully fixed for Windows after my overhaul in player tests per https://github.com/ros2/rosbag2/pull/1297.
I will try to take a look at it this week. If I will not find a solution we can disable it for Windows again.

MichaelOrlov commented 1 year ago

BTW. I have come across with this failure in one of my recent PR https://github.com/ros2/rosbag2/pull/1423#issuecomment-1666421538.
My preliminary analysis was:

Failed PlayEndToEndTestFixture.play_filters_by_topic by timeout since was not able to receive confirmation about service call for player resume in 60 seconds.

Crola1702 commented 1 year ago

BTW. I have come across with this failure in one of my recent PR https://github.com/ros2/rosbag2/pull/1423#issuecomment-1666421538.

I haven't seen this failure in the buildfarm :thinking:

Crola1702 commented 1 year ago

I'm disabling the test in: https://github.com/ros2/rosbag2/pull/1452

MichaelOrlov commented 1 week ago

Preliminary analysis: There is a race condition between when service became available and when we start spining player node to be able to process service callbacks. i.e. it turns out that resume service created and on the other side we check in the test that service is available via graph and sending service request. However the rosbag2 player may not be ready to process callbacks from the services since we haven't yet even created executor for spinning player node.

The solution would be to move play/pause service creation from player constructor to the roabg2_palyer->play method.

cc: @clalancette

clalancette commented 1 week ago

@MichaelOrlov Could this be another case similar to https://github.com/ros2/rosbag2/pull/1796#pullrequestreview-2299314606 ?

MichaelOrlov commented 1 week ago

@clalancette Not the same. It is completely different. Here, the problem is that we first constructing a Rosbag2 Player class in the rosbag2_py::transport_.cpp in the constructor we creating service for play/pause and this service became visible via node_graph for the test wich is waiting for it in another process. The test start sending service request for resume playback since it see that service became available. Although on another side (Player) we haven't yet started Player::Play() and(or) haven't yet created executor to spin it in the rosbag2_py::_transport.cpp. https://github.com/ros2/rosbag2/blob/2d4d02fd1374781f7111da3b0d91777905d9f7be/rosbag2_py/src/rosbag2_py/_transport.cpp#L262-L271

Update: Even if we will add wait for exec::is_spinning() before calling Player::play() it will not help much since the race also happened between constructing Player and further exec::spin() call.

MichaelOrlov commented 1 week ago

It seems I found a workaround for tests. In test before sending successful_service_request<Resume>(cli_resume_); request https://github.com/ros2/rosbag2/blob/2d4d02fd1374781f7111da3b0d91777905d9f7be/rosbag2_tests/test/rosbag2_tests/test_rosbag2_play_end_to_end.cpp#L195-L197

Need repeatedely request for another service IsPaused, but ignoring the failure if we will not get response. When we will get response for IsPaused from Player - it means that Rosbag2 Player is fully started and ready. It is sort of current ststus of the Player. Will need to write some helper function similar to the https://github.com/ros2/rosbag2/blob/2d4d02fd1374781f7111da3b0d91777905d9f7be/rosbag2_tests/test/rosbag2_tests/test_rosbag2_play_end_to_end.cpp#L76-L89

MichaelOrlov commented 1 week ago

@r7vme Help here will be appreciated.