Open Crola1702 opened 1 year ago
@clalancette, this is happening almost consistently in Windows builds. Do you think it's a good time to disable this test while https://github.com/ros2/rosbag2/pull/1342 is resolved? (it seems it would be a long time until we have updates on the fix PR)
@Crola1702 @clalancette I guess it hasn't been fully fixed for Windows after my overhaul in player tests per https://github.com/ros2/rosbag2/pull/1297.
I will try to take a look at it this week.
If I don't find a solution, we can disable it for Windows again.
BTW, I came across this failure in one of my recent PRs: https://github.com/ros2/rosbag2/pull/1423#issuecomment-1666421538.
My preliminary analysis was:
PlayEndToEndTestFixture.play_filters_by_topic failed by timeout because it did not receive confirmation of the service call for player resume within 60 seconds.
> BTW, I came across this failure in one of my recent PRs: https://github.com/ros2/rosbag2/pull/1423#issuecomment-1666421538.
I haven't seen this failure in the buildfarm :thinking:
I'm disabling the test in: https://github.com/ros2/rosbag2/pull/1452
Preliminary analysis:
There is a race condition between the moment the service becomes available and the moment we start spinning the player node to process service callbacks. That is, the resume service is created, and on the other side the test checks via the graph that the service is available and sends a service request. However, the rosbag2 player may not be ready to process service callbacks, since we haven't even created the executor for spinning the player node yet.
The solution would be to move the play/pause service creation from the player constructor to the rosbag2_player->play method.
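To illustrate the proposed fix, here is a minimal sketch in plain C++ with no rclcpp involved. `ToyGraph` and `ToyPlayer` are hypothetical stand-ins, not rosbag2 classes: the graph is modeled as a map of service names, and the executor is not modeled at all. The only point is that when the service is registered in `play()` rather than in the constructor, a client cannot discover it before the player is able to answer.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for the ROS graph: service name -> handler.
struct ToyGraph
{
  std::map<std::string, std::function<bool()>> services;

  // Roughly what a client sees when it waits for a service to appear.
  bool is_available(const std::string & name) const
  {
    return services.count(name) > 0;
  }
};

// Hypothetical player that advertises "resume" only once play() is running.
class ToyPlayer
{
public:
  // Note: no service is created in the constructor.
  explicit ToyPlayer(ToyGraph & graph)
  : graph_(graph) {}

  void play()
  {
    ready_ = true;
    // Advertise only after the player can actually handle the request.
    graph_.services["resume"] = [this]() {return ready_;};
  }

private:
  ToyGraph & graph_;
  bool ready_ = false;
};
```

With this ordering, a test that waits for the service through the graph cannot send a request to a player that has been constructed but not yet started.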
cc: @clalancette
@MichaelOrlov Could this be another case similar to https://github.com/ros2/rosbag2/pull/1796#pullrequestreview-2299314606 ?
@clalancette Not the same. It is completely different.
Here, the problem is that we first construct a Rosbag2 Player object in rosbag2_py::_transport.cpp. In the constructor we create the play/pause service, and this service becomes visible via node_graph to the test, which is waiting for it in another process. The test starts sending the service request to resume playback as soon as it sees that the service is available. However, on the other side (Player) we haven't yet started Player::Play() and/or haven't yet created the executor to spin it in rosbag2_py::_transport.cpp.
https://github.com/ros2/rosbag2/blob/2d4d02fd1374781f7111da3b0d91777905d9f7be/rosbag2_py/src/rosbag2_py/_transport.cpp#L262-L271
Update:
Even if we add a wait for exec::is_spinning() before calling Player::play(), it will not help much, since the race also happens between constructing the Player and the subsequent exec::spin() call.
It seems I found a workaround for the tests.
In the test, before sending the successful_service_request<Resume>(cli_resume_); request:
https://github.com/ros2/rosbag2/blob/2d4d02fd1374781f7111da3b0d91777905d9f7be/rosbag2_tests/test/rosbag2_tests/test_rosbag2_play_end_to_end.cpp#L195-L197
We need to repeatedly send a request to another service, IsPaused, ignoring the failure if we don't get a response. Once we get a response for IsPaused from the Player, it means that the Rosbag2 Player is fully started and ready. It works as a sort of current-status check for the Player.
We will need to write a helper function similar to this one:
https://github.com/ros2/rosbag2/blob/2d4d02fd1374781f7111da3b0d91777905d9f7be/rosbag2_tests/test/rosbag2_tests/test_rosbag2_play_end_to_end.cpp#L76-L89
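A rough sketch of what such a helper could look like, using only the C++ standard library. The name `wait_until_responsive` and its parameters are hypothetical. The `probe` callback stands in for "send an IsPaused request with a short timeout and report whether a response arrived"; the actual rclcpp client calls are deliberately left out.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: call `probe` repeatedly until it reports a response
// or the overall timeout expires. Failed attempts are ignored on purpose,
// because early requests may be sent before the player starts spinning.
inline bool wait_until_responsive(
  const std::function<bool()> & probe,
  std::chrono::milliseconds timeout,
  std::chrono::milliseconds retry_period = std::chrono::milliseconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  do {
    if (probe()) {
      return true;  // Got a response: the player is processing callbacks.
    }
    std::this_thread::sleep_for(retry_period);
  } while (std::chrono::steady_clock::now() < deadline);
  return false;  // Never got a response within the timeout.
}
```

In the test, a call like this would go right before `successful_service_request<Resume>(cli_resume_);`, with the test failing early if the helper returns false.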
@r7vme Help here will be appreciated.
Description
Flaky test test_rosbag2_play_end_to_end in Windows CI (debug, release and repeated)
Test regressions:
Expected Behavior
Test should pass
Actual Behavior
Test failing because of a timeout
To Reproduce
System (please complete the following information)
Additional context
Reference build: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2813/
Test regressions:
Test is failing because of a timeout:
Log output:
Test gets stuck when the error pops up (normally takes 15 seconds to run)
Flakiness ratio (last 15 days)
Updated 17-08-2023
Running a diff between ros2 repos in nightly_win_deb#2806 and nightly_win_deb#2807:
First time this issue was seen:
https://github.com/ros2/rmw_connextdds/pull/26#issuecomment-1658439136