j-rivero opened this issue 3 years ago
After #863 was merged, the playing_respects_delay test has been failing in 21 out of 35 (60%) of the latest debug builds. I haven't checked the repeated jobs, but right now it is failing frequently in the nightly debug jobs.
Most recent case: https://build.ros2.org/view/Rci/job/Rci__nightly-debug_ubuntu_focal_amd64/472/
Without taking into account the repeated jobs, this one has occurred 13 times in the last 20 days, across the nightly_win_rel, nightly_win_deb, nightly_linux_release, nightly_linux_aarch_release and Rci__nightly-release_ubuntu_jammy_amd64 jobs.
Here's a recent reference: https://ci.ros2.org/view/nightly/job/nightly_win_rel/2278/testReport/junit/(root)/rosbag2_transport/test_play_services__rmw_fastrtps_cpp_gtest_missing_result/
It affects all rmw vendors. Related to https://github.com/ros2/rosbag2/issues/732, but this one happens more often.
@Blast545 I've spent some time trying to analyze the failures in these tests:
They rely on the /clock topic, which is published with the rclcpp::ClockQoS() QoS settings. That profile uses best-effort reliability, so there is no guarantee that messages will be delivered at the transport layer.
The tests therefore likely fail because we try to detect the clock rate from the timestamps of two adjacent messages, and some of those messages get lost at the transport layer.
Those tests are flaky by design, and I honestly don't know how to rewrite them to be more deterministic.
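As an illustration of why the check is fragile, here is a minimal sketch (assumed shape, not the actual test code) of a /clock rate estimate built on rclcpp::ClockQoS(): with best-effort reliability a single dropped sample silently doubles the measured gap between the two adjacent messages being compared.

```cpp
#include <optional>

#include "rclcpp/rclcpp.hpp"
#include "rosgraph_msgs/msg/clock.hpp"

class ClockRateEstimator : public rclcpp::Node
{
public:
  ClockRateEstimator()
  : Node("clock_rate_estimator")
  {
    sub_ = create_subscription<rosgraph_msgs::msg::Clock>(
      "/clock", rclcpp::ClockQoS(),  // best-effort: delivery is not guaranteed
      [this](rosgraph_msgs::msg::Clock::ConstSharedPtr msg) {
        const rclcpp::Time now(msg->clock);
        if (last_stamp_) {
          // If the transport dropped the previous sample, this gap is 2x (or
          // more) the real publishing period and the inferred rate is wrong.
          const auto gap = now - *last_stamp_;
          RCLCPP_INFO(get_logger(), "estimated period: %.6f s", gap.seconds());
        }
        last_stamp_ = now;
      });
  }

private:
  rclcpp::Subscription<rosgraph_msgs::msg::Clock>::SharedPtr sub_;
  std::optional<rclcpp::Time> last_stamp_;
};
```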
Stack trace:
C:\ci\ws\src\ros2\rosbag2\rosbag2_transport\test\rosbag2_transport\test_record_services.cpp:93
Value of: pub_manager.wait_for_matched(test_topic_.c_str())
Actual: false
Expected: true
This is a very strange failure which I wouldn't expect to happen.
Basically, it fails to detect a matching subscription within the 10 seconds. The test and the implementation look good.
What I don't trust 100% is that the test uses a SingleThreadedExecutor for the recorder, i.e. for the subscriptions we expect to see in the failing check. Maybe rewriting the test to use std::thread directly, without the SingleThreadedExecutor, will help.
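One possible reading of that suggestion, as a minimal sketch rather than the actual test fixture (the node name and structure are placeholders): drive the recorder node from a dedicated std::thread so its subscriptions are serviced independently of the test body. Note that rclcpp::spin() still creates a single-threaded executor internally, so the real rewrite might instead poll a rclcpp::WaitSet.

```cpp
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  // Hypothetical stand-in for the recorder node created by the test fixture.
  auto recorder_node = std::make_shared<rclcpp::Node>("recorder_under_test");

  // Service the recorder's subscriptions from a dedicated thread instead of
  // spinning a SingleThreadedExecutor in the test body.
  std::thread spin_thread([recorder_node]() {rclcpp::spin(recorder_node);});

  // ... test body: create publishers, wait_for_matched(), publish, assert ...

  rclcpp::shutdown();  // makes rclcpp::spin() return
  spin_thread.join();
  return 0;
}
```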
@Blast545 BTW, I see that rosbag2_transport.test_play_services__rmw_fastrtps_cpp periodically fails on CI.
Here is one of the examples: https://build.ros2.org/job/Hpr__rosbag2__ubuntu_jammy_amd64/1/testReport/junit/(root)/projectroot/test_play_services__rmw_fastrtps_cpp/
I am curious whether it fails only with rmw_fastrtps?
I've tried to make a brief analysis of the failure.
I see that the failure happens in play_next_response = successful_call, at iteration 195. It looks like 195 is some magic number specific to the Fast RTPS DDS, maybe a maximum of some internal resource.
It would actually be better to ask someone from the Fast RTPS support team to look at this failure.
Meanwhile, to mitigate this failure, I would suggest trying to increase const std::chrono::seconds service_calltimeout {1}; up to 3 seconds, and decreasing the number of messages to publish, const size_t num_msgs_topublish = 200;, to 150.
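For reference, the proposed change would amount to something like this (identifiers spelled as quoted above; the actual test source may differ slightly):

```cpp
#include <chrono>
#include <cstddef>

// Proposed values: more headroom per service call, fewer messages so the
// play_next loop is shorter.
const std::chrono::seconds service_calltimeout {3};   // was {1}
const std::size_t num_msgs_topublish = 150;           // was 200
```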
Thanks for digging into this, @MichaelOrlov!
Yeah, rosbag2_transport.test_play_services__rmw_fastrtps_cpp is the remaining flaky test affecting us the most. I have seen it in various rosbag2_transport PRs.
On Linux it happens only for rmw_fastrtps and, in case it rings a bell, it only occurs on release builds:
Rci__nightly-release_ubuntu_jammy_amd64#74
nightly_linux_release#2276
nightly_linux-rhel_release#1127
I will open the PR with your suggestion tomorrow morning, @MichaelOrlov, and get more feedback there as needed.
@clalancette should we ping someone in particular from the fastrtps team to take a look?
@Blast545 I see how the test fails on rmw_cyclonedds from your link https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/.
It fails in another test, toggle_pause, after just a few iterations of the service call. It seems that a 1-second timeout for the service call is not enough in some cases, probably on heavily loaded hardware or network stacks.
Let's try to increase const std::chrono::seconds service_calltimeout {1}; up to 5 seconds and see if CI then only fails on fastrtps at iteration 195.
If that turns out to be the case, it makes sense to ask someone from the fastrtps team to take a look.
@Blast545 @clalancette I have good news about this annoying failing test_play_services test.
First of all, I was able to reproduce it locally by putting some extra load on my machine. I loaded my computer with the stress -m 60 command. Please note that I was able to reproduce the same failure for both rmws, Fast RTPS and CycloneDDS, and not only for the play_next test. Basically, it's similar to what we have seen on CI.
The second piece of good news is that I found the breaking PR and commit: Update client API to be able to remove pending requests rclcpp#1734, and the relevant commit https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559.
I've tried to revert commit https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559 locally, and the failure doesn't reproduce any more.
@ivanpauno Could you please pick up further analysis of the failing https://ci.ros2.org/view/nightly/job/nightly_win_rep/2591/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/ from this point, since you were the author of the breaking commit https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559?
Could you summarize the analysis you have done up to now?
The problem seems to be a race. Maybe https://github.com/ros2/rclcpp/commit/679fb2ba334971d9769b44258df9095025567559 introduced a race (though that seems to be unlikely based on the code changes), but it might be a pre-existing bug that became more likely to happen after the commit.
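To make that hypothesis concrete, here is a hedged illustration (a hypothetical timeline, not the confirmed root cause) of where such a race could hide at the API level after rclcpp#1734: a client that prunes its pending request on timeout will silently drop a response that arrives a moment later, which matches the "response never observed" symptom. The std_srvs::srv::Trigger type is a stand-in, and the snippet assumes the client's node is being spun by an executor in another thread.

```cpp
#include <chrono>
#include <future>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_srvs/srv/trigger.hpp"

void timed_call(rclcpp::Client<std_srvs::srv::Trigger>::SharedPtr client)
{
  auto request = std::make_shared<std_srvs::srv::Trigger::Request>();
  auto result = client->async_send_request(request);  // FutureAndRequestId

  if (result.wait_for(std::chrono::seconds(1)) != std::future_status::ready) {
    // Timeout path: prune the pending request so it does not leak.
    // If the server's response is already in flight, it is discarded once it
    // arrives, so the caller never observes it -- the suspected failure mode.
    client->remove_pending_request(result.request_id);
    return;
  }

  auto response = result.get();  // success path: response observed in time
  (void)response;
}
```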
Hi @ivanpauno, sorry for my late response. My preliminary analysis is the following:
- The affected tests are in the rosbag2_transport package and involve sending requests and receiving responses via service calls. In particular, the affected tests in test_play_services time out waiting on the response future. I've tried to increase those timeouts up to 30 seconds, but it doesn't help; I observe the same failure. It's difficult to analyse this failure since the response simply doesn't arrive, and it happens not so often.
- I've reviewed the rosbag2 code which is responsible for sending the response and found no issues; everything is clear and well written. I've also tried to run valgrind with those failing tests and haven't found anything which could cause memory corruption.
- The failure happens mostly on rmw_fastrtps, but we have seen some cases where the same failure happens on rmw_cyclonedds, for example here https://pipelines.actions.githubusercontent.com/serviceHosts/af0a8eef-e408-4986-ae27-20f00fbcb6f9/_apis/pipelines/1/runs/26054/signedlogcontent/2?urlExpires=2022-06-23T00%3A53%3A29.4599815Z&urlSigningMethod=HMACV1&urlSignature=K%2FQgv3z8x4SAtVeznmXFKGdg05Bur8EcI5VCv4ZaJl0%3D (search for test_play_services__rmw_cyclonedds_cpp).
- I was able to reproduce the failure locally by loading the machine with stress -m 60.
Please let me know if you need more information or details about this issue, or if something is unclear.
This could be related to https://github.com/ros2/rmw_fastrtps/pull/616.
@fujitatomoya Unlikely, since in test_play_services we are not sending many requests in a burst. We send service requests one by one, and each time we verify that we get the corresponding response from the "server" before sending the next request. And at some iteration we simply lose the response.
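As a minimal sketch of that call pattern (not the actual test code; std_srvs::srv::Trigger and the helper name are stand-ins): exactly one request is in flight at a time, and the loop only advances once the previous response has been observed or the per-call timeout expires.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_srvs/srv/trigger.hpp"

// Returns false at the first iteration whose response never shows up,
// which is the symptom seen on CI.
bool play_next_n_times(
  rclcpp::Node::SharedPtr node,
  rclcpp::Client<std_srvs::srv::Trigger>::SharedPtr client,
  std::size_t num_requests,
  std::chrono::seconds per_call_timeout)
{
  for (std::size_t i = 0; i < num_requests; ++i) {
    auto future = client->async_send_request(
      std::make_shared<std_srvs::srv::Trigger::Request>());
    // Block until this response arrives (or the timeout expires) before
    // sending the next request -- there is no burst of outstanding requests.
    auto rc = rclcpp::spin_until_future_complete(node, future, per_call_timeout);
    if (rc != rclcpp::FutureReturnCode::SUCCESS) {
      return false;
    }
  }
  return true;
}
```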
@MichaelOrlov thanks for the comment. I wasn't sure, it just came to mind.
I was able to reproduce the issue; I don't fully understand why it happens yet (and it's pretty hard to reproduce). I will post here if I have further updates.
Description
The following tests have started to fail consistently (three days in a row) in the CI of https://ci.ros2.org/job/nightly_linux_repeated/:
If I'm not wrong, the build shows that the commit used is 891e08128e2d6ff36452871abda2cd776b8a1566, which corresponds to pull request #848.
Expected Behavior
Tests should pass :)
Actual Behavior
Timeout
To Reproduce
Check CI job