ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

CI is newly flaky #2158

Closed · SteveMacenski closed this 3 years ago

SteveMacenski commented 3 years ago

Some of the integration tests just started failing sporadically for no apparent reason; we need to debug whether the regression is in our codebase or in upstream repositories.

SteveMacenski commented 3 years ago

https://github.com/ros-planning/navigation2/pull/2149#issuecomment-772098123

https://circle-production-customer-artifacts.s3.amazonaws.com/picard/forks/5b21c78bcfe8df0013110d01/241124988/6018ecd34b6ad53423a303d8-0-build/artifacts/opt/overlay_ws/log/test/nav2_system_tests/stdout.log?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210202T235537Z&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=AKIAJR3Q6CR467H7Z55A%2F20210202%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=c284ed4d2c93c5334be3730e829b934924a1832a73297d568bf1f271416d9717

I also just skimmed one of the tests (test 13), and I never see that gazebo crash; the system just fails to ever initialize. I actually don't see either of the lines you posted about the crash / exception in that log for the system tests. Instead I see:

14: [lifecycle_manager-6] [ERROR] [1612246927.316509864] [lifecycle_manager_localization]: Failed to change state for node: amcl
14: [lifecycle_manager-6] [ERROR] [1612246927.316589436] [lifecycle_manager_localization]: Failed to bring up all requested nodes. Aborting bringup.

Most of them have some version of "Failed to change state for node", as if the lifecycle manager's service calls are failing. Maybe the gazebo crash you ran into was transient, unless you've seen it multiple times? All 4 of the failing tests had the "Failed to change state for node" issue for random nodes. The other snippets I showed could be caused by a particular node failing to come up (amcl, bt navigator, etc.).

Sigh, is this another service regression issue (the only major uses of services we have are lifecycle state transitions and clear-costmap calls)? Something lifecycle specific? DDS vendor related?
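For context, a rough sketch of how those lifecycle transitions go over services (illustrative only, not the exact nav2_lifecycle_manager code): each managed node exposes a `<node>/change_state` service, and a timed-out or rejected call is what gets reported as "Failed to change state for node".

```cpp
// Hedged sketch, not the actual nav2_lifecycle_manager implementation:
// the manager drives each managed node through the <node>/change_state
// service, so a slow or dropped response surfaces as a failed state change.
#include <chrono>
#include <cstdint>
#include <memory>
#include <string>

#include "lifecycle_msgs/msg/transition.hpp"
#include "lifecycle_msgs/srv/change_state.hpp"
#include "rclcpp/rclcpp.hpp"

bool change_state(
  const rclcpp::Node::SharedPtr & node,
  const std::string & target_node,
  std::uint8_t transition_id,  // e.g. lifecycle_msgs::msg::Transition::TRANSITION_CONFIGURE
  std::chrono::seconds timeout = std::chrono::seconds(10))
{
  auto client = node->create_client<lifecycle_msgs::srv::ChangeState>(
    target_node + "/change_state");

  if (!client->wait_for_service(timeout)) {
    RCLCPP_ERROR(
      node->get_logger(), "change_state service not available for %s",
      target_node.c_str());
    return false;
  }

  auto request = std::make_shared<lifecycle_msgs::srv::ChangeState::Request>();
  request->transition.id = transition_id;

  auto result_future = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(node, result_future, timeout) !=
    rclcpp::FutureReturnCode::SUCCESS)
  {
    // A timed-out service response (e.g. under DDS or CPU pressure) lands
    // here and gets reported as "Failed to change state for node: ...".
    return false;
  }
  return result_future.get()->success;
}
```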

https://app.circleci.com/pipelines/github/ros-planning/navigation2?branch=main Interestingly, look at the nightly jobs: they're all passing with Cyclone, but with Fast DDS it's occasionally failing with this issue. @EduPonz @JaimeMartin has there been a recent Fast DDS regression? I'm not sure why the nightlies aren't failing as often; maybe there's more bandwidth / CPU available late at night on the cloud servers? I think they should have the same resource allocations (CC @ruffsl).

But we changed absolutely nothing and went from 100% test success to something like 40% essentially overnight; it has to be something upstream of us.
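One cheap sanity check (my suggestion, not something already in the test suite) would be to have each CI job print which RMW it actually resolved, to confirm the nightly and PR jobs really only differ by DDS vendor. A minimal sketch:

```cpp
// Minimal sketch: print which RMW implementation a test process actually
// resolved, to confirm whether a job ran against Fast DDS or Cyclone DDS.
// Not part of the nav2 test suite; purely illustrative.
#include <cstdio>
#include <cstdlib>

#include "rclcpp/rclcpp.hpp"
#include "rmw/rmw.h"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  const char * rmw_id = rmw_get_implementation_identifier();
  const char * requested = std::getenv("RMW_IMPLEMENTATION");
  std::printf(
    "RMW in use: %s (RMW_IMPLEMENTATION=%s)\n",
    rmw_id ? rmw_id : "unknown", requested ? requested : "<unset>");
  rclcpp::shutdown();
  return 0;
}
```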
SteveMacenski commented 3 years ago

@EduPonz to continue from https://github.com/ros-planning/navigation2/pull/2149#issuecomment-772714152, I'm seeing an RMW exception from Fast DDS in this log during the lifecycle manager bringup phase (where we see the CI issues):

13: [recoveries_server-9] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
13: [recoveries_server-9]   what():  failed to send response: client will not receive response, at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_response.cpp:126, at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/service.c:287
13: [ERROR] [recoveries_server-9]: process has died [pid 7933, exit code -6, cmd '/opt/overlay_ws/install/nav2_recoveries/lib/nav2_recoveries/recoveries_server --ros-args -r __node:=recoveries_server --params-file /tmp/tmpsbp496iy -r /tf:=tf -r /tf_static:=tf_static'].

Though I only saw that once, so it might not be the cause, but it's an FYI that the service exception not being caught in RMW is still happening.
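As a rough illustration of that failure mode (my sketch, not what nav2_recoveries currently does): the RCLError thrown while sending the service response escapes the executor and aborts the process, so one defensive option while the RMW-level issue is open is catching it around spin():

```cpp
// Illustrative sketch only, not the current nav2_recoveries code: catch the
// RCLError thrown when the RMW fails to send a service response so a single
// dropped response doesn't abort the whole server process.
#include <memory>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("recoveries_server_demo");

  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);

  while (rclcpp::ok()) {
    try {
      executor.spin();
    } catch (const rclcpp::exceptions::RCLError & e) {
      // e.g. "failed to send response: client will not receive response"
      RCLCPP_WARN(node->get_logger(), "RCL error during spin: %s", e.what());
    }
  }

  rclcpp::shutdown();
  return 0;
}
```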

I'm also seeing some parameter errors that I'm working through, which might be contributing as well.

SteveMacenski commented 3 years ago

Unable to start transition 2 from current state active: Transition is not registered., at /home/jenkins-agent/workspace/packaging_linux/ws/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:333

I'm seeing this a lot on some of the jobs (Alexey's PR); transition 2 is TRANSITION_CLEANUP, which isn't a valid transition from the active state. But then I see

Original error: parameter 'GridBased.use_astar' has invalid type: expected [bool] got [string]

on another, and

14: [gzserver-1] gzserver: /usr/include/boost/smart_ptr/enable_shared_from_this.hpp:51: boost::shared_ptr<X> boost::enable_shared_from_this<T>::shared_from_this() [with T = gazebo::transport::Node]: Assertion `p.get() == this' failed.

on another, as the reason for most of the failing jobs. This is truly baffling; none of these things have changed in months as far as I can tell.
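On the GridBased.use_astar error specifically, here's a hedged sketch (names simplified, not the actual planner code) of how that type mismatch can be produced: the node declares the parameter with a bool default, but the override arrives as the string "False".

```cpp
// Hedged sketch of how the GridBased.use_astar error above can be produced:
// the parameter is declared with a bool default, but the override (e.g. from
// a params file or a test that rewrites parameters) supplies the string "False".
// Illustrative only; not the actual planner code.
#include <cstdio>
#include <memory>

#include "rclcpp/rclcpp.hpp"

class PlannerDemo : public rclcpp::Node
{
public:
  PlannerDemo()
  : rclcpp::Node(
      "planner_demo",
      rclcpp::NodeOptions().parameter_overrides(
        // Simulates a params file that quotes the value:
        //   GridBased:
        //     use_astar: "False"   # string, not bool
        {rclcpp::Parameter("GridBased.use_astar", "False")}))
  {
    // Declared as bool, so the string override above makes this throw with a
    // message like:
    //   parameter 'GridBased.use_astar' has invalid type:
    //   expected [bool] got [string]
    use_astar_ = declare_parameter("GridBased.use_astar", false);
  }

private:
  bool use_astar_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  try {
    auto node = std::make_shared<PlannerDemo>();
  } catch (const std::exception & e) {
    std::fprintf(stderr, "%s\n", e.what());
  }
  rclcpp::shutdown();
  return 0;
}
```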

SteveMacenski commented 3 years ago

https://github.com/ros-planning/navigation2/pull/2170 WIP

SteveMacenski commented 3 years ago

My patches seem to fix error 2 above (the parameter type error). No idea why or how that just started to show up now; nothing's changed on the parameters front in over a year.

Why is it working in the nightly builds and on Cyclone then? That's so strange.

The release builds are turning over now, but all the debug builds are still failing.

SteveMacenski commented 3 years ago

It looks like the PR fixes the nav2-related issues, but we still have gazebo crashes in the debug builds:

14: [gzserver-1] gzserver: /usr/include/boost/smart_ptr/enable_shared_from_this.hpp:51: boost::shared_ptr<X> boost::enable_shared_from_this<T>::shared_from_this() [with T = gazebo::transport::Node]: Assertion `p.get() == this' failed.
14: [ERROR] [gzserver-1]: process has died [pid 12865, exit code -6, cmd 'gzserver -s libgazebo_ros_init.so --minimal_comms /opt/overlay_ws/src/navigation2/nav2_system_tests/worlds/turtlebot3_ros2_demo.world'].

It appears this isn't just one issue; it's multiple issues occurring at the same time.

SteveMacenski commented 3 years ago

https://github.com/ros-simulation/gazebo_ros_pkgs/issues/1222

Filed a ticket with gazebo to see if they know what's up. In the meantime, I'm trying to set that variable, because why not :shrug:. I could see this failing from that attempted log call impl_->ros_node_->get_logger(); since that's a gazebo::transport::Node, I assume it would be bypassed if the variable is set.
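For background on that assertion (a generic sketch of mine, not gazebo's code): boost::enable_shared_from_this only works on an object already owned by a boost::shared_ptr, and the `p.get() == this` check in the log is a BOOST_ASSERT inside shared_from_this(). Since BOOST_ASSERT compiles to a no-op under NDEBUG, that would also help explain why only the debug jobs hit it.

```cpp
// Generic illustration of the boost::enable_shared_from_this contract; the
// actual gzserver failure looks like a lifetime / threading problem inside
// gazebo, but the assertion it trips lives in this same shared_from_this()
// machinery and is only compiled in for debug builds (BOOST_ASSERT is a
// no-op when NDEBUG is defined).
#include <iostream>

#include <boost/enable_shared_from_this.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/smart_ptr/bad_weak_ptr.hpp>

class DemoNode : public boost::enable_shared_from_this<DemoNode>
{
public:
  boost::shared_ptr<DemoNode> Handle() {return shared_from_this();}
};

int main()
{
  // Valid: the object is owned by a boost::shared_ptr before
  // shared_from_this() is called.
  boost::shared_ptr<DemoNode> owned(new DemoNode());
  boost::shared_ptr<DemoNode> again = owned->Handle();

  // Invalid: no shared_ptr owns this instance, so shared_from_this()
  // cannot recover an owning pointer and throws boost::bad_weak_ptr.
  DemoNode unowned;
  try {
    unowned.Handle();
  } catch (const boost::bad_weak_ptr & e) {
    std::cerr << "shared_from_this() without shared_ptr ownership: "
              << e.what() << "\n";
  }
  return 0;
}
```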

SteveMacenski commented 3 years ago

Switching our CI over to Cyclone made it pass more often, but it's still flaky; we went from 80%+ failures to roughly 40% failures.

I think this is either a performance regression or CircleCI tightening some networking / CPU limits on us and creating a bottleneck. That would explain why some of the changes I made fixed issues, but it doesn't explain some of the other issues appearing out of the blue. A performance regression from a newly cut release still makes the most sense, since the issues that popped up look like they came from launch API changes along with the performance problems. That's two distinct problems we hit, but it would explain why they happened at the same time.