moveit / moveit2

:robot: MoveIt for ROS 2
https://moveit.ai/
BSD 3-Clause "New" or "Revised" License
1.1k stars 530 forks source link

`OneRobot.rigidlyAttachedParent` test fails in jobs #3103

Open TSNoble opened 4 hours ago

TSNoble commented 4 hours ago

Description

The test mentioned in the title, defined in moveit_core/robot_state/test/robot_state_test.cpp has been frequently failing on Github jobs.

https://github.com/moveit/moveit2/actions/runs/11860308971/job/33055270697 https://github.com/moveit/moveit2/actions/runs/11860056590/job/33054441260 https://github.com/moveit/moveit2/actions/runs/11782535125/job/32817780351 https://github.com/moveit/moveit2/actions/runs/11781483295/job/32814401931

to list a few.

Most notably, this seems to consistently affect the SonarScan job on main, despite other jobs which (I assume run the same tests) consistently passing

ROS Distro

Humble

OS and version

Rolling Docker Image

Source or binary build?

Source

If binary, which release version?

N/A

If source, which branch?

main

Which RMW are you using?

FastRTPS

Steps to Reproduce

Unclear. The test fails fairly consistently on various jobs. I've tried running the test locally 10000 times with the following command...

cd ~/ws_moveit/build/moveit_core/robot_state
./test_robot_state --gtest_filter="*rigid*" --gtest_repeat=10000 | grep "PASSED" | wc -l

Which consistently prints out 10000 passes

Expected behavior

Since the test pass consistently locally, it's reasonable to expect them to do so on the jobs.

Actual behavior

Many PRs have failed with the error given below.

Backtrace or Console output

On the jobs:

    [ RUN      ] OneRobot.rigidlyConnectedParent
    Error:   Group 'base_with_bad_subgroups' has unsatisfied subgroups
             at line 304 in /home/runner/work/moveit2/moveit2/.work/upstream_ws/src/srdfdom/src/model.cpp
    [INFO] [1731689210.102607487] [moveit_1614370902.moveit.core.robot_model]: Loading robot model 'one_robot'...
    [WARN] [1731689210.102743753] [moveit_1614370902.moveit.core.robot_state]: Returning dirty link transforms
    test_robot_state: /home/runner/work/moveit2/moveit2/.work/target_ws/src/moveit2/moveit_core/robot_state/src/robot_state.cpp:1309: const Eigen::Isometry3d& moveit::core::RobotState::getFrameInfo(const std::string&, const moveit::core::LinkModel*&, bool&) const: Assertion `checkLinkTransforms()' failed.

Running locally:

Note: Google Test filter = *rigid*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from OneRobot
[ RUN      ] OneRobot.rigidlyConnectedParent
Error:   Group 'base_with_bad_subgroups' has unsatisfied subgroups
         at line 304 in /root/ws_moveit/src/srdfdom/src/model.cpp
[WARN] [1731700916.222734076] [moveit_85773106]: exception thrown while creating node for logging: failed to create guard condition: context argument is null, at ./src/rcl/guard_condition.c:65
[WARN] [1731700916.222763620] [moveit_85773106]: if rclcpp::init was not called, messages from this logger may be missing from /rosout
[INFO] [1731700916.222779245] [moveit_85773106.moveit.core.robot_model]: Loading robot model 'one_robot'...
[ERROR] [1731700916.222920470] [moveit_85773106.moveit.core.robot_state]: Unable to find link for frame 'no_object'
[ERROR] [1731700916.222934489] [moveit_85773106.moveit.core.robot_state]: Unable to find link for frame 'object/no_subframe'
[ERROR] [1731700916.222937242] [moveit_85773106.moveit.core.robot_state]: Unable to find link for frame ''
[ERROR] [1731700916.222948465] [moveit_85773106.moveit.core.robot_state]: Unable to find link for frame 'object/'
[ERROR] [1731700916.222960187] [moveit_85773106.moveit.core.robot_state]: Unable to find link for frame '/'
[       OK ] OneRobot.rigidlyConnectedParent (5 ms)
[----------] 1 test from OneRobot (5 ms total)

[----------] Global test environment tear-down
[==========] 1 test

Interestingly, the failure always comes after a Returning dirty link transforms message, which does not occur when the tests pass locally. Given the failure is related to invalid transforms, this might be related.

TSNoble commented 4 hours ago

checkLinkTransforms() calls getDirtyLinkTransforms(), which returns dirty_link_transforms_. This variable appears to be updated whenever the RobotState changes, and appears to get reset when updateLinkTransforms() is called.

I think the intent is that updateLinkTransforms() should always be called after modifying the robot state if querying transforms is expected. The OneRobot fixture has an empty teardown, so this is not guaranteed between tests. Some tests call updateLinkTransforms() in their body, so my best guess is that, if the tests are running in a random sequence the failure will occur if it modifies the state and does not call this method.

(Then again, OneRobot only has a robot_model_ member. all RobotStates are local to the tests that use them)

If my assumption is correct, I think the nicest approach would be to add a new test fixture OneRobotWithState, which builds a state and calls updateLinkTransforms() on setup and teardown to guarantee that tests aren't affecting one another.

TSNoble commented 4 hours ago

New theory: The test in question constructs a RobotState from the model, which immediately populates the dirty transforms:

RobotState::RobotState(const RobotModelConstPtr& robot_model)
  : robot_model_(robot_model)
  , has_velocity_(false)
  , has_acceleration_(false)
  , has_effort_(false)
  , dirty_link_transforms_(nullptr)
  , dirty_collision_body_transforms_(nullptr)
  , rng_(nullptr)
{
  if (robot_model == nullptr)
  {
    throw std::invalid_argument("RobotState cannot be constructed with nullptr RobotModelConstPtr");
  }

  dirty_link_transforms_ = robot_model_->getRootJoint();
  init();
}

This seems like the more likely cause, but I'm unsure why this would propagate. I'd usually suspect a race condition issue, but the tests are single-threaded as far as I'm aware

Edit: Actually, I'm unsure why the tests are passing at all given that this implies the dirty links are never reset...

TSNoble commented 2 hours ago

Fixed with #3105. See #3102 for explanation