ros2 / rmw

The ROS Middleware (rmw) Interface.
Apache License 2.0
95 stars 69 forks source link

Ubuntu 22.04 BUG: Failed to delete datawriter in nav2_util tests #362

Open Ryanf55 opened 11 months ago

Ryanf55 commented 11 months ago

Bug report

Required Info:

Steps to reproduce issue

Follow the NAV2 building from source instructions, with the repo on latest rolling hash 113564965f54009686d521902ff3fcc9d101c5b5.

colcon build --mixin debug --packages-up-to nav2_util
. install/setup.bash
colcon test --packages-select nav2_util --event-handlers=console_cohesion+ --ctest-args " -R" " test_actions"

If you want to build everything from source, you need to check out ros rolling sources, nav2 sources, and clone this in the workspace:

Expected behavior

Tests pass

Actual behavior

RMW has an internal error failing to delete the datawriter in test_actions. I already talked to the NAV2 maintainers (Steve) and he said to file the issue here. He sees it in CI, but he says since I can reproduce it locally, to file it ASAP.

      Start 12: test_actions

12: Test command: /usr/bin/python3.10 "-u" "/opt/ros/rolling/share/ament_cmake_test/cmake/run_test.py" "/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml" "--package-name" "nav2_util" "--output-file" "/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/ament_cmake_gtest/test_actions.txt" "--command" "/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test/test_actions" "--gtest_output=xml:/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml"
12: Test timeout computed to be: 60
12: -- run_test.py: invoking following command in '/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test':
12:  - /home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test/test_actions --gtest_output=xml:/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml
12: [==========] Running 7 tests from 1 test suite.
12: [----------] Global test environment set-up.
12: [----------] 7 tests from ActionTest
12: [ RUN      ] ActionTest.test_simple_action
12: [INFO] [1702060361.370104450] [fibonacci_server_node]: Activating
12:  Teardown
12:  Teardown...
12:  Teardown complete
12: [       OK ] ActionTest.test_simple_action (1119 ms)
12: [ RUN      ] ActionTest.test_simple_action_with_feedback
12:  Teardown
12:  Teardown...
12:  Teardown complete
12: [       OK ] ActionTest.test_simple_action_with_feedback (906 ms)
12: [ RUN      ] ActionTest.test_simple_action_activation_cycling
12: [INFO] [1702060363.383970802] [fibonacci_server_node]: Deactivating
12: [WARN] [1702060363.383992725] [fibonacci_server_node]: [fibonacci] [ActionServer] Requested to deactivate server but goal is still executing. Should check if action server is running before deactivating.
12: [INFO] [1702060363.484062276] [fibonacci_server_node]: [fibonacci] [ActionServer] Waiting for async process to finish.
12: [WARN] [1702060363.484068138] [fibonacci_server_node]: [fibonacci] [ActionServer] Stopping the thread per request.
12: [WARN] [1702060363.484149679] [fibonacci_server_node]: [fibonacci] [ActionServer] Aborting handle.
12: Sent goal, spinning til complete...
12: [INFO] [1702060363.492394679] [fibonacci_server_node]: Activating
12: Getting result, spinning til complete...
12:  Teardown
12:  Teardown...
12:  Teardown complete
12: [       OK ] ActionTest.test_simple_action_activation_cycling (1215 ms)
12: [ RUN      ] ActionTest.test_simple_action_preemption
12: Sent goal, spinning til complete...
12: Sent goal, spinning til complete...
12: Getting result, spinning til complete...
12:  Teardown
12:  Teardown...
12:  Teardown complete
12: [       OK ] ActionTest.test_simple_action_preemption (106 ms)
12: [ RUN      ] ActionTest.test_simple_action_preemption_after_succeeded
12: [INFO] [1702060364.705742069] [fibonacci_server_node]: Ignoring preemptions
12:  Teardown
12:  Teardown...
12:  Teardown complete
12: [       OK ] ActionTest.test_simple_action_preemption_after_succeeded (1906 ms)
12: [ RUN      ] ActionTest.test_handle_goal_deactivated
12: [INFO] [1702060366.612432927] [fibonacci_server_node]: Deactivating
12:  Teardown
12: [INFO] [1702060366.612516663] [fibonacci_server_node]: Activating
12:  Teardown...
12:  Teardown complete
12: [       OK ] ActionTest.test_handle_goal_deactivated (6 ms)
12: [ RUN      ] ActionTest.test_handle_cancel
12:  Teardown
12:  Teardown...
12:  Teardown complete
12: [       OK ] ActionTest.test_handle_cancel (6 ms)
12: [----------] 7 tests from ActionTest (5266 ms total)
12: 
12: [----------] Global test environment tear-down
12: [==========] 7 tests from 1 test suite ran. (5266 ms total)
12: [  PASSED  ] 7 tests.
12: [WARN] [1702060366.621645841] [fibonacci_server_node]: [fibonacci] [ActionServer] Current goal was not completed successfully.
12: [WARN] [1702060366.621658776] [fibonacci_server_node]: [fibonacci] [ActionServer] Client requested to cancel the goal. Cancelling.
12: 
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12: 
12:   'feedback publisher is invalid, at ./src/rcl_action/action_server.c:936'
12: 
12: with this new error message:
12: 
12:   'Fail in delete datawriter, at ./src/rmw_service.cpp:115'
12: 
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: 
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12: 
12:   'Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262'
12: 
12: with this new error message:
12: 
12:   'cannot publish data, at ./src/rmw_publish.cpp:62'
12: 
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: 
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12: 
12:   'Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262'
12: 
12: with this new error message:
12: 
12:   'cannot publish data, at ./src/rmw_publish.cpp:62'
12: 
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: 
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12: 
12:   'Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262'
12: 
12: with this new error message:
12: 
12:   'cannot publish data, at ./src/rmw_publish.cpp:62'
12: 
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
12: Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: [ERROR] [1702060366.622855928] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:199
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: [ERROR] [1702060366.622872742] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:199
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: [ERROR] [1702060366.622885286] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:199
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: [ERROR] [1702060366.622898322] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:199
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622925316] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622944834] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622964233] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622982479] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622998631] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.623017288] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
12: [ERROR] [1702060366.623032588] [fibonacci_server_node.rclcpp]: Error in destruction of rcl publisher handle: cannot publish data, at ./src/rmw_publish.cpp:62, at ./src/rcl/publisher.c:201
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.623059261] [fibonacci_server_node]: Error in shutdown of get_type_description service: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: 
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12: 
12:   'Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262'
12: 
12: with this new error message:
12: 
12:   'cannot publish data, at ./src/rmw_publish.cpp:62'
12: 
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: 
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12: 
12:   'cannot publish data, at ./src/rmw_publish.cpp:62'
12: 
12: with this new error message:
12: 
12:   'Failed to delete datareader, at ./src/subscription.cpp:52'
12: 
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: __function__:80: 'destroy_subscription' failed
12: test_actions: pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
12: -- run_test.py: return code -6
12: -- run_test.py: inject classname prefix into gtest result file '/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml'
12: -- run_test.py: verify result file '/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml'
12/19 Test #12: test_actions .....................***Failed    5.39 sec
test 13 

Additional information

CPU: AMD® Ryzen 9 7950x 16-core processor × 32 GPU: NVIDIA GeForce RTX 3070/PCIe/SSE2 / NVIDIA Corporation GA104 [GeForce RTX 3070]

$ uname -a
Linux B650-970 6.2.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct  6 10:23:26 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
SteveMacenski commented 11 months ago

We see this in Nav2's CI periodically as well, as reported on the Slack (@clalancette ) It is pretty concerning to us that we again have new RMW errors appearing for really basic tests. This is a new thing in the last few months

fujitatomoya commented 11 months ago

@SteveMacenski @Ryanf55

This issue has been there for a long time? Or just stated failing recently? if you happen know which commit causes this error behavior on Nav2, that would be really helpful.

[ERROR] [1702060366.623059261] [fibonacci_server_node]: Error in shutdown of get_type_description service: Fail in delete datawriter

could be related to Type Description Support... but not sure.

CC: @iuhilnehc-ynos @Barry-Xu-2018

SteveMacenski commented 11 months ago

It’s not deterministic in our CI, so I can’t pinpoint the exact commit. Its been the last few months though.

Ryanf55 commented 11 months ago

@SteveMacenski @Ryanf55

This issue has been there for a long time? Or just stated failing recently? if you happen know which commit causes this error behavior on Nav2, that would be really helpful.

[ERROR] [1702060366.623059261] [fibonacci_server_node]: Error in shutdown of get_type_description service: Fail in delete datawriter

could be related to Type Description Support... but not sure.

CC: @iuhilnehc-ynos @Barry-Xu-2018

I'll try to set up a source build of rmw to run a bisect on when I get home in a few days.

Ryanf55 commented 11 months ago

Sorry, I haven't had time to do this yet. It might be easier for me to just debug this with gdb because a bisect involves recompiling ros2, then nav2, which takes a while each iteration (158 packages using --packages-up-to).

Ryanf55 commented 11 months ago

Here's the bisect report:

/home/ryan/Dev/ros2_rolling/src/ros2/rmw_dds_common/rmw_dds_common/include/rmw_dds_common/qos.hpp:242:3: error: ‘rosidl_type_hash_t’ has not been declared

If I want to bisect this further, I'll need to start doing the bisect across multiple repos since they are coupled together due to the public API additions. Are there any tools to roll back state across multiple repos, say with vcs? Or, do I manually need to figure out which hashes of each repo correspond to each other?

Ryanf55 commented 11 months ago

@SteveMacenski Does NAV2 CI give you any way to look at test regressions per test to see the first date that test started failing? I know cdash has regression reporting capabilities.

From the CDash blog, it looks like you can look at test failure frequency, but not over time. https://circleci.com/blog/how-to-output-junit-tests-through-circleci-2-0-for-expanded-insights/

SteveMacenski commented 11 months ago

Unless CircleCI produces that, we dont have any unique tools

vooon commented 9 months ago

I encountered very similar error, https://github.com/mavlink/mavros/actions/runs/8129217925/job/22234646274 the same code works on Humble, but fails on Iron.