ros-controls / ros2_control

Generic and simple controls framework for ROS 2
https://control.ros.org
Apache License 2.0
481 stars 294 forks source link

Segfault in hardware interface when reading robot_description from topic #1442

Closed firesurfer closed 3 months ago

firesurfer commented 6 months ago

Describe the bug

I have a hardware interface for some piece of internal hardware. When I have the controller_manager obtain the robot_description from the robot_state_publisher instead of passing it as a parameter I get a segfault in the read method of the hardware interface. This behavior does not occur when starting it with gdb attached!

To Reproduce

Start the ros2_control_node configured to read the robot_description from the corresponding topic.

Expected behavior

Should not crash ;)

Environment (please complete the following information):

Additional context

I already call the read method once in the activate method in order to obtain the current positions of the hardware. This call succeeds. But in the read method there is quite a bit of logic as I need to workaround some quirks of our hardware.

Stacktrace (I replaced the names):

[ros2_control_node-12] #7    Source "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 [0x75c70952684f]
[ros2_control_node-12] #6    Source "./nptl/pthread_create.c", line 442, in start_thread [0x75c709494ac2]
[ros2_control_node-12] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x75c7098dc252, in 
[ros2_control_node-12] #4    Object "/opt/ros/iron/lib/controller_manager/ros2_control_node", at 0x58f626c40a08, in 
[ros2_control_node-12] #3    Object "/opt/ros/iron/lib/libcontroller_manager.so", at 0x75c70a0869c2, in controller_manager::ControllerManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #2    Object "/opt/ros/iron/lib/libhardware_interface.so", at 0x75c709eb3942, in hardware_interface::ResourceManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #1    Object "/opt/ros/iron/lib/libhardware_interface.so", at 0x75c709ed524f, in hardware_interface::System::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #0    Object "Path to our hw interface plugin lib", at 0x75c6fc4a5da1, in NameOftheComponent...::read(rclcpp::Time const&, rclcpp::Duration const&)

Starting the node with GDB attached delays startup a tiny bit. So it might be somehow timing related. In the hardware interface I access the URDF via system_info.original_xml and read some parameters such as joint limits from it.

saikishor commented 6 months ago

Hello!

Earlier to these logs do you see any info about resource successfully initialized the components from the URDF from the topic before the crash?. Can you try using mock components and check if this is the case?

Thank you

firesurfer commented 6 months ago

Hi, I spent a bit more time on figuring out what happens. I get really inconsistent results:

  1. The described segfault - I get some log messages of the kind: [resource_manager]: Successful activate of hardware <hardware name>
  2. Sometimes I run into the error handler - the previous state is then inactive. I do not think this is due to our code in the hardware interface. As every case where I would return Failure/Error as a return code, does also log it.
  3. A successful start

I get one of these behaviors at each startup. Adding /removing additional log statements changes which behavior I get most of the time. (Passing the robot_description as a parameter always results in a successful start). For me this looks like some kind of race condition / threading issue.

Using mocked hardware instead this issue does not exist.

I have the same hardware interface loaded 4 times btw. The error occurs in a differnt one each time.

saikishor commented 6 months ago

Ok, based on what you said, can you add the following condition if (resource_manager_->is_urdf_already_loaded()) at the beginning of the read and write cycles of the resource_manager? and see it it happens again. So, basically until the whole URDF and components are loaded you tend to skip the read and write cycle. Let's see if this fixes your issue

firesurfer commented 6 months ago

I guess I have to compile the resource manager myself then ? I will try that.

firesurfer commented 6 months ago

Could you please point me out where I have to add this exactly ? I am not too familiar with the ros2 control code base

saikishor commented 6 months ago

Hello @firesurfer !

You can tryout this branch: https://github.com/saikishor/ros2_control/tree/robot_description/crash/iron and let us know.

firesurfer commented 6 months ago

@saikishor Thanks for your help. I probably won't be able to try it out this week but I will definitely find some time next week.

saikishor commented 6 months ago

Sure. No worries

firesurfer commented 6 months ago

@saikishor Just tried your patch. It seems to solve the segfault issue in the read method. But in like 1 of 10 starts one of the hardware interfaces goes into its on_error error handler with the previous state: inactive which is interesting as I log at the end of the on_activate an "Activate finished" message just before I return SUCCESS

saikishor commented 6 months ago

@firesurfer thanks for confirming it. Looking at the code last night, I think I might know the root cause of the issue. In 1 hour, I can send you another branch to test, maybe it might solve the issue

saikishor commented 6 months ago

Hello @firesurfer,

Could you test this branch: https://github.com/saikishor/ros2_control/tree/fix/iron/missing/resources_lock and let me know if this fixes your issue. Ideally, it should as the issue might be coming from some mutexes

firesurfer commented 6 months ago

@saikishor Just tried your branch and it introduces a new segfault in the read method:

[ros2_control_node-12] #7    Source "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 [0x796d3cd2684f]
[ros2_control_node-12] #6    Source "./nptl/pthread_create.c", line 442, in start_thread [0x796d3cc94ac2]
[ros2_control_node-12] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x796d3d0dc252, in 
[ros2_control_node-12] #4    Object "/install/controller_manager/lib/controller_manager/ros2_control_node", at 0x57819bbb53a4, in main::{lambda()#1}::operator()() const
[ros2_control_node-12] #3    Object "/install/controller_manager/lib/libcontroller_manager.so", at 0x796d3d875d32, in controller_manager::ControllerManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #2    Object "/install/hardware_interface/lib/libhardware_interface.so", at 0x796d3d3a71f4, in hardware_interface::ResourceManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #1    Object "/install/hardware_interface/lib/libhardware_interface.so", at 0x796d3d3e2d0f, in hardware_interface::System::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #0    Object "<path>/my_component.so", at 0x796d3821dfe1, in <my component>::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] Segmentation fault (Address not mapped to object [0x60])

Starting it again and it worked. So the race condition / threading issue still seems to be there

EDIT: Looks a bit like it works roughly every second time ;)

saikishor commented 6 months ago

Lol, so now it works once in every 2 runs. Right before this segmentation fault, do you have more logs on activation of the hardware components?

firesurfer commented 6 months ago

Soo this time I had in 8 startups:

3x on_error -> Log was before the on_activate finished but on activate did print Activate finished

1x on_error > Log was after on_activate -> did print Activate finished before.

1x successful

2x Segfault -> All hardware interface activated successfully

1x on_error-> Log was after all hardware interfaces reported successful activation

Looks like I was just lucky when I said it works like every 2 runs....

saikishor commented 6 months ago

@firesurfer I've pushed another similar commit to the same branch, At least with this the read and write cycles shouldn't run unless the hardware components are loaded and initialized, I've added some logging as well for easier reference. Can you please pull the change and check?

firesurfer commented 6 months ago

@saikishor I pulled this commit: d98c0ee6a654459ffeefc00c7783707aeb9a0f27

3x on_error I see before activation of the hw interface: [ros2_control_node-12] [INFO] [1710763241.386743179] [resource_manager]: Releasing the resources lock as hardware is initialized! +

[ros2_control_node-12] [INFO] [1710763241.387091189] [resource_manager]: Successful 'configure' of hardware 'my hardware'
[ros2_control_node-12] [INFO] [1710763241.387106778] [resource_manager]: 'activate' hardware 'my hardware' 

1x Segfault after everything was successfully activated

1x on_error but print out after [ERROR] [ros2_control_node-12]: process has died (Could because one of the output streams is buffered and the other is not?!)

Interestingly it tells me something about: [ros2_control_node-12] Stack trace (most recent call last) in thread 2751259: but does not print one afterwards

1x Segfault but with complaining about empty activate / deactivate list (we have node in our system which tries to disable all our motion controllers under certain conditions )

[ros2_control_node-12] [INFO] [1710763734.794714775] [controller_manager]: Empty activate and deactivate list, not requesting switch [ros2_control_node-12] #7 Source "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 [0x768c97b2684f] [ros2_control_node-12] #6 Source "./nptl/pthread_create.c", line 442, in start_thread [0x768c97a94ac2] [ros2_control_node-12] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x768c97edc252, in [ros2_control_node-12] #4 Object "install/controller_manager/lib/controller_manager/ros2_control_node", at 0x56f2f2f553a4, in main::{lambda()#1}::operator()() const [ros2_control_node-12] #3 Object "/install/controller_manager/lib/libcontroller_manager.so", at 0x768c9871cd32, in controller_manager::ControllerManager::read(rclcpp::Time const&, rclcpp::Duration const&) [ros2_control_node-12] #2 Object "install/hardware_interface/lib/libhardware_interface.so", at 0x768c984e71f4, in hardware_interface::ResourceManager::read(rclcpp::Time const&, rclcpp::Duration const&) [ros2_control_node-12] #1 Object "install/hardware_interface/lib/libhardware_interface.so", at 0x768c98522fef, in hardware_interface::System::read(rclcpp::Time const&, rclcpp::Duration const&) [ros2_control_node-12] #0 Object "install/scicore_driver/lib/libscicore_driver_ros_plugin.so", at 0x768c940c3fe1, in mycomponent::read(rclcpp::Time const&, rclcpp::Duration const&) [ros2_control_node-12] Segmentation fault (Address not mapped to object [0x60])

saikishor commented 6 months ago

then the problem might be coming from here https://github.com/ros-controls/ros2_control/blob/20e01d867fd14c1a3a0b8ea999d41d9d27dcf3bb/controller_manager/src/controller_manager.cpp#L339-L357 https://github.com/ros-controls/ros2_control/blob/20e01d867fd14c1a3a0b8ea999d41d9d27dcf3bb/hardware_interface/src/resource_manager.cpp#L335

This is executed right after the hardware components are loaded from the URDF method in the load_urdf method.

saikishor commented 6 months ago

@firesurfer I've pushed another commit to the same branch. Can you test it please?

firesurfer commented 6 months ago

Good news @saikishor . I did like 10 starts and all of them were successful :)

saikishor commented 6 months ago

Awesome. I'm glad that the changes worked. I would then try to open a PR to have the fix upstream

saikishor commented 6 months ago

@firesurfer If you have time, please review the changes in the above linked PR.

Thank you