Closed firesurfer closed 3 months ago
Hello!
Earlier to these logs do you see any info about resource successfully initialized the components from the URDF from the topic before the crash?. Can you try using mock components and check if this is the case?
Thank you
Hi, I spent a bit more time on figuring out what happens. I get really inconsistent results:
[resource_manager]: Successful activate of hardware <hardware name>
inactive
. I do not think this is due to our code in the hardware interface. As every case where I would return Failure/Error as a return code, does also log it. I get one of these behaviors at each startup. Adding /removing additional log statements changes which behavior I get most of the time. (Passing the robot_description as a parameter always results in a successful start). For me this looks like some kind of race condition / threading issue.
Using mocked hardware instead this issue does not exist.
I have the same hardware interface loaded 4 times btw. The error occurs in a differnt one each time.
Ok, based on what you said, can you add the following condition if (resource_manager_->is_urdf_already_loaded())
at the beginning of the read and write cycles of the resource_manager? and see it it happens again. So, basically until the whole URDF and components are loaded you tend to skip the read and write cycle. Let's see if this fixes your issue
I guess I have to compile the resource manager myself then ? I will try that.
Could you please point me out where I have to add this exactly ? I am not too familiar with the ros2 control code base
Hello @firesurfer !
You can tryout this branch: https://github.com/saikishor/ros2_control/tree/robot_description/crash/iron and let us know.
@saikishor Thanks for your help. I probably won't be able to try it out this week but I will definitely find some time next week.
Sure. No worries
@saikishor Just tried your patch.
It seems to solve the segfault issue in the read
method. But in like 1 of 10 starts one of the hardware interfaces goes into its on_error
error handler with the previous state: inactive
which is interesting as I log at the end of the on_activate
an "Activate finished" message just before I return SUCCESS
@firesurfer thanks for confirming it. Looking at the code last night, I think I might know the root cause of the issue. In 1 hour, I can send you another branch to test, maybe it might solve the issue
Hello @firesurfer,
Could you test this branch: https://github.com/saikishor/ros2_control/tree/fix/iron/missing/resources_lock and let me know if this fixes your issue. Ideally, it should as the issue might be coming from some mutexes
@saikishor Just tried your branch and it introduces a new segfault in the read method:
[ros2_control_node-12] #7 Source "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 [0x796d3cd2684f]
[ros2_control_node-12] #6 Source "./nptl/pthread_create.c", line 442, in start_thread [0x796d3cc94ac2]
[ros2_control_node-12] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x796d3d0dc252, in
[ros2_control_node-12] #4 Object "/install/controller_manager/lib/controller_manager/ros2_control_node", at 0x57819bbb53a4, in main::{lambda()#1}::operator()() const
[ros2_control_node-12] #3 Object "/install/controller_manager/lib/libcontroller_manager.so", at 0x796d3d875d32, in controller_manager::ControllerManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #2 Object "/install/hardware_interface/lib/libhardware_interface.so", at 0x796d3d3a71f4, in hardware_interface::ResourceManager::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #1 Object "/install/hardware_interface/lib/libhardware_interface.so", at 0x796d3d3e2d0f, in hardware_interface::System::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] #0 Object "<path>/my_component.so", at 0x796d3821dfe1, in <my component>::read(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-12] Segmentation fault (Address not mapped to object [0x60])
Starting it again and it worked. So the race condition / threading issue still seems to be there
EDIT: Looks a bit like it works roughly every second time ;)
Lol, so now it works once in every 2 runs. Right before this segmentation fault, do you have more logs on activation of the hardware components?
Soo this time I had in 8 startups:
3x on_error
-> Log was before the on_activate
finished but on activate did print Activate finished
1x on_error
> Log was after on_activate
-> did print Activate finished
before.
1x successful
2x Segfault -> All hardware interface activated successfully
1x on_error
-> Log was after all hardware interfaces reported successful activation
Looks like I was just lucky when I said it works like every 2 runs....
@firesurfer I've pushed another similar commit to the same branch, At least with this the read and write cycles shouldn't run unless the hardware components are loaded and initialized, I've added some logging as well for easier reference. Can you please pull the change and check?
@saikishor I pulled this commit: d98c0ee6a654459ffeefc00c7783707aeb9a0f27
3x on_error
I see before activation of the hw interface: [ros2_control_node-12] [INFO] [1710763241.386743179] [resource_manager]: Releasing the resources lock as hardware is initialized!
+
[ros2_control_node-12] [INFO] [1710763241.387091189] [resource_manager]: Successful 'configure' of hardware 'my hardware'
[ros2_control_node-12] [INFO] [1710763241.387106778] [resource_manager]: 'activate' hardware 'my hardware'
1x Segfault after everything was successfully activated
1x on_error
but print out after [ERROR] [ros2_control_node-12]: process has died
(Could because one of the output streams is buffered and the other is not?!)
Interestingly it tells me something about: [ros2_control_node-12] Stack trace (most recent call last) in thread 2751259:
but does not print one afterwards
1x Segfault but with complaining about empty activate / deactivate list (we have node in our system which tries to disable all our motion controllers under certain conditions )
[ros2_control_node-12] [INFO] [1710763734.794714775] [controller_manager]: Empty activate and deactivate list, not requesting switch [ros2_control_node-12] #7 Source "../sysdeps/unix/sysv/linux/x86_64/clone3.S", line 81, in __clone3 [0x768c97b2684f] [ros2_control_node-12] #6 Source "./nptl/pthread_create.c", line 442, in start_thread [0x768c97a94ac2] [ros2_control_node-12] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x768c97edc252, in [ros2_control_node-12] #4 Object "install/controller_manager/lib/controller_manager/ros2_control_node", at 0x56f2f2f553a4, in main::{lambda()#1}::operator()() const [ros2_control_node-12] #3 Object "/install/controller_manager/lib/libcontroller_manager.so", at 0x768c9871cd32, in controller_manager::ControllerManager::read(rclcpp::Time const&, rclcpp::Duration const&) [ros2_control_node-12] #2 Object "install/hardware_interface/lib/libhardware_interface.so", at 0x768c984e71f4, in hardware_interface::ResourceManager::read(rclcpp::Time const&, rclcpp::Duration const&) [ros2_control_node-12] #1 Object "install/hardware_interface/lib/libhardware_interface.so", at 0x768c98522fef, in hardware_interface::System::read(rclcpp::Time const&, rclcpp::Duration const&) [ros2_control_node-12] #0 Object "install/scicore_driver/lib/libscicore_driver_ros_plugin.so", at 0x768c940c3fe1, in mycomponent::read(rclcpp::Time const&, rclcpp::Duration const&) [ros2_control_node-12] Segmentation fault (Address not mapped to object [0x60])
then the problem might be coming from here https://github.com/ros-controls/ros2_control/blob/20e01d867fd14c1a3a0b8ea999d41d9d27dcf3bb/controller_manager/src/controller_manager.cpp#L339-L357 https://github.com/ros-controls/ros2_control/blob/20e01d867fd14c1a3a0b8ea999d41d9d27dcf3bb/hardware_interface/src/resource_manager.cpp#L335
This is executed right after the hardware components are loaded from the URDF method in the load_urdf
method.
@firesurfer I've pushed another commit to the same branch. Can you test it please?
Good news @saikishor . I did like 10 starts and all of them were successful :)
Awesome. I'm glad that the changes worked. I would then try to open a PR to have the fix upstream
@firesurfer If you have time, please review the changes in the above linked PR.
Thank you
Describe the bug
I have a hardware interface for some piece of internal hardware. When I have the
controller_manager
obtain therobot_description
from therobot_state_publisher
instead of passing it as a parameter I get a segfault in theread
method of the hardware interface. This behavior does not occur when starting it with gdb attached!To Reproduce
Start the
ros2_control_node
configured to read therobot_description
from the corresponding topic.Expected behavior
Should not crash ;)
Environment (please complete the following information):
hardware_interface
version: 3.24.0Additional context
I already call the
read
method once in theactivate
method in order to obtain the current positions of the hardware. This call succeeds. But in the read method there is quite a bit of logic as I need to workaround some quirks of our hardware.Stacktrace (I replaced the names):
Starting the node with GDB attached delays startup a tiny bit. So it might be somehow timing related. In the hardware interface I access the URDF via
system_info.original_xml
and read some parameters such as joint limits from it.