ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0
536 stars 417 forks source link

Timer in MultiThreadedExecutor crashes #2455

Open roncapat opened 6 months ago

roncapat commented 6 months ago

Platform: Ubuntu 22.04 - Iron Irwini compiled from source. I've just vcs pulled, rebuilt and re-tested everything.

I built a very simple rclcpp component with a timer spinning at 200 Hz, publishing an empty Imu message. When loading the component in a component_container_mt, in a few seconds this is what happens (see below backtrace).

Seems that something wrong happens during allocation/deallocation of some resources. I attach the .zip file with demo package for debug purposes.

Thread 19 "component_conta" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe95fa640 (LWP 848599)]
0x00007ffff7c616cc in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
(gdb) bt
#0  0x00007ffff7c616cc in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#1  0x00007ffff7c29c3e in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#2  0x00007ffff7c8c98c in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#3  0x00007ffff7c8e3a1 in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#4  0x00007ffff7581e99 in __gnu_cxx::new_allocator<void*>::deallocate(void**, unsigned long) ()
   from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#5  0x00007ffff758020b in std::allocator_traits<std::allocator<void*> >::deallocate(std::allocator<void*>&, void**, unsigned long) () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#6  0x00007ffff757e2bc in void* rclcpp::allocator::retyped_reallocate<void*, std::allocator<void*> >(void*, unsigned long, void*) () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#7  0x00007ffff79ed8eb in rcl_wait_set_resize () from /home/user/ros2_iron/install/rcl/lib/librcl.so
#8  0x00007ffff750d27c in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#9  0x00007ffff750e0c6 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#10 0x00007ffff7536b25 in rclcpp::executors::MultiThreadedExecutor::run(unsigned long) ()
   from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#11 0x00007ffff7538c4d in void std::__invoke_impl<void, void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&>(std::__invoke_memfun_deref, void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&) ()
   from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#12 0x00007ffff7538b5f in std::__invoke_result<void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&>::type std::__invoke<void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&>(void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&) ()
   from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#13 0x00007ffff7538a68 in void std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>::__call<void, , 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#14 0x00007ffff75389ab in void std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>::operator()<, void>() ()
   from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#15 0x00007ffff7538952 in void std::__invoke_impl<void, std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>(std::__invoke_other, std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>&&) () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#16 0x00007ffff75388fb in std::__invoke_result<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>::type std::__invoke<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>(std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>&&) () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#17 0x00007ffff753889c in void std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#18 0x00007ffff753886c in std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > >::operator()() ()
   from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#19 0x00007ffff753884c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > > >::_M_run() () from /home/user/ros2_iron/install/rclcpp/lib/librclcpp.so
#20 0x00007ffff66e6793 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007ffff6294ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#22 0x00007ffff6326660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

timer_test.zip

Zard-C commented 6 months ago

Seems it crashed in rclcpp, since you are using Iron Irwini compiled from source, I think you could build a debug version of rclcpp package, reproduce this bug, and use gdb to see where it crashed. Btw, I noticed that you are using jemalloc

roncapat commented 6 months ago

Btw, I noticed that you are using jemalloc

Are there choices? I mean, I didn't know I can have other choices here... can you explain me this point, or direct me to resources to read?

Zard-C commented 6 months ago

Btw, I noticed that you are using jemalloc

Are there choices? I mean, I didn't know I can have other choices here... can you explain me this point, or direct me to resolurces to read? Maybe we can run without linking libjemalloc, and see if this problem can be reproduced 😸

roncapat commented 6 months ago

Sorry for my lack of knowledge on this, but... is there a procedure to run this without jemalloc? And is jemalloc used at wich level, like rclcpp, rclc or deeper? Is jemalloc a sort of default choice that can be overridden by rebuilding some packages?

alsora commented 6 months ago

ROS doesn't depend on jemalloc, nor uses it. This comes from your system configuration

roncapat commented 6 months ago

New backtrace, with rclcpp in RelWithDebInfo mode:

#3  0x00007ffff7c8e3a1 in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#4  0x00007ffff77bc41a in __gnu_cxx::new_allocator<void*>::deallocate (__t=1, __p=<optimized out>, this=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:145
#5  std::allocator_traits<std::allocator<void*> >::deallocate (__n=1, __p=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:496
#6  rclcpp::allocator::retyped_reallocate<void*, std::allocator<void*> > (untyped_pointer=<optimized out>, size=8, untyped_allocator=<optimized out>)
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/include/rclcpp/allocator/allocator_common.hpp:78
#7  0x00007ffff796c98f in rcl_wait_set_resize () from /home/user/ros2_iron/install/rcl/lib/librcl.so
#8  0x00007ffff77899c2 in rclcpp::Executor::wait_for_work (this=0x7ffff02cf890, timeout=...) at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:794
#9  0x00007ffff778a073 in rclcpp::Executor::get_next_executable (this=0x7ffff02cf890, any_executable=..., timeout=std::chrono::duration = { -1ns })
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:965
#10 0x00007ffff779a372 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7ffff02cf890, this_thread_number=<optimized out>)
    at /home/user/ros2_iron/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:92
#11 0x00007ffff72e6793 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007ffff6e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#13 0x00007ffff6f26660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Seems that in rcl_wait_set_resize something goes wrong. I'm also rebuilding rcl to dig more...

roncapat commented 6 months ago

I found this possibly related issue: https://github.com/ros2/rmw_fastrtps/issues/728 I'm rebuilding rmw_fastrtps to investigate more.

Seems related to another run I did:

#4  0x00007ffff7c0e0e4 in ?? () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#5  0x00007ffff7c998fd in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
#6  0x00007ffff538aafc in __gnu_cxx::new_allocator<eprosima::fastdds::dds::Condition*>::allocate(unsigned long, void const*) ()
   from /home/user/ros2_iron/install/rmw_fastrtps_shared_cpp/lib/librmw_fastrtps_shared_cpp.so
#7  0x00007ffff538a9cf in std::allocator_traits<std::allocator<eprosima::fastdds::dds::Condition*> >::allocate(std::allocator<eprosima::fastdds::dds::Condition*>&, unsigned long) ()
   from /home/user/ros2_iron/install/rmw_fastrtps_shared_cpp/lib/librmw_fastrtps_shared_cpp.so
#8  0x00007ffff538a8ac in std::_Vector_base<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_allocate(unsigned long) ()
   from /home/user/ros2_iron/install/rmw_fastrtps_shared_cpp/lib/librmw_fastrtps_shared_cpp.so
#9  0x00007ffff538a535 in void std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::_M_realloc_insert<eprosima::fastdds::dds::Condition*>(__gnu_cxx::__normal_iterator<eprosima::fastdds::dds::Condition**, std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> > >, eprosima::fastdds::dds::Condition*&&) ()
   from /home/user/ros2_iron/install/rmw_fastrtps_shared_cpp/lib/librmw_fastrtps_shared_cpp.so
#10 0x00007ffff538a3a0 in eprosima::fastdds::dds::Condition*& std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::emplace_back<eprosima::fastdds::dds::Condition*>(eprosima::fastdds::dds::Condition*&&) () from /home/user/ros2_iron/install/rmw_fastrtps_shared_cpp/lib/librmw_fastrtps_shared_cpp.so
#11 0x00007ffff538a102 in std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >::push_back(eprosima::fastdds::dds::Condition*&&) ()
   from /home/user/ros2_iron/install/rmw_fastrtps_shared_cpp/lib/librmw_fastrtps_shared_cpp.so
#12 0x00007ffff53877d3 in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) () from /home/user/ros2_iron/install/rmw_fastrtps_shared_cp
Zard-C commented 6 months ago

Seems that in rcl_wait_set_resize something goes wrong. I'm also rebuilding rcl to dig more...

Hi, I believe the crash is related to memory deallocation. To address this, you can use AddressSanitizer by following these steps:

Rebuild your application with the following command:

colcon build --cmake-args -DCMAKE_CXX_FLAGS="-fsanitize=address -g" -DCMAKE_C_FLAGS="-fsanitize=address -g" --packages-select <name-of-your package>

Run your application and examine the error messages produced by the sanitizer for further insights.

Zard-C commented 6 months ago

I've tried your code in iron-desktop docker, unfortunately, I can't reproduce:

I modified your launch.py

           # prefix="tmux split-window gdb -ex run --args", # run then with "tmux new-session ros2 launch ..."       

and use

ros2 launch  node.launch.py 

about 10 minutes

[INFO] [launch]: All log files can be found below /root/.ros/log/2024-03-20-17-46-17-374247-9988e08b3540-27165
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [component_container_mt-1]: process started with pid [27189]
[component_container_mt-1] [INFO] [1710956777.786610833] [localisation_master]: Load Library: /share/debug_wks/install/timer_test/lib/libTTest.so
[component_container_mt-1] [INFO] [1710956777.790359426] [localisation_master]: Found class: rclcpp_components::NodeFactoryTemplate<test::timer>
[component_container_mt-1] [INFO] [1710956777.790432846] [localisation_master]: Instantiate class: rclcpp_components::NodeFactoryTemplate<test::timer>
[INFO] [launch_ros.actions.load_composable_nodes]: Loaded node '/timer_test' in container '/localisation_master'
^C^C[WARNING] [launch]: user interrupted with ctrl-c (SIGINT)
[component_container_mt-1] [INFO] [1710957541.966848000] [rclcpp]: signal_handler(signum=2)
[INFO] [component_container_mt-1]: process has finished cleanly [pid 27189]
roncapat commented 6 months ago

@Zard-C thank you for checking. I will also try with every machine I have available... Tried also with two different RMWs, both induce a crash.

And I will try also what you suggest with the sanitizer.

mjcarroll commented 6 months ago

I'm going to see if I can reproduce this locally with and without ros2/rclcpp#2142

clalancette commented 6 months ago

For what it is worth, I also tried to reproduce this locally, and I can't do it.

What I did was to download the timer_test.zip from the original comment. Then I built it locally, against both iron on Ubuntu 22.04 and rolling on Ubuntu 24.04. Then I sourced the environment and ran:

ros2 run timer_test TTimerExecutable

And I was able to run it for at least a couple of minutes without a crash. I even tried putting some additional stress on my machine (which can cause race conditions), and I didn't see it either.

@roncapat Is there something else I need to do to be able to reproduce this?

roncapat commented 6 months ago

@clalancette not really, for me it was enough to trigger the issue.

Nevertheless I may get some spare time to try again in the upcoming days, testing also latest rolling. I'm seeing a lot of activities recently on rclcpp executors and related stuff, so it may also be possible that this got somehow solved.

Unfortunately, in the latter case, I may not have enough time to do a full search in the last N commits of the repo to identify the specific commit eventually responsible of the fix.

At least for now, thank you all for running the test and checking this issue :)

roncapat commented 6 months ago

@clalancette one last thing... Could you try using the launch file instead? I'm on mobile now, but IIRC the executable by default may not make use of the MultiThreadedExecutor, while by composition via launch file I explicitly set the type of container (thus, the type of executor).

clalancette commented 6 months ago

@clalancette one last thing... Could you try using the launch file instead? I'm on mobile now, but IIRC the executable by default may not make use of the MultiThreadedExecutor, while by composition via launch file I explicitly set the type of container (thus, the type of executor).

I tried that as well, at least on Rolling. And I couldn't reproduce there.

mjcarroll commented 6 months ago

For what it is worth, I also tried to reproduce this locally, and I can't do it.

I attempted to reproduce this with the timer_test packages from the original issue last week and forgot to send my update. I tested before and after ros2/rclcpp#2142 went in and wasn't able to in either case.

I did not apply stress, but it doesn't seem that did much.