ros-sports / nao_lola

Apache License 2.0
5 stars 6 forks source link

Sometimes segfaults when receiving joint commands #10

Closed sidrdesai closed 1 year ago

sidrdesai commented 1 year ago

When nao_lola is running on its own there is no issue, but when there is another node publishing joint commands, sometimes nao_lola will randomly segfault. The robot will run fine for 3 or 4 minutes initially before crashing. This becomes more frequent over time though until the robot is rebooted.

We are running nao_lola and the joint publisher node on the Nao V6 robot running Ubuntu 20.04 and ros2 galactic.

ijnek commented 1 year ago

Thanks for the report of the issue @sidrdesai, if you have been able to catch the segmentation fault in gdb, could you share the backtrace please?

sidrdesai commented 1 year ago
Core was generated by `./ros_pkgs/lib/nao_lola/nao_lola'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  malloc_consolidate (av=av@entry=0x7f62dd276b80 <main_arena>) at malloc.c:4475
4475    malloc.c: No such file or directory.
[Current thread is 1 (Thread 0x7f62dcdcb640 (LWP 2965))]
(gdb) backtrace
#0  malloc_consolidate (av=av@entry=0x7f62dd276b80 <main_arena>) at malloc.c:4475
#1  0x00007f62dd121c83 in _int_malloc (av=av@entry=0x7f62dd276b80 <main_arena>, bytes=bytes@entry=1728)
    at malloc.c:3699
#2  0x00007f62dd124299 in __GI___libc_malloc (bytes=1728) at malloc.c:3066
#3  0x00007f62dd341b39 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f62dd4daff6 in rcl_wait_set_resize () from /opt/ros/galactic/lib/librcl.so
#5  0x00007f62dd6473c0 in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/galactic/lib/librclcpp.so
#6  0x00007f62dd6478c3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/galactic/lib/librclcpp.so
#7  0x00007f62dd64b0ad in rclcpp::executors::SingleThreadedExecutor::spin() ()
   from /opt/ros/galactic/lib/librclcpp.so
#8  0x00007f62dd649c78 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) ()
   from /opt/ros/galactic/lib/librclcpp.so
#9  0x00007f62dd649f20 in rclcpp::spin(std::shared_ptr<rclcpp::Node>) ()
   from /opt/ros/galactic/lib/librclcpp.so
#10 0x0000560d6e80f869 in main (argc=1, argv=0x7ffdad482938)
    at /home/sid/spl_ros/src/nao_lola/src/nao_lola_node.cpp:22
ijnek commented 1 year ago

Hmm, I can't tell from that backtrack what the issues is. It does seem to be memory related, so my next step if I was you would be to run it through valgrind.

Just make sure, you are using the galactic branch right?

sidrdesai commented 1 year ago

Just confirmed it is using the galactic branch. Still crashes. I'm not really experienced with using valgrind, but I will try to figure it out in the next couple days

ijnek commented 1 year ago

Install valgrind with:

apt install valgrind

and inside your workspace (on the robot):

colcon build --cmake-args -DCMAKE_BUILD_TYPE=Debug
source install/local_setup.bash
valgrind --tool=memcheck install/nao_lola/lib/nao_lola/nao_lola

if it detected memory issues, it'll print a bunch of stuff. Note that things will run about 100x slower while running with valgrind.

sidrdesai commented 1 year ago

Here is the output of valgrind. Not sure exactly how to make sense of it.

==2617== Invalid read of size 8
==2617==    at 0x486489C: _M_get (shared_ptr_base.h:1021)
==2617==    by 0x486489C: operator-> (shared_ptr_base.h:1015)
==2617==    by 0x486489C: MsgpackPacker::setJointPositions(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >) (msgpack_packer.cpp:102)
==2617==    by 0x153E00: operator() (nao_lola.cpp:90)
==2617==    by 0x153E00: std::_Function_handler<void (std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >), NaoLola::createSubscriptions()::{lambda(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >&&) (std_function.h:300)
==2617==    by 0x1B0618: operator() (std_function.h:688)
==2617==    by 0x1B0618: operator()<std::function<void(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)>&> (any_subscription_callback.hpp:255)
==2617==    by 0x1B0618: __invoke_impl<void, rclcpp::AnySubscriptionCallback<MessageT, AllocatorT>::dispatch(std::shared_ptr<_Tp>, const rclcpp::MessageInfo&) [with MessageT = nao_command_msgs::msg::JointPositions_<std::allocator<void> >; AllocatorT = std::allocator<void>]::<lambda(auto:22&&)>, std::function<void(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)>&> (invoke.h:60)
==2617==    by 0x1B0618: __invoke<rclcpp::AnySubscriptionCallback<MessageT, AllocatorT>::dispatch(std::shared_ptr<_Tp>, const rclcpp::MessageInfo&) [with MessageT = nao_command_msgs::msg::JointPositions_<std::allocator<void> >; AllocatorT = std::allocator<void>]::<lambda(auto:22&&)>, std::function<void(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)>&> (invoke.h:95)
==2617==    by 0x1B0618: __visit_invoke_impl (variant:981)
==2617==    by 0x1B0618: __do_visit_invoke (variant:989)
==2617==    by 0x1B0618: std::__detail::__variant::__gen_vtable_impl<true, std::__detail::__variant::_Multi_array<void (*)(rclcpp::AnySubscriptionCallback<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::allocator<void> >::dispatch(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)::{lambda(auto:1&&)#1}&&, std::variant<std::function<void (nao_command_msgs::msg::JointPositions_<std::allocator<void> > const&)>, std::variant<void (nao_command_msgs::msg::JointPositions_<std::allocator<void> > const, rclcpp::MessageInfo const&)>, std::variant<void (std::unique_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > > >)>, std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)>, std::variant<void (std::shared_ptr<std::function>)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)>, rclcpp::MessageInfo const&)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> const&)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> const, rclcpp::MessageInfo const&)>, std::variant<void (std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)>, std::variant<void (std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> >&)>, std::tuple<std::variant<std::function<void (nao_command_msgs::msg::JointPositions_<std::allocator<void> > const&)>, std::variant<void (nao_command_msgs::msg::JointPositions_<std::allocator<void> > const, rclcpp::MessageInfo const&)>, std::variant<void (std::unique_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > > >)>, std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)>, std::variant<void (std::shared_ptr<std::function>)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)>, rclcpp::MessageInfo const&)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> const&)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> const, rclcpp::MessageInfo const&)>, std::variant<void (std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)>, std::variant<void (std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> > >, std::integer_sequence<unsigned long, 8ul> >::__visit_invoke(rclcpp::AnySubscriptionCallback<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::allocator<void> >::dispatch(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfoconst&)::{lambda(auto:1&&)#1}, std::variant<std::function<void (nao_command_msgs::msg::JointPositions_<std::allocator<void> > const&)>, std::variant<void (nao_command_msgs::msg::JointPositions_<std::allocator<void> > const, rclcpp::MessageInfo const&)>, std::variant<void (std::unique_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > > >)>, std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)>, std::variant<void (std::shared_ptr<std::function>)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)>, rclcpp::MessageInfo const&)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> const&)>, std::variant<void (std::variant<void (std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> const, rclcpp::MessageInfo const&)>, std::variant<void (std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)>, std::variant<void (std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, rclcpp::MessageInfo const&)> >) (variant:1005)
==2617==    by 0x21D52B: __do_visit<> (variant:1652)
==2617==    by 0x21D52B: visit<rclcpp::AnySubscriptionCallback<MessageT, AllocatorT>::dispatch(std::shared_ptr<_Tp>, const rclcpp::MessageInfo&) [with MessageT = nao_command_msgs::msg::JointPositions_<std::allocator<void> >; AllocatorT = std::allocator<void>]::<lambda(auto:22&&)>, std::variant<std::function<void(const nao_command_msgs::msg::JointPositions_<std::allocator<void> >&)>, std::function<void(const nao_command_msgs::msg::JointPositions_<std::allocator<void> >&, const rclcpp::MessageInfo&)>, std::function<void(std::unique_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > > >)>, std::function<void(std::unique_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::default_delete<nao_command_msgs::msg::JointPositions_<std::allocator<void> > > >, const rclcpp::MessageInfo&)>, std::function<void(std::shared_ptr<const nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)>, std::function<void(std::shared_ptr<const nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, const rclcpp::MessageInfo&)>, std::function<void(const std::shared_ptr<const nao_command_msgs::msg::JointPositions_<std::allocator<void> > >&)>, std::function<void(const std::shared_ptr<const nao_command_msgs::msg::JointPositions_<std::allocator<void> > >&, const rclcpp::MessageInfo&)>, std::function<void(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >)>, std::function<void(std::shared_ptr<nao_command_msgs::msg::JointPositions_<std::allocator<void> > >, const rclcpp::MessageInfo&)> >&> (variant:1663)
==2617==    by 0x21D52B: dispatch (any_subscription_callback.hpp:238)
==2617==    by 0x21D52B: rclcpp::Subscription<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::allocator<void>, rclcpp::message_memory_strategy::MessageMemoryStrategy<nao_command_msgs::msg::JointPositions_<std::allocator<void> >, std::allocator<void> > >::handle_message(std::shared_ptr<void>&, rclcpp::MessageInfo const&) (subscription.hpp:284)
==2617==    by 0x4984FA3: ??? (in /opt/ros/galactic/lib/librclcpp.so)
==2617==    by 0x4985873: rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) (in /opt/ros/galactic/lib/librclcpp.so)
==2617==    by 0x4985FA4: rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) (in /opt/ros/galactic/lib/librclcpp.so)
==2617==    by 0x498D0BB: rclcpp::executors::SingleThreadedExecutor::spin() (in /opt/ros/galactic/lib/librclcpp.so)
==2617==    by 0x498BC77: rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) (in /opt/ros/galactic/lib/librclcpp.so)
==2617==    by 0x498BF1F: rclcpp::spin(std::shared_ptr<rclcpp::Node>) (in /opt/ros/galactic/lib/librclcpp.so)
==2617==    by 0x14E9AE: main (nao_lola_node.cpp:22)
==2617==  Address 0xbe22100 is 16 bytes inside a block of size 208 free'd
==2617==    at 0x483CFBF: operator delete(void*) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2617==    by 0x155634: _M_release (shared_ptr_base.h:171)
==2617==    by 0x155634: _M_release (shared_ptr_base.h:148)
==2617==    by 0x155634: ~__shared_count (shared_ptr_base.h:730)
==2617==    by 0x155634: ~__shared_ptr (shared_ptr_base.h:1169)
==2617==    by 0x155634: operator= (shared_ptr_base.h:1265)
==2617==    by 0x155634: operator= (shared_ptr.h:335)
==2617==    by 0x155634: NaoLola::NaoLola()::{lambda()#1}::operator()() const [clone .isra.0] (nao_lola.cpp:51)
==2617==    by 0x4C50DE3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==2617==    by 0x4ADF608: start_thread (pthread_create.c:477)
==2617==    by 0x4E96132: clone (clone.S:95)
==2617==  Block was alloc'd at
==2617==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2617==    by 0x154FF5: allocate (new_allocator.h:114)
==2617==    by 0x154FF5: allocate (alloc_traits.h:443)
==2617==    by 0x154FF5: __allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<MsgpackPacker, std::allocator<MsgpackPacker>, (__gnu_cxx::_Lock_policy)2> > > (allocated_ptr.h:97)
==2617==    by 0x154FF5: __shared_count<MsgpackPacker, std::allocator<MsgpackPacker> > (shared_ptr_base.h:677)
==2617==    by 0x154FF5: __shared_ptr<std::allocator<MsgpackPacker> > (shared_ptr_base.h:1344)
==2617==    by 0x154FF5: shared_ptr<std::allocator<MsgpackPacker> > (shared_ptr.h:359)
==2617==    by 0x154FF5: allocate_shared<MsgpackPacker, std::allocator<MsgpackPacker> > (shared_ptr.h:702)
==2617==    by 0x154FF5: make_shared<MsgpackPacker> (shared_ptr.h:718)
==2617==    by 0x154FF5: NaoLola::NaoLola()::{lambda()#1}::operator()() const [clone .isra.0] (nao_lola.cpp:51)
==2617==    by 0x4C50DE3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==2617==    by 0x4ADF608: start_thread (pthread_create.c:477)
==2617==    by 0x4E96132: clone (clone.S:95)
ijnek commented 1 year ago

Thanks, this output is useful. I think there is a race condition that is happening. I am able to reproduce this problem. I will have a look at it and try and get a fix in.

ijnek commented 1 year ago

I've put up a fix but I don't have a NAO to test this with. Could you try out that branch and tell me how it goes @sidrdesai ? It'll also be nice if you could review it.