Executor Segmentation fault when callback group is destroyed while spinning

msplr commented 2 weeks ago

Bug report

Required Info:

Operating System: Ubuntu 24.04
Installation type: debian binaries
Version:
- ros-jazzy-rclcpp Version: 28.1.5-1noble.20240922.085057
- ros-rolling-rclcpp Version: 28.3.3-1noble.20240919.221549
DDS implementation: Fast-RTPS
Client library: rclcpp

Issue description

The rclcpp::Executor crashes when a timer with a non-default callback group is destroyed before the executor stops spinning.

The bug also appears if a callback group is created and destroyed without being added to the timer.

The bug does not manifest in the example below when:

The timer's callback group has a longer lifetime than the executor spin thread (e.g. declared in the outer scope).
The timer uses the default callback group.
The callback group is moved to the timer factory.

Steps to reproduce issue

See the full gist

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

int main(int argc, char** argv) {
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("test");
  auto log = node->get_logger();
  rclcpp::executors::SingleThreadedExecutor executor;
  auto spinThread = std::thread([&] {
    executor.add_node(node);
    executor.spin();
    executor.remove_node(node);
  });

  while (!executor.is_spinning()) {
    std::this_thread::sleep_for(1ms);
  }

  {
    RCLCPP_INFO(log, "BEGIN");
    auto cg = node->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);
    auto timer = node->create_timer(1s, [&log] { RCLCPP_INFO(log, "timer"); }, cg);  // SIGSEGV
    // auto timer = node->create_timer(1s, [&log] { RCLCPP_INFO(log, "timer"); }, std::move(cg)); // OK
    std::this_thread::sleep_for(100ms);  // wait a bit for the executor to catch up
    RCLCPP_INFO(log, "END");             // timer and cg go out of scope here
  }

  executor.cancel();
  if (spinThread.joinable()) {
    spinThread.join();
  }
  rclcpp::shutdown();
}

Expected behavior

The node runs and terminates without issue.

Actual behavior

The node fails with a Segmentation fault in rclcpp::Executor::spin().

Additional information

Backtrace:

Thread 16 "rclcpp_bug" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe90006c0 (LWP 66063)]
0x00007ffff72fd301 in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/jazzy/lib/librmw_fastrtps_shared_cpp.so
(gdb) bt
#0  0x00007ffff72fd301 in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/jazzy/lib/librmw_fastrtps_shared_cpp.so
#1  0x00007ffff735d8f8 in rmw_wait () from /opt/ros/jazzy/lib/librmw_fastrtps_cpp.so
#2  0x00007ffff7f6622c in rcl_wait () from /opt/ros/jazzy/lib/librcl.so
#3  0x00007ffff7de565f in ?? () from /opt/ros/jazzy/lib/librclcpp.so
#4  0x00007ffff7d08e00 in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/jazzy/lib/librclcpp.so
#5  0x00007ffff7d0962b in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/jazzy/lib/librclcpp.so
#6  0x00007ffff7d193a5 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/jazzy/lib/librclcpp.so
#7  0x0000555555703dc1 in operator() (__closure=0x555555b6e0a8) at /home/mspieler/work/brahma_ws/ros2_ws/src/rclcpp_bug/bug.cpp:16
#8  0x000055555570c934 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#9  0x000055555570c8f7 in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
#10 0x000055555570c8a4 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x555555b6e0a8) at /usr/include/c++/13/bits/std_thread.h:292
#11 0x000055555570c7ee in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x555555b6e0a8) at /usr/include/c++/13/bits/std_thread.h:299
#12 0x000055555570c4f0 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x555555b6e0a0) at /usr/include/c++/13/bits/std_thread.h:244
#13 0x00007ffff78ecdb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007ffff749ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#15 0x00007ffff7529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

msplr commented 2 weeks ago

Possible duplicate with #2445

Barry-Xu-2018 commented 2 weeks ago

I reproduced this issue in rolling.

GDB output

Thread 16 "rclcpp_2664" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe74006c0 (LWP 292621)]
0x00007ffff5b8de9d in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=0x7ffff6558140 "rmw_fastrtps_cpp", subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0, 
    clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0, wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:310
310       if (!condition->get_trigger_value()) {
(gdb) bt
#0  0x00007ffff5b8de9d in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=0x7ffff6558140 "rmw_fastrtps_cpp", subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0, 
    clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0, wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:310
#1  0x00007ffff6548a5f in rmw_wait (subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0, clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0, 
    wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#2  0x00007ffff69a4146 in rmw_wait (v7=0x555555919678, v6=0x555555919690, v5=0x5555559196c0, v4=0x5555559196a8, v3=0x5555559196d8, v2=0x5555559156a0, v1=0x7fffe73fdc80)
    at /root/ros2_src_rolling/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:578
#3  0x00007ffff6fbe161 in rcl_wait (wait_set=0x7fffffff2718, timeout=-1) at /root/ros2_src_rolling/src/ros2/rcl/rcl/src/rcl/wait.c:657
#4  0x00007ffff7b0fa8d in rclcpp::wait_set_policies::SequentialSynchronization::sync_wait<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > >(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::function<void ()>, std::function<rcl_wait_set_s& ()>, std::function<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > (rclcpp::WaitResultKind)>) (this=0x7fffffff2718, time_to_wait_ns=std::chrono::duration = { -1ns }, rebuild_rcl_wait_set=..., 
    get_rcl_wait_set=..., create_wait_result=...) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/wait_set_policies/sequential_synchronization.hpp:280
#5  0x00007ffff7b08461 in rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage>::wait<long, std::ratio<1l, 1000000000l> > (this=0x7fffffff2718, 
    time_to_wait=std::chrono::duration = { -1ns }) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/wait_set_template.hpp:712
#6  0x00007ffff7af31bb in rclcpp::Executor::wait_for_work (this=0x7fffffff24b0, timeout=std::chrono::duration = { -1ns }) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:738
#7  0x00007ffff7af42a6 in rclcpp::Executor::get_next_executable (this=0x7fffffff24b0, any_executable=..., timeout=std::chrono::duration = { -1ns })
    at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:886
#8  0x00007ffff7b48d53 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7fffffff24b0) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:41
#9  0x00005555555639f4 in operator() (__closure=0x5555559155e8) at /root/ros2_src_rolling/src/bug_reproduce/rclcpp_bug/src/rclcpp_2664/rclcpp_2664.cpp:13
#10 0x0000555555565df0 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#11 0x0000555555565db3 in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
#12 0x0000555555565d60 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x5555559155e8) at /usr/include/c++/13/bits/std_thread.h:292
#13 0x0000555555565cee in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x5555559155e8) at /usr/include/c++/13/bits/std_thread.h:299
#14 0x0000555555565a6c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x5555559155e0) at /usr/include/c++/13/bits/std_thread.h:244
#15 0x00007ffff6d89bb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007ffff6afaa94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#17 0x00007ffff6b87c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

A non-existent guard condition is accessed.

It's currently suspected that it might be related to the notify_guardcondition in the temporarily created callback group.
https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/include/rclcpp/callback_group.hpp#L269

The newly created callback group is stored in the node using a weak pointer.
https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/include/rclcpp/node_interfaces/node_base.hpp#L159

I will investigate further.

mjcarroll commented 2 weeks ago

I was also able to reproduce, but it does seem to exclusively happen on fastrtps for me and not with the cyclone middleware, perhaps a sequencing issue? https://github.com/ros2/rclcpp/pull/2665

fujitatomoya commented 2 weeks ago

@mjcarroll @Barry-Xu-2018 i only can reproduce this issue with rmw_fastrtps but rmw_cyclonedds.

fujitatomoya commented 2 weeks ago

Possible duplicate with https://github.com/ros2/rclcpp/issues/2445

according to the stack trace between https://github.com/ros2/rclcpp/issues/2664#issuecomment-2464256660 and https://github.com/ros2/rclcpp/issues/2445#issuecomment-1978366522, it looks like a different problem.

Barry-Xu-2018 commented 2 weeks ago

rmw_guard_condition_t created by temporary callback group has been freed after temporary callback group is removed.

https://github.com/ros2/rmw_fastrtps/blob/9d2150ffc1b549d0adc71971b7f350e325b61340/rmw_fastrtps_shared_cpp/src/rmw_guard_condition.cpp#L36-L42

But freed GuardCondition is used in __rmw_wait().

https://github.com/ros2/rmw_fastrtps/blob/9d2150ffc1b549d0adc71971b7f350e325b61340/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp#L306-L310

Barry-Xu-2018 commented 2 weeks ago

The root cause:

The GuardCondition in tempararay callback group was added to Executor::notify_waitable_

https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/src/rclcpp/executors/executor_entities_collector.cpp#L299-L302

In Executor::collect_entities(), all guard conditions (include interrupt, shutdown, nodes, callback group guard condition) in Executor::notify_waitable_ are added as one waitable.

https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/src/rclcpp/executor.cpp#L672-L677

In storage_rebuild_rcl_wait_set_with_sets(), waitable was added to wait set.

https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/include/rclcpp/wait_set_policies/detail/storage_policy_common.hpp#L378

Finally, it will call ExecutorNotifyWaitable::add_to_wait_set().
You will find it doesn't lock guard condition. So while callback group is freed, related guard condition is also freed. If rwm_wait access freed guard condition, this issue occurs. https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp#L54-L65

If guard condition isn't freed, executor will get notify about callback group changed. Guard condition will be removed before next calling rwm_wait(). This issue doesn't occur.

So the root cause is unrelated to rmw.
I haven't discovered a better solution(A suitable location in the code for locking and unlocking). Do you have any suggestions? @mjcarroll @fujitatomoya

fujitatomoya commented 1 week ago

@Barry-Xu-2018 thank you for checking on this issue. could you also enlighten me why this problem does not happen with cyclonedds if that is the problem with rclcpp?

Barry-Xu-2018 commented 1 week ago

@fujitatomoya

@Barry-Xu-2018 thank you for checking on this issue. could you also enlighten me why this problem does not happen with cyclonedds if that is the problem with rclcpp?

It is related to the implementation of rmw_cyclonedds.

https://github.com/ros2/rmw_cyclonedds/blob/0c2904d725dade83cc0b1ffde931b46f9d28269f/rmw_cyclonedds_cpp/src/rmw_node.cpp#L4485-L4498

    // Detach guard conditions
    if (gcs) {
      for (size_t i = 0; i < gcs->guard_condition_count; i++) {
        auto x = static_cast<CddsGuardCondition *>(gcs->guard_conditions[i]);
        if (ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems)) {
          bool dummy;
          dds_take_guardcondition(x->gcondh, &dummy);
          trig_idx++;
        } else {
          gcs->guard_conditions[i] = nullptr;
        }
        nelems++;
      }
    }

Based on the debugging results, when the guard condition is released, ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems) in rmw_wait() is not satisfied. So rmw_cyclonedds doesn't access freed pointer of guard condition.

Barry-Xu-2018 commented 6 days ago

https://github.com/ros2/rclcpp/blob/fbe602fa7c85b4c09b132f175edcaf90d84aae35/rclcpp/src/rclcpp/executor.cpp#L748

There's no way to directly lock the guard condition of the callback group in waitset.wait().
So I consider locking the callback group before waitset.wait(), and then releasing the callback group after calling waitset.wait().
Modify as follows
https://github.com/Barry-Xu-2018/rclcpp/commit/fbe602fa7c85b4c09b132f175edcaf90d84aae35#diff-e09e1b6262f7c9c86a3fc1a128c0da842729af46311955be86d1c97bf2558fbeR747

DiogoRoseira commented 6 days ago

Possible, the same problem of https://github.com/ros2/demos/issues/654

Barry-Xu-2018 commented 6 days ago

Possible, the same problem of https://github.com/ros2/demos/issues/654

Not sure. After fixing the issue, we can try to see if the problem still occurs.

Barry-Xu-2018 commented 6 days ago

@mjcarroll @fujitatomoya @clalancette

The root cause has been found. Please look at https://github.com/ros2/rclcpp/issues/2664#issuecomment-2467540388.
About the solution, I'd like to hear your opinions. I have a solution (https://github.com/ros2/rclcpp/issues/2664#issuecomment-2484986914. But I think there might be a better solution.).

DiogoRoseira commented 5 days ago

Possible, the same problem of ros2/demos#654

Not sure. After fixing the issue, we can try to see if the problem still occurs.

My code is not equal of the issue https://github.com/ros2/demos/issues/654, but i have tested not reset() the callback group on the on_cleanup() and the program dont throw a segmentation fault anymore, so i think it is the same bug.

fujitatomoya commented 4 days ago

Based on the debugging results, when the guard condition is released, ws->trigs[trig_idx] == static_cast(nelems) in rmw_wait() is not satisfied.

it looks like rmw_cyclonedds checks waitset trigger index after wait call. if the index does not match, it sets nullptr for guard_conditions without accessing it. on the other hand, rmw_fastrtps expects all these entities exist during this rmw_wait call. (i think rmw_connextdds also has the same expectation.)

I haven't discovered a better solution(A suitable location in the code for locking and unlocking). Do you have any suggestions?

to be honest, i am not really sure either, including this approach is the appropriate path...

@mjcarroll @alsora @jmachowinski any thoughts?

jmachowinski commented 3 days ago

This is a design failure in the ExecutorNotifyWaitable. From my point of view this is clearly a rclcpp issue, not a RWM issue.

Before creating the rwm waitset, we aquire all shared pointers, to make sure, that they don't run out of scope during the rmw_wait. The BIG assumption here is, that if I hold a valid shared ptr to anything inside the collection, none of the rcl primitves will go out of scope. Therefore all pointers inside the rwm_waitset should be valid during the rwm_wait, and it is totally fine, to access the during the rwm_wait.

The bug is in the ExecutorNotifyWaitable, as it only holds weak pointers to the guard condition. Therefore the GuardConditions can run out of scope during the rmw_wait, causing the segfault.

As for a fix for jazzy, I would suggest something like this :

void
Executor::wait_for_work(std::chrono::nanoseconds timeout)
{
  TRACETOOLS_TRACEPOINT(rclcpp_executor_wait_for_work, timeout.count());

  // Clear any previous wait result
  this->wait_result_.reset();

  // we need to make sure that callback groups don't get out of scope
  // during the wait. As in jazzy, they are not covered by the DynamicStorage,
  // we explicitly hold them here as a bugfix
  std::vector<rclcpp::CallbackGroup::SharedPtr> cbgs;

  {
    std::lock_guard<std::mutex> guard(mutex_);

    if (this->entities_need_rebuild_.exchange(false) || current_collection_.empty()) {
      this->collect_entities();
    }

    auto callback_groups = this->collector_.get_all_callback_groups();
    cbgs.resize(callback_groups.size());
    for(const auto &w_ptr : callback_groups)
    {
      auto shr_ptr = w_ptr.lock();
      if(shr_ptr)
      {
        cbgs.push_back(std::move(shr_ptr));
      }
    }
  }
  this->wait_result_.emplace(wait_set_.wait(timeout));
  if (!this->wait_result_ || this->wait_result_->kind() == WaitResultKind::Empty) {
    RCUTILS_LOG_WARN_NAMED(
      "rclcpp",
      "empty wait set received in wait(). This should never happen.");
  } else {
    // drop references to the callback groups, before trying to execute anything
    cbgs.clear();
    if (this->wait_result_->kind() == WaitResultKind::Ready && current_notify_waitable_) {
      auto & rcl_wait_set = this->wait_result_->get_wait_set().get_rcl_wait_set();
      if (current_notify_waitable_->is_ready(rcl_wait_set)) {
        current_notify_waitable_->execute(current_notify_waitable_->take_data());
      }
    }
  }
}

fujitatomoya commented 3 days ago

@jmachowinski thanks!

@Barry-Xu-2018 can you try https://github.com/ros2/rclcpp/pull/2683 to see if that fixes the problem?

Barry-Xu-2018 commented 1 day ago

@fujitatomoya

Jmachowinski's solution is almost same as what I thought https://github.com/ros2/rclcpp/issues/2664#issuecomment-2484986914. I've already verified that this method can fix the problem.
Currently, acquiring ownership is done in one poistion. https://github.com/ros2/rclcpp/blob/7e9ff5f4c774e6c0b28268f2d2b1b03fc52383a2/rclcpp/include/rclcpp/wait_set_template.hpp#L676-L677 The fix, however, will handle the callback group separately in a different place, which is why I was considering if there might be a better approach.

jmachowinski commented 17 hours ago

@Barry-Xu-2018 Sorry, I missed your solution...

I think this was already written somewhere else, but just to sum it up. My patch is just a quick way to fix the problem in rolling and jazzy.

For rolling, we should afterwards implement a cleaner solution, like locking the callbackgroups in the DynamicStorage.

Barry-Xu-2018 commented 2 hours ago

@jmachowinski

It's okay. Our goal is to fix this problem. :smile:

For rolling, we should afterwards implement a cleaner solution, like locking the callbackgroups in the DynamicStorage.

Totally agree. It is better to lock callbackgroups in the DynamicStorage.

ros2 / rclcpp