Open msplr opened 2 weeks ago
Possibly a duplicate of #2445
I reproduced this issue in rolling.
GDB output
Thread 16 "rclcpp_2664" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe74006c0 (LWP 292621)]
0x00007ffff5b8de9d in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=0x7ffff6558140 "rmw_fastrtps_cpp", subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0,
clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0, wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:310
310 if (!condition->get_trigger_value()) {
(gdb) bt
#0 0x00007ffff5b8de9d in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=0x7ffff6558140 "rmw_fastrtps_cpp", subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0,
clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0, wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:310
#1 0x00007ffff6548a5f in rmw_wait (subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0, clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0,
wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#2 0x00007ffff69a4146 in rmw_wait (v7=0x555555919678, v6=0x555555919690, v5=0x5555559196c0, v4=0x5555559196a8, v3=0x5555559196d8, v2=0x5555559156a0, v1=0x7fffe73fdc80)
at /root/ros2_src_rolling/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:578
#3 0x00007ffff6fbe161 in rcl_wait (wait_set=0x7fffffff2718, timeout=-1) at /root/ros2_src_rolling/src/ros2/rcl/rcl/src/rcl/wait.c:657
#4 0x00007ffff7b0fa8d in rclcpp::wait_set_policies::SequentialSynchronization::sync_wait<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > >(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::function<void ()>, std::function<rcl_wait_set_s& ()>, std::function<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > (rclcpp::WaitResultKind)>) (this=0x7fffffff2718, time_to_wait_ns=std::chrono::duration = { -1ns }, rebuild_rcl_wait_set=...,
get_rcl_wait_set=..., create_wait_result=...) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/wait_set_policies/sequential_synchronization.hpp:280
#5 0x00007ffff7b08461 in rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage>::wait<long, std::ratio<1l, 1000000000l> > (this=0x7fffffff2718,
time_to_wait=std::chrono::duration = { -1ns }) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/wait_set_template.hpp:712
#6 0x00007ffff7af31bb in rclcpp::Executor::wait_for_work (this=0x7fffffff24b0, timeout=std::chrono::duration = { -1ns }) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:738
#7 0x00007ffff7af42a6 in rclcpp::Executor::get_next_executable (this=0x7fffffff24b0, any_executable=..., timeout=std::chrono::duration = { -1ns })
at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:886
#8 0x00007ffff7b48d53 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7fffffff24b0) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:41
#9 0x00005555555639f4 in operator() (__closure=0x5555559155e8) at /root/ros2_src_rolling/src/bug_reproduce/rclcpp_bug/src/rclcpp_2664/rclcpp_2664.cpp:13
#10 0x0000555555565df0 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#11 0x0000555555565db3 in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
#12 0x0000555555565d60 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x5555559155e8) at /usr/include/c++/13/bits/std_thread.h:292
#13 0x0000555555565cee in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x5555559155e8) at /usr/include/c++/13/bits/std_thread.h:299
#14 0x0000555555565a6c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x5555559155e0) at /usr/include/c++/13/bits/std_thread.h:244
#15 0x00007ffff6d89bb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007ffff6afaa94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#17 0x00007ffff6b87c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
A guard condition that no longer exists is being accessed.
It is currently suspected to be related to the notify_guard_condition of the temporarily created callback group.
https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/include/rclcpp/callback_group.hpp#L269
The newly created callback group is stored in the node using a weak pointer.
https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/include/rclcpp/node_interfaces/node_base.hpp#L159
I will investigate further.
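To illustrate the suspicion, here is a minimal sketch of that lifetime relationship using plain std types (not the actual rclcpp classes; the names are only illustrative): because the node holds the callback group weakly, the caller's shared_ptr is the sole owner, so the notify guard condition dies together with the group.

#include <cassert>
#include <memory>

struct GuardCondition {};

struct CallbackGroup
{
  std::shared_ptr<GuardCondition> notify_guard_condition = std::make_shared<GuardCondition>();
};

int main()
{
  std::weak_ptr<CallbackGroup> stored_in_node;      // the node only keeps a weak reference
  std::weak_ptr<GuardCondition> seen_by_executor;   // what the wait machinery effectively holds

  {
    auto group = std::make_shared<CallbackGroup>(); // the "temporary" callback group
    stored_in_node = group;
    seen_by_executor = group->notify_guard_condition;
  }  // the last shared_ptr dies here: the group and its guard condition are freed

  assert(stored_in_node.expired());
  assert(seen_by_executor.expired());  // any raw handle still in a wait set now dangles
  return 0;
}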
I was also able to reproduce, but it does seem to exclusively happen on fastrtps for me and not with the cyclone middleware, perhaps a sequencing issue? https://github.com/ros2/rclcpp/pull/2665
@mjcarroll @Barry-Xu-2018 I can only reproduce this issue with rmw_fastrtps, not with rmw_cyclonedds.
Possibly a duplicate of https://github.com/ros2/rclcpp/issues/2445
According to the stack traces in https://github.com/ros2/rclcpp/issues/2664#issuecomment-2464256660 and https://github.com/ros2/rclcpp/issues/2445#issuecomment-1978366522, it looks like a different problem.
The rmw_guard_condition_t created by the temporary callback group is freed after the temporary callback group is removed, but the freed GuardCondition is still used in __rmw_wait().
The root cause:
The GuardCondition in the temporary callback group was added to Executor::notify_waitable_.
In Executor::collect_entities(), all guard conditions in Executor::notify_waitable_ (including the interrupt, shutdown, node, and callback group guard conditions) are added as one waitable.
In storage_rebuild_rcl_wait_set_with_sets(), that waitable is added to the wait set, which finally calls ExecutorNotifyWaitable::add_to_wait_set().
As the linked code below shows (see also the sketch after the link), it does not keep the guard condition locked, so when the callback group is freed, the related guard condition is freed as well.
If rmw_wait() then accesses the freed guard condition, this issue occurs.
https://github.com/ros2/rclcpp/blob/37e36880260f10c85fb20809accec7cc77a3b1dd/rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp#L54-L65
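For reference, a simplified sketch of the loop body in the linked add_to_wait_set() (paraphrased, not verbatim; the link above is the real code):

for (auto weak_guard_condition : notify_guard_conditions_) {
  auto guard_condition = weak_guard_condition.lock();
  if (!guard_condition) {
    continue;
  }
  // Only the raw rcl handle ends up in the wait set ...
  rcl_guard_condition_t * cond = &guard_condition->get_rcl_guard_condition();
  rcl_wait_set_add_guard_condition(&wait_set, cond, NULL);
  // ... and the locked shared_ptr is released at the end of this iteration,
  // so nothing keeps the guard condition alive across the later rmw_wait().
}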
If the guard condition isn't freed, the executor is notified about the callback group change, and the guard condition is removed before the next call to rmw_wait(), so the issue doesn't occur.
So the root cause is unrelated to the rmw layer.
I haven't found a better solution (i.e. a suitable location in the code for locking and unlocking). Do you have any suggestions? @mjcarroll @fujitatomoya
@Barry-Xu-2018 thank you for checking on this issue. could you also enlighten me why this problem does not happen with cyclonedds if that is the problem with rclcpp?
@fujitatomoya
@Barry-Xu-2018 thank you for checking on this issue. could you also enlighten me why this problem does not happen with cyclonedds if that is the problem with rclcpp?
It is related to the implementation of rmw_cyclonedds.
// Detach guard conditions
if (gcs) {
  for (size_t i = 0; i < gcs->guard_condition_count; i++) {
    auto x = static_cast<CddsGuardCondition *>(gcs->guard_conditions[i]);
    if (ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems)) {
      bool dummy;
      dds_take_guardcondition(x->gcondh, &dummy);
      trig_idx++;
    } else {
      gcs->guard_conditions[i] = nullptr;
    }
    nelems++;
  }
}
Based on the debugging results, when the guard condition has been released, ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems)
in rmw_wait() is not satisfied, so rmw_cyclonedds never dereferences the freed guard condition pointer.
There's no way to directly lock the guard condition of the callback group inside waitset.wait().
So I am considering locking the callback group before waitset.wait() and releasing it after waitset.wait() returns.
The modification is as follows (a condensed sketch of the idea follows the link):
https://github.com/Barry-Xu-2018/rclcpp/commit/fbe602fa7c85b4c09b132f175edcaf90d84aae35#diff-e09e1b6262f7c9c86a3fc1a128c0da842729af46311955be86d1c97bf2558fbeR747
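Condensed sketch of the idea only (the linked commit is the authoritative change; the variable names below are illustrative):

// Take shared ownership of every callback group just before waiting, so their
// notify guard conditions stay valid while rmw_wait() runs, then release them.
std::vector<rclcpp::CallbackGroup::SharedPtr> kept_alive;
for (const auto & weak_group : this->collector_.get_all_callback_groups()) {
  if (auto group = weak_group.lock()) {
    kept_alive.push_back(group);
  }
}
this->wait_result_.emplace(wait_set_.wait(timeout));
kept_alive.clear();  // drop ownership once the wait has returned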
Possibly the same problem as https://github.com/ros2/demos/issues/654
Possibly the same problem as https://github.com/ros2/demos/issues/654
Not sure. After fixing the issue, we can try to see if the problem still occurs.
@mjcarroll @fujitatomoya @clalancette
The root cause has been found. Please look at https://github.com/ros2/rclcpp/issues/2664#issuecomment-2467540388.
About the solution, I'd like to hear your opinions. I have a solution (https://github.com/ros2/rclcpp/issues/2664#issuecomment-2484986914), but I think there might be a better one.
Possibly the same problem as ros2/demos#654
Not sure. After fixing the issue, we can try to see if the problem still occurs.
My code is not the same as in https://github.com/ros2/demos/issues/654, but I have tested not calling reset() on the callback group in on_cleanup(), and the program doesn't throw a segmentation fault anymore, so I think it is the same bug.
Based on the debugging results, when the guard condition is released, ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems) in rmw_wait() is not satisfied.
It looks like rmw_cyclonedds checks the waitset trigger index after the wait call; if the index does not match, it sets the guard_conditions entry to nullptr without accessing it.
On the other hand, rmw_fastrtps expects all of these entities to exist during the rmw_wait call. (I think rmw_connextdds has the same expectation.)
I haven't found a better solution (i.e. a suitable location in the code for locking and unlocking). Do you have any suggestions?
To be honest, I am not really sure either, including whether this approach is the appropriate path...
@mjcarroll @alsora @jmachowinski any thoughts?
This is a design failure in the ExecutorNotifyWaitable. From my point of view this is clearly an rclcpp issue, not an RMW issue.
Before creating the rmw waitset, we acquire all shared pointers to make sure that they don't go out of scope during the rmw_wait. The BIG assumption here is that if I hold a valid shared pointer to anything inside the collection, none of the rcl primitives will go out of scope. Therefore all pointers inside the rmw waitset should be valid during the rmw_wait, and it is totally fine to access them during the rmw_wait.
The bug is in the ExecutorNotifyWaitable, as it only holds weak pointers to the guard conditions. Therefore the GuardConditions can go out of scope during the rmw_wait, causing the segfault.
As a fix for Jazzy, I would suggest something like this:
void
Executor::wait_for_work(std::chrono::nanoseconds timeout)
{
  TRACETOOLS_TRACEPOINT(rclcpp_executor_wait_for_work, timeout.count());

  // Clear any previous wait result
  this->wait_result_.reset();

  // We need to make sure that callback groups don't go out of scope
  // during the wait. As in Jazzy they are not covered by the DynamicStorage,
  // we explicitly hold them here as a bugfix.
  std::vector<rclcpp::CallbackGroup::SharedPtr> cbgs;

  {
    std::lock_guard<std::mutex> guard(mutex_);

    if (this->entities_need_rebuild_.exchange(false) || current_collection_.empty()) {
      this->collect_entities();
    }

    auto callback_groups = this->collector_.get_all_callback_groups();
    cbgs.reserve(callback_groups.size());
    for (const auto & w_ptr : callback_groups) {
      auto shr_ptr = w_ptr.lock();
      if (shr_ptr) {
        cbgs.push_back(std::move(shr_ptr));
      }
    }
  }

  this->wait_result_.emplace(wait_set_.wait(timeout));
  if (!this->wait_result_ || this->wait_result_->kind() == WaitResultKind::Empty) {
    RCUTILS_LOG_WARN_NAMED(
      "rclcpp",
      "empty wait set received in wait(). This should never happen.");
  } else {
    // Drop references to the callback groups before trying to execute anything
    cbgs.clear();

    if (this->wait_result_->kind() == WaitResultKind::Ready && current_notify_waitable_) {
      auto & rcl_wait_set = this->wait_result_->get_wait_set().get_rcl_wait_set();
      if (current_notify_waitable_->is_ready(rcl_wait_set)) {
        current_notify_waitable_->execute(current_notify_waitable_->take_data());
      }
    }
  }
}
@jmachowinski thanks!
@Barry-Xu-2018 can you try https://github.com/ros2/rclcpp/pull/2683 to see if that fixes the problem?
@fujitatomoya
jmachowinski's solution is almost the same as what I had in mind (https://github.com/ros2/rclcpp/issues/2664#issuecomment-2484986914). I've already verified that this method fixes the problem.
Currently, acquiring ownership is done in one place:
https://github.com/ros2/rclcpp/blob/7e9ff5f4c774e6c0b28268f2d2b1b03fc52383a2/rclcpp/include/rclcpp/wait_set_template.hpp#L676-L677
The fix, however, handles the callback groups separately in a different place, which is why I was wondering whether there might be a better approach.
@Barry-Xu-2018 Sorry, I missed your solution...
I think this was already written somewhere else, but just to sum it up: my patch is just a quick way to fix the problem in Rolling and Jazzy.
For Rolling, we should afterwards implement a cleaner solution, like locking the callback groups in the DynamicStorage.
@jmachowinski
It's okay. Our goal is to fix this problem. :smile:
For Rolling, we should afterwards implement a cleaner solution, like locking the callback groups in the DynamicStorage.
Totally agree. It is better to lock the callback groups in the DynamicStorage.
Bug report
Required Info:
Issue description
The rclcpp::Executor crashes when a timer with a non-default callback group is destroyed before the executor stops spinning. The bug also appears if a callback group is created and destroyed without being added to the timer.
The bug does not manifest in the example below when:
Steps to reproduce issue
See the full gist
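Not the linked gist itself, but a condensed sketch of the scenario described above (node name, sleep durations, and overall structure are illustrative):

#include <chrono>
#include <memory>
#include <thread>

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("repro_node");

  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  std::thread spin_thread([&executor]() {executor.spin();});

  {
    // A timer in a non-default callback group; both are only weakly held by the node.
    auto group =
      node->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);
    auto timer =
      node->create_wall_timer(std::chrono::milliseconds(100), []() {}, group);
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }  // timer and callback group are destroyed here while the executor is still spinning

  std::this_thread::sleep_for(std::chrono::seconds(1));
  executor.cancel();
  spin_thread.join();
  rclcpp::shutdown();
  return 0;
}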
Expected behavior
The node runs and terminates without issue.
Actual behavior
The node fails with a segmentation fault in rclcpp::Executor::spin().
Additional information
Backtrace: