Segmentation fault when adding a short-lived node to an executor

adamlm commented 2 months ago

Bug report

Description

I am trying to set up an executor "spinner thread" that will spin a specified rclcpp::Executor until commanded to stop. My goal is to construct the executor and spinner thread in one scope then add nodes to that executor in another scope. However, the rclcpp::Executor classes cause a segmentation fault under certain conditions. There appears to be a race condition in the RMW layer that manifests with short-lived nodes.

Operating System: Ubuntu 22.04 (Docker container); Ubuntu 22.04 host
Installation type: Binaries
Version or commit hash:
- ros-humble-rclcpp: 16.0.10
- ros-humble-rmw-fastrtps-cpp: 6.2.7
- ros-humble-fastrtps: 2.6.8
- ros-humble-rmw-cyclonedds-cpp: 0.10.4
- ros-humble-cyclonedds: 1.3.4
DDS implementation:
- Fast-DDS (Fast-RTPS)
- Cyclone DDS
Client library (if applicable): rclcpp
Compiler:
- Clang 14
- GCC 11.4.0

Steps to reproduce issue

Minimal reproducible example

#include <atomic>
#include <memory>
#include <thread>

#include <rclcpp/executors/multi_threaded_executor.hpp>
#include <rclcpp/node.hpp>

auto function() -> void
{
    rclcpp::init(0, nullptr);

    rclcpp::executors::MultiThreadedExecutor executor;

    std::atomic_bool stop_spinning{false};
    auto spinner_thread{std::thread([&] {
        while (rclcpp::ok() && !stop_spinning.load()) {
            executor.spin_some();
        }
    })};

    {
        auto node{std::make_shared<rclcpp::Node>("node")};
        executor.add_node(node);
    }

    stop_spinning.store(true);
    spinner_thread.join();

    rclcpp::shutdown();
}

auto main() -> int
{
    while (true) {
        function();
    }

    return 0;
}

Steps

Compile and run the above example.
Wait for a segmentation fault error. It usually happens within a minute.

Expected behavior

The program runs indefinitely.

Actual behavior

The program exits due to a segmentation fault.

Program output when compiled with Clang

Segmentation fault (core dumped)

Program output when compiled with GCC

Segmentation fault (core dumped)

In (only) one of the many runs I did, I got the following output (when compiled with GCC):

terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  failed to add guard condition to wait set: guard condition implementation is invalid, at ./src/rcl/guard_condition.c:172, at ./src/rcl/wait.c:460
Aborted (core dumped)

Additional information

Relevant core dump

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4154232e3f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
[Current thread is 1 (Thread 0x7f41531e8640 (LWP 1202848))]
(gdb) bt
#0  0x00007f4154232e3f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
    at /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#1  0x00007f415428a887 in rmw_wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_events_t*, rmw_wait_set_t*, rmw_time_t const*)
    (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>, wait_timeout=0x7f41531d5510) at ./src/rmw_wait.cpp:33
#2  0x00007f42759b1848 in rcl_wait () at /opt/ros/humble/lib/librcl.so
#3  0x00007f4274f2470c in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) (this=0x7ffdbe823c50, timeout=...) at /usr/include/c++/11/chrono:521
#4  0x00007f4274f28853 in rclcpp::Executor::spin_some_impl(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, bool) (this=0x7ffdbe823c50, max_duration=..., exhaustive=false) at /usr/include/c++/11/chrono:537
#5  0x000063ca4387dbde in function()::$_0::operator()() const (this=0x63ca44fc84e8) at <redacted_executable_path>:17
#6  0x000063ca4387db1d in std::__invoke_impl<void, function()::$_0>(std::__invoke_other, function()::$_0&&) (__f=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61
#7  0x000063ca4387daad in std::__invoke<function()::$_0>(function()::$_0&&) (__fn=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96
#8  0x000063ca4387da85 in std::thread::_Invoker<std::tuple<function()::$_0> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x63ca44fc84e8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:259
#9  0x000063ca4387da55 in std::thread::_Invoker<std::tuple<function()::$_0> >::operator()() (this=0x63ca44fc84e8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:266
#10 0x000063ca4387d979 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<function()::$_0> > >::_M_run() (this=0x63ca44fc84e0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211
#11 0x00007f4273705253 in  () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f427338dac3 in  () at /usr/lib/x86_64-linux-gnu/libc.so.6
#13 0x00007f427341f850 in  () at /usr/lib/x86_64-linux-gnu/libc.so.6

Other notes

The issue occurs when using either the SingleThreadedExecutor or MultiThreadedExecutor.

No segmentation fault occurs* when relocating node to the outer scope.

auto node{std::make_shared<rclcpp::Node>("node")};

{
  executor.add_node(node);
}

No segmentation fault occurs* when using Cyclone DDS

No segmentation fault occurs* when sleeping the main thread for a short duration before adding node to the executor.

{
  auto node{std::make_shared<rclcpp::Node>("node")};
  std::this_thread::sleep_for(std::chrono::seconds(1));
  executor.add_node(node);
}

* I concluded a segmentation fault was unlikely after running the program for several minutes.

fujitatomoya commented 2 months ago

I am not sure if this is fixed with specific PR, but this problem cannot be observed with rolling.

christophebedard commented 2 months ago

Can you try removing the node from the executor before it goes out of scope?

adamlm commented 2 months ago

I am not sure if this is fixed with specific PR, but this problem cannot be observed with rolling.

@fujitatomoya I ran the example code in the latest Rolling Docker image (rolling-ros-base, image ID/digest f875aa41a6f0) and got a segfault there too.

Container start command:

docker run --rm -it ros:rolling-ros-base

The only changes I made inside the container were:

updating apt's package list (not upgrading the packages themselves)
installing Vim
creating a workspace at /workspace/src
creating an C++ package with ros2 pkg create
adding the example code from above

Is there anything else I can provide to help pinpoint where the issue might be?

Can you try removing the node from the executor before it goes out of scope?

@christophebedard I made the following change to the example.

{
    auto node{std::make_shared<rclcpp::Node>("node")};
    executor.add_node(node);
    executor.remove_node(node);
}

The program ran fine for over 10 minutes running on Humble with FastDDS.

christophebedard commented 2 months ago

Can you try removing the node from the executor before it goes out of scope?

@christophebedard I made the following change to the example.
{
    auto node{std::make_shared<rclcpp::Node>("node")};
    executor.add_node(node);
    executor.remove_node(node);
}
The program ran fine for over 10 minutes running on Humble with FastDDS.

I think it's reasonable to expect the user to remove the node from the executor before it goes out of scope/gets destroyed. See this test which does something similar to your reproducer: https://github.com/ros2/rclcpp/blob/2f71d6e249f626772da3f8a1bb7c8d141d9d0d52/rclcpp/test/rclcpp/test_executor.cpp#L75-L102

I'm not sure if this would cover your full use-case, so please let us know.

fujitatomoya commented 2 months ago

@adamlm thanks for checking this! actually what i tried is to source build with rolling. so maybe latest patches in rolling source (not available as package) fix this issue.

TSNoble commented 1 month ago

We're having a similar issue when creating a short-lived subscriber. I think it may be the same since:

We're creating and attaching the subscriber within the scope of a method (specifically an action callback)
The gdb backtrace is very similar (crashes within the same place in rclcpp / rmw)
The method call is very quick (and adding a sleep within it seems to prevent the crash)
We occasionally get the same rclcpp::exceptions::RCLError

We were originally upgrading from noetic to humble. I noticed https://github.com/ros2/rmw_fastrtps/issues/728, which seemed to be resolved by https://github.com/ros2/rclcpp/pull/2142. This PR appears to be in jazzy, however, upgrading to jazzy has not fixed the issue for us.

ros2 / rmw_fastrtps