ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0
559 stars 421 forks source link

Segment fault occurred when node was added to executor if the callback group variable created was local #2445

Open Chaochao0215 opened 8 months ago

Chaochao0215 commented 8 months ago

ROS2

Bug report

Required Info:

Steps to reproduce issue

Create a node and call the create_callback_group interface to create a callback group local variable, then add the node to the executor and spin the executor, segment fault occuered sometimes.

#include <memory>
#include <thread>
#include <string>
#include <vector>

#include "rclcpp/executors.hpp"
#include "rclcpp/node.hpp"
#include "rclcpp/timer.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char **argv) {
  rclcpp::init(argc, argv);

  auto node = std::make_shared<rclcpp::Node>("test_callback_group");
  auto sub = node->create_subscription<std_msgs::msg::String>(
      "/test_callback_group", 10,
      [](const std_msgs::msg::String::ConstSharedPtr) {});
  rclcpp::executors::MultiThreadedExecutor executor(rclcpp::ExecutorOptions(),
                                                    3u);
  std::thread thr;
  {
    auto group = node->create_callback_group(
        rclcpp::CallbackGroupType::MutuallyExclusive);
    executor.add_node(node);
    thr = std::thread([&executor]() { executor.spin(); });
    std::cout << "wait for executor running..." << std::endl;
  } // executor wait on dds and then destory the callback group

  while (rclcpp::ok()) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  rclcpp::shutdown();

  if (thr.joinable()) {
    thr.join();
  }

  rclcpp::shutdown();
  return 0;
} 


#### Expected behavior
process runs normally
#### Actual behavior
segment fault
#### Additional information
It seems to be related to the lifetime of the condition in the callback group.
Zard-C commented 8 months ago

Hello @Chaochao0215, maybe we could build a debug type of this source code and use gdb to backtrace when it crashed

Barry-Xu-2018 commented 8 months ago

I tried your code, but I cannot find segment fault.
My test environment is based on ros2 Debian package (ros-humble-rclcpp 16.0.8-1jammy.20240217.065743).

Chaochao0215 commented 8 months ago

Hello @Chaochao0215, maybe we could build a debug type of this source code and use gdb to backtrace when it crashed Glad to receive your reply, this issue reoccuered and i pasted the backtrace information HowHgpAxRD

Zard-C commented 8 months ago

I tried your code and I can't reproduce the segment fault. However, in gdb backtrace it seems like a multi-thread problem, rebuild your program with

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread -g")

and check if there's data race in ThreadSanitizer report.

Chaochao0215 commented 8 months ago

I tried your code and I can't reproduce the segment fault. However, in gdb backtrace it seems like a multi-thread problem, rebuild your program with

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread -g")

and check if there's data race in ThreadSanitizer report.

hi, I found a quick way to reproduce the problem

  1. set breakpoint 1 at line 21 of the test file
  2. set breakpoint 2 at line 25 of the test file
  3. set breakpoint 3 at " rcl_ret_t status = rcl_wait(&waitset, std::chrono::duration_cast(timeout).count());" in executor.cpp
  4. set gdb to not stop when hit breakpoints , then run the test
  5. after breakpoints 1 && 3 both hit , continue the main thread and wait to hit breakpoint 2
  6. after breakpoint 2 hit, send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"' and continue the thread interrupted at breakpoint 3 , segment fault should occurred
Zard-C commented 8 months ago

6. send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"

I missed this step.

Chaochao0215 commented 8 months ago
  1. send the message subscribed in the new shell terminal, such as 'ros2 topic pub /test_callback_group std_msgs/msg/String "data: hello"

I missed this step.

I'm sorry i didn't specify the need to publish message at the beginning, hope you reproduce it successfully.

Chaochao0215 commented 8 months ago

I tried your code, but I cannot find segment fault. My test environment is based on ros2 Debian package (ros-humble-rclcpp 16.0.8-1jammy.20240217.065743). Thank you for your attention, it occurs with low probability and I built ROS2 from source so that it can be controlled through gdb

fujitatomoya commented 8 months ago

@Chaochao0215 it would be helpful if you could provide self-contained reproducible test.

zxf101826 commented 4 months ago

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it? I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows

Program terminated with signal SIGSEGV, Segmentation fault.

0 0x0000007f874cdac4 in ?? () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so

[Current thread is 1 (Thread 0x7f5f7ee8e0 (LWP 402))] (gdb) bt

0 0x0000007f874cdac4 in ?? () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so

1 0x0000007f874cfd50 in rmw_wait () from /opt/ros/humble/lib/librmw_cyclonedds_cpp.so

2 0x0000007f87f16c38 in rcl_wait () from /opt/ros/humble/lib/librcl.so

3 0x0000007f88266850 in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/humble/lib/librclcpp.so

4 0x0000007f8826989c in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) ()

from /opt/ros/humble/lib/librclcpp.so

5 0x0000007f88270d38 in rclcpp::executors::MultiThreadedExecutor::run(unsigned long) () from /opt/ros/humble/lib/librclcpp.so

6 0x0000007f87d531fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6

7 0x0000007f87b1d5c8 in start_thread (arg=0x0) at ./nptl/pthread_create.c:442

8 0x0000007f87b85d9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

Zard-C commented 4 months ago

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it? I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows

Program terminated with signal SIGSEGV, Segmentation fault.

could you give a minimum demo code to regenerate this error ? I tried codes above, but it didn't crash.

zxf101826 commented 4 months ago

@Chaochao0215 Did you find the cause of this problem in the end? How did you solve it? I also encountered the same problem when using MultiThreadedExecutor, The program stack is as follows Program terminated with signal SIGSEGV, Segmentation fault.

could you give a minimum demo code to regenerate this error ? I tried codes above, but it didn't crash.

My code is mixed with a lot of logic, I don't know which call caused it, this crash is quite sporadic, The program only appears once every few days.

adamlm commented 2 months ago

This issue looks similar to https://github.com/ros2/rmw_fastrtps/issues/777.