Open EricCousineau-TRI opened 1 week ago
Relates to Drake, drake-ros
, and pydrake
.
Due to a workaround in pydrake https://github.com/RobotLocomotion/drake/pull/14356, we may not always release certain objects. (Non-ideal, but will take some effort to fix.)
If we have components that use rclcpp::Node
, but are not deleted, we can encounter segfaults.
Example minimal code (from src/ros_segfault_min_lib.cc
) capturing the core issue:
// Expected case: rclcpp::Node will be freed.
void InitNode() {
RclcppInit();
rclcpp::Node("node");
}
// Bad case: rclcpp::Node is leaked :(
void InitAndLeakNode() {
RclcppInit();
new rclcpp::Node("node");
}
This shows three cases, running 200 trials, 50 in parallel
direct_leak
: Link a shared lib directly, call InitAndLeakNode()
. No segfaults.dlopen_noleak
: dlopen()
shared lib, call InitNode()
. No segfaults.dlopen_leak
: dlopen()
lib, call InitAndLeakNode()
. Many segfaults.The ideal workaround is that dlclose()
ideally does not trigger segfaults when (poorly)
leaking rclcpp::Node
instances.
Errors w/ dlopen_leak
are non-deterministic
# Non-standard build, but just want to focus on issue.
$ source /opt/ros/humble/setup.bash
$ cmake -S . -B build
$ cmake --build build -j
$ python ./src/ros_segfault_test.py
[ direct_leak ]
num_fail: 0 / 200
returncodes: {0}
[ dlopen_noleak ]
num_fail: 0 / 200
returncodes: {0}
[ dlopen_leak ]
num_fail: 185 / 200
returncodes: {0, -11, 127}
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_avl_swap_node
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_ehh_new
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsi_update_proxy_writer
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsi_update_proxy_reader
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: plist_fini_generic
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_avl_free_arg
# Terminal 1: Run background stuff to incite segfault.
$ python ./src/ros_segfault_test.py --modes dlopen_leak --count 0
# Terminal 2: Run with gdb.
$ gdb --args ./build/ros_segfault_min_dlopen ./build/libros_segfault_min_lib.so InitAndLeakNode
# show backtrace on all threads
(gdb) thread apply all bt
Example stacktrace from main thread:
Thread 1 (Thread 0x7ffff7f8c740 (LWP 207729) "ros_segfault_mi"):
#0 __GI__dl_debug_state () at ./elf/dl-debug.c:117
#1 0x00007ffff7fc5c24 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:769
#2 0x00007ffff7fc62a2 in _dl_close_worker (force=false, map=0x555555579330) at ./elf/dl-close.c:150
#3 _dl_close (_map=0x555555579330) at ./elf/dl-close.c:818
#4 0x00007ffff7d74a98 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd090, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#5 0x00007ffff7d74b63 in __GI__dl_catch_error (objname=0x7fffffffd0e8, errstring=0x7fffffffd0f0, mallocedp=0x7fffffffd0e7, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#6 0x00007ffff7c9012e in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#7 0x00007ffff7c8fe58 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#8 0x00007ffff7fa021d in rcutils_unload_shared_library () from /opt/ros/humble/lib/librcutils.so
#9 0x00007ffff7ea9370 in rcpputils::SharedLibrary::~SharedLibrary() () from /opt/ros/humble/lib/librcpputils.so
#10 0x00007ffff7ed0a66 in ?? () from /opt/ros/humble/lib/librmw_implementation.so
#11 0x00007ffff7c45495 in __run_exit_handlers (status=0, listp=0x7ffff7e1a838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#12 0x00007ffff7c45610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#13 0x00007ffff7c29d97 in __libc_start_call_main (main=main@entry=0x5555555551a9 <main>, argc=argc@entry=3, argv=argv@entry=0x7fffffffd748) --Type <RET> for more, q to quit, c to continue without paging--
at ../sysdeps/nptl/libc_start_call_main.h:74
#14 0x00007ffff7c29e40 in __libc_start_main_impl (main=0x5555555551a9 <main>, argc=3, argv=0x7fffffffd748, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd738) at ../csu/libc-start.c:392
#15 0x00005555555550e5 in _start ()
Thanks for the detailed analysis and reproduction.
My initial hunch would be that the Node is holding a function pointer to something that is getting unloaded before the library in question.
Have you tried running this under asan
or valgrind
to see if there is more detailed stack traces?
I ran it under asan
, ubsan
, and valgrind
, but didn't see anything obvious leap out:
https://github.com/EricCousineau-TRI/rclcpp_dlclose_segfault/commit/425f602c5acf295b476a5452026b957185b8d3e5
Perhaps rcl_node_init
-> pthread_create
is something that could have function pointer / some other data race condition when dlcose()
is triggered, like you said?
Bug report
Required Info:
humble
rmw_cyclonedds_cpp
rclcpp
Steps to reproduce issue
See https://github.com/EricCousineau-TRI/rclcpp_dlclose_segfault
Expected behavior
dlclose()
does not trigger race conditions with threads -- ideally, shutdowns existing workers, or tries to orderatexit
operations manually.Actual behavior
dlclose()
appears to trigger race conditions when user leaksrclcpp::Node
.Additional information
See README in https://github.com/EricCousineau-TRI/rclcpp_dlclose_segfault