ros2 / rmw_zenoh

RMW for ROS 2 using Zenoh as the middleware
Apache License 2.0

Hang loading components #182

Open hwoithe opened 1 month ago

hwoithe commented 1 month ago

I am experiencing a hang while loading many components with rmw_zenoh.

Terminal 1:

$ ros2 run rmw_zenoh_cpp rmw_zenohd

Terminal 2:

$ export RMW_IMPLEMENTATION=rmw_zenoh_cpp
$ ros2 run rclcpp_components component_container

Terminal 3:

$ export RMW_IMPLEMENTATION=rmw_zenoh_cpp
$ for i in {1..100}; do ros2 component load --node-namespace /talker${i} /ComponentManager composition composition::Talker; done
...
[INFO] [1716659586.681864543] [rmw_zenoh_cpp]: Successfully connected to a Zenoh router with id 7d769066d659797f52ea9cc8c559737.
Loaded component 76 into '/ComponentManager' container node as '/talker79/talker'
[INFO] [1716659589.848647783] [rmw_zenoh_cpp]: Successfully connected to a Zenoh router with id 7d769066d659797f52ea9cc8c559737.
Loaded component 77 into '/ComponentManager' container node as '/talker80/talker'
[INFO] [1716659593.097119225] [rmw_zenoh_cpp]: Successfully connected to a Zenoh router with id 7d769066d659797f52ea9cc8c559737.
Loaded component 78 into '/ComponentManager' container node as '/talker81/talker'
[INFO] [1716659596.418323636] [rmw_zenoh_cpp]: Successfully connected to a Zenoh router with id 7d769066d659797f52ea9cc8c559737.

I would expect the for loop above to eventually finish, but it never does.

Let's attach to the process and investigate:

$ sudo gdb attach -p `pgrep -f talker82`
...
(gdb) info threads
  Id   Target Id                                      Frame 
* 1    Thread 0x7f13abacd740 (LWP 1649280) "ros2"     syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7f13a9d2c640 (LWP 1649281) "ros2-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7f13a952b640 (LWP 1649282) "ros2-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7f13a104e640 (LWP 1649283) "ros2"     __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f13aabeb280 <(anonymous namespace)::g_signal_handler_sem>) at ./nptl/futex-internal.c:57
  5    Thread 0x7f13a084d640 (LWP 1649284) "app-0"    0x00007f13abbf5e2e in epoll_wait (epfd=3, events=0x556946c61670, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  6    Thread 0x7f13a064c640 (LWP 1649285) "net-0"    0x00007f13abbf5e2e in epoll_wait (epfd=8, events=0x556946c64680, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  7    Thread 0x7f13a044b640 (LWP 1649286) "acc-0"    0x00007f13abbf5e2e in epoll_wait (epfd=12, events=0x556946c93120, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  8    Thread 0x7f138ffff640 (LWP 1649288) "tx-0"     0x00007f13abbf5e2e in epoll_wait (epfd=16, events=0x7f139000a5e0, maxevents=1024, timeout=462) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  9    Thread 0x7f138fdfe640 (LWP 1649289) "rx-0"     futex_wait (private=0, expected=2, futex_word=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at ../sysdeps/nptl/futex-internal.h:146
  10   Thread 0x7f138fbfd640 (LWP 1649290) "rx-1"     syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  11   Thread 0x7f138f9fc640 (LWP 1649291) "ros2"     __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7f138f9fbde0, op=137, expected=0, futex_word=0x556946f65780) at ./nptl/futex-internal.c:57
...
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f13abacd740 (LWP 1649280))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38  in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f13a26a66b4 in std::sys::pal::unix::futex::futex_wait () at library/std/src/sys/pal/unix/futex.rs:62
#2  std::sys::sync::mutex::futex::Mutex::lock_contended () at library/std/src/sys/sync/mutex/futex.rs:68
#3  0x00007f13a29b3ba5 in std::sys::sync::mutex::futex::Mutex::lock () at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys/sync/mutex/futex.rs:40
#4  std::sync::mutex::Mutex<alloc::boxed::Box<(dyn zenoh::net::routing::hat::HatTrait + core::marker::Send + core::marker::Sync), alloc::alloc::Global>>::lock<alloc::boxed::Box<(dyn zenoh::net::routing::hat::HatTrait + core::marker::Send + core::marker::Sync), alloc::alloc::Global>> (self=0x556946c7e120)
    at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sync/mutex.rs:317
#5  0x00007f13a29c4509 in zenoh::net::routing::dispatcher::face::{impl#4}::send_declare (self=0x5569468e6750, msg=...) at src/net/routing/dispatcher/face.rs:189
#6  0x00007f13a2a3dc20 in zenoh::session::Session::declare_queryable_inner (self=0x5569466acfb0, key_expr=0x7fff48bb4490, complete=true, origin=zenoh::sample::Locality::Any, callback=...) at src/session.rs:1249
#7  0x00007f13a280b6c6 in zenoh::queryable::{impl#20}::res_sync<zenoh::handlers::DefaultHandler> (self=...) at <***removed***>/.cargo/git/checkouts/zenoh-cc237f2570fab813/763a05f/zenoh/src/queryable.rs:632
#8  0x00007f13a28023c2 in zenoh_ext::publication_cache::PublicationCache::new (conf=...) at src/publication_cache.rs:160
#9  0x00007f13a28012dd in zenoh_ext::publication_cache::{impl#2}::res_sync (self=<error reading variable: Cannot access memory at address 0x89>) at src/publication_cache.rs:99
#10 0x00007f13a2779c0b in zenohcd::publication_cache::ze_declare_publication_cache (session=..., keyexpr=..., options=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/publication_cache.rs:136
#11 0x00007f13a8ccfa7d in rmw_create_publisher (node=0x556946f65c40, type_supports=0x7f13a9f3eb80, topic_name=0x556946f67630 "/rosout", qos_profile=0x7fff48bb9b30, publisher_options=0x7fff48bb9bb0) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:586
#12 0x00007f13aa26512c in rcl_publisher_init (publisher=0x7fff48bb9b28, node=0x556946f65120, type_support=0x7f13a9f3eb80, topic_name=0x7f13aa283c93 "/rosout", options=0x7fff48bb9b30) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/publisher.c:111
#13 0x00007f13aa25e0bb in rcl_logging_rosout_init_publisher_for_node (node=0x556946f65120) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/logging_rosout.c:276
#14 0x00007f13aa261c93 in rcl_node_init (node=0x556946f65120, name=0x556946f5b0a0 "_ros2cli_1649280", namespace_=0x7fff48bc0b48 "", context=0x556946cabbb0, options=0x7fff48bc07e0) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/node.c:297
#15 0x00007f13aaa22cec in rclpy::Node::Node (this=0x556946b6b170, node_name=0x556946f5b0a0 "_ros2cli_1649280", namespace_=0x7fff48bc0b48 "", context=..., pycli_args=..., use_global_arguments=true, enable_rosout=true) at <***removed***>/ws_rmw_zenoh/src/rclpy/rclpy/src/rclpy/node.cpp:464
#16 0x00007f13aaa2b7f5 in pybind11::detail::initimpl::construct_or_initialize<rclpy::Node, char const*, char const*, rclpy::Context&, pybind11::object, bool, bool, 0> () at /usr/include/pybind11/detail/init.h:61
#17 0x00007f13aaa299cb in pybind11::detail::initimpl::constructor<char const*, char const*, rclpy::Context&, pybind11::object, bool, bool>::execute<pybind11::class_<rclpy::Node, rclpy::Destroyable, std::shared_ptr<rclpy::Node> >, , 0>(pybind11::class_<rclpy::Node, rclpy::Destroyable, std::shared_ptr<rclpy::Node> >&)::{lambda(pybind11::detail::value_and_holder&, char const*, char const*, rclpy::Context&, pybind11::object, bool, bool)#1}::operator()(pybind11::detail::value_and_holder&, char const*, char const*, rclpy::Context&, pybind11::object, bool, bool) const (__closure=0x5569467ced28, v_h=..., args#0=0x556946f5b0a0 "_ros2cli_1649280", 
    args#1=0x7fff48bc0b48 "", args#2=..., args#3=..., args#4=true, args#5=true) at /usr/include/pybind11/detail/init.h:178
...

We can see from frame 5 in thread 1 that it wants to acquire the ctrl_lock: https://github.com/eclipse-zenoh/zenoh/blob/763a05f0d61a894cf3cbeb07223e2ea726777325/zenoh/src/net/routing/dispatcher/face.rs#L189

Another thread must be holding that lock:

(gdb) thread 9
[Switching to thread 9 (Thread 0x7f138fdfe640 (LWP 1649289))]
#0  futex_wait (private=0, expected=2, futex_word=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at ../sysdeps/nptl/futex-internal.h:146
146 ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f13abb6805d in lll_mutex_lock_optimized (mutex=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at ./nptl/pthread_mutex_lock.c:128
#4  0x00007f13aaa1c75c in __gthread_mutex_lock (__mutex=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at /usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#5  0x00007f13aaa1c77d in __gthread_recursive_mutex_lock (__mutex=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at /usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:811
#6  0x00007f13aaa1c79c in std::recursive_mutex::lock (this=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at /usr/include/c++/11/mutex:108
#7  0x00007f13aaa1c7e0 in std::lock_guard<std::recursive_mutex>::lock_guard (this=0x7f138fdf0408, __m=...) at /usr/include/c++/11/bits/std_mutex.h:229
#8  0x00007f13aaa1b742 in rclpy::LoggingGuard::LoggingGuard (this=0x7f138fdf0408) at <***removed***>/ws_rmw_zenoh/src/rclpy/rclpy/src/rclpy/logging.cpp:33
#9  0x00007f13aaa1b786 in rclpy::rclpy_thread_safe_logging_output_handler (location=0x7f13a8d29ed0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f13a8ceb6c5 "rmw_zenoh_cpp", 
    timestamp=1716659597585131569, format=0x7f13a8cebf30 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...", args=0x7f138fdf04c0)
    at <***removed***>/ws_rmw_zenoh/src/rclpy/rclpy/src/rclpy/logging.cpp:51
#10 0x00007f13aa224d9c in vrcutils_log_internal (location=0x7f13a8d29ed0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f13a8ceb6c5 "rmw_zenoh_cpp", 
    format=0x7f13a8cebf30 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...", args=0x7f138fdf04c0) at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1095
#11 0x00007f13aa224f7c in rcutils_log_internal (location=0x7f13a8d29ed0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f13a8ceb6c5 "rmw_zenoh_cpp", 
    format=0x7f13a8cebf30 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...") at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1119
#12 0x00007f13a8c80fe4 in rmw_zenoh_cpp::GraphCache::parse_del (this=0x556946a04f70, keyexpr="@ros2_lv/0/b89f34a8f6a786788926e9c41d11be8e/295/295/NN/%talker25/talker", ignore_from_current_session=false) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/detail/graph_cache.cpp:603
#13 0x00007f13a8cc87fe in graph_sub_data_handler (sample=0x7f138fdf1a70, data=0x556946bbe550) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_init.cpp:73
#14 0x00007f13a26acead in zenohcd::closures::sample_closure::z_closure_sample_call (closure=0x556946d08ab0, sample=0x7f138fdf1a70) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/closures/sample_closure.rs:52
#15 0x00007f13a2785be1 in zenohcd::liveliness::zc_liveliness_declare_subscriber::{closure#0} (sample=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:186
#16 0x00007f13a2a426c8 in zenoh::session::Session::handle_data (self=0x556946bf5ed0, local=false, key_expr=0x7f138fdf52d0, info=..., payload=..., attachment=...) at src/session.rs:1712
#17 0x00007f13a2a4ce44 in zenoh::session::{impl#13}::send_declare (self=0x556946bf5ed0, msg=...) at src/session.rs:2195
#18 0x00007f13a2a58414 in zenoh::session::{impl#16}::send_declare (self=0x556946bf5ed0, ctx=...) at src/session.rs:2701
#19 0x00007f13a2a19fdf in zenoh::net::routing::hat::p2p_peer::pubsub::propagate_forget_simple_subscription (tables=0x556946c7e040, res=0x7f138fdf7548) at src/net/routing/hat/p2p_peer/pubsub.rs:175
#20 0x00007f13a2a1a8b1 in zenoh::net::routing::hat::p2p_peer::pubsub::undeclare_client_subscription (tables=0x556946c7e040, face=0x7f138fdf8028, res=0x7f138fdf7548) at src/net/routing/hat/p2p_peer/pubsub.rs:205
#21 0x00007f13a2a1aeb7 in zenoh::net::routing::hat::p2p_peer::pubsub::forget_client_subscription (tables=0x556946c7e040, face=0x7f138fdf8028, res=0x7f138fdf7548) at src/net/routing/hat/p2p_peer/pubsub.rs:236
#22 0x00007f13a2b1e7dd in zenoh::net::routing::hat::p2p_peer::pubsub::{impl#0}::undeclare_subscription (self=0x1, tables=0x556946c7e040, face=0x7f138fdf8028, res=0x7f138fdf7548, _node_id=0) at src/net/routing/hat/p2p_peer/pubsub.rs:275
#23 0x00007f13a2b60bd5 in zenoh::net::routing::dispatcher::pubsub::undeclare_subscription (hat_code=..., tables=0x556946c7e030, face=0x7f138fdf8028, expr=0x7f138fdf8000, node_id=0) at src/net/routing/dispatcher/pubsub.rs:108
#24 0x00007f13a29c4da9 in zenoh::net::routing::dispatcher::face::{impl#4}::send_declare (self=0x7f1390003bd0, msg=...) at src/net/routing/dispatcher/face.rs:208
#25 0x00007f13a29c25fe in zenoh::net::primitives::demux::{impl#1}::handle_message (self=0x7f1390003bd0, msg=...) at src/net/primitives/demux.rs:69
#26 0x00007f13a2b3267e in zenoh::net::runtime::{impl#6}::handle_message (self=0x7f1390009e10, msg=...) at src/net/runtime/mod.rs:385
#27 0x00007f13a2d41a79 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::trigger_callback (self=0x7f1390136cc0, callback=..., msg=...) at src/unicast/universal/rx.rs:49
#28 0x00007f13a2d42ae3 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::handle_frame (self=0x7f1390136cc0, frame=...) at src/unicast/universal/rx.rs:100
#29 0x00007f13a2d45b21 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::read_messages (self=0x7f1390136cc0, batch=..., link=0x7f1390136d88) at src/unicast/universal/rx.rs:204
#30 0x00007f13a2ce826e in zenoh_transport::unicast::universal::link::rx_task::{async_fn#0} () at src/unicast/universal/link.rs:266
#31 0x00007f13a2ce4ba5 in zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block#0} () at src/unicast/universal/link.rs:124
#32 0x00007f13a2d50a94 in tokio_util::task::task_tracker::{impl#8}::poll<zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block_env#0}> (self=..., cx=0x7f138fdfc5a8) at <***removed***>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-util-0.7.10/src/task/task_tracker.rs:669
...

Frame 24 on thread 9 indicates that the thread must have acquired the ctrl_lock earlier in send_declare(), since it has reached the call to undeclare_subscription(): https://github.com/eclipse-zenoh/zenoh/blob/763a05f0d61a894cf3cbeb07223e2ea726777325/zenoh/src/net/routing/dispatcher/face.rs#L208

This means that thread 9 is holding the ctrl_lock that thread 1 is waiting on.

Thread 9 is also waiting:

(gdb) thread 9
[Switching to thread 9 (Thread 0x7f138fdfe640 (LWP 1649289))]
#0  futex_wait (private=0, expected=2, futex_word=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at ../sysdeps/nptl/futex-internal.h:146
146 in ../sysdeps/nptl/futex-internal.h
(gdb) up 3
#3  ___pthread_mutex_lock (mutex=0x7f13aabeaa60 <rclpy::LoggingGuard::logging_mutex_>) at ./nptl/pthread_mutex_lock.c:128
128 ./nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) print mutex.__data.__owner
$1 = 1649280
(gdb) thread find 1649280
Thread 1 has target id 'Thread 0x7f13abacd740 (LWP 1649280)'

Thread 9 is waiting on some resource held by thread 1.

Frame 9 on thread 9 wants to acquire logging_mutex_ in rclpy_thread_safe_logging_output_handler(): https://github.com/ros2/rclpy/blob/52f2a69325721e1368ac3176adcc348e1f53f93e/rclpy/src/rclpy/logging.cpp#L51

Frame 15 on thread 1 shows rclpy::Node::Node(), which acquires logging_mutex_ right before calling rcl_node_init(): https://github.com/ros2/rclpy/blob/52f2a69325721e1368ac3176adcc348e1f53f93e/rclpy/src/rclpy/node.cpp#L464

This means that thread 1 is holding the logging_mutex_ that thread 9 is waiting on.
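
For reference, the inversion described above is the classic ABBA deadlock pattern. Below is a minimal, self-contained C++ sketch (with hypothetical mutex names standing in for logging_mutex_ and the Zenoh ctrl_lock, not the actual ROS/Zenoh code) that reproduces the same hang:

// Minimal ABBA deadlock sketch (hypothetical names; not the actual ROS/Zenoh code).
// thread_a mimics thread 1: it holds the "logging" mutex, then needs the "ctrl" mutex.
// thread_b mimics thread 9: it holds the "ctrl" mutex, then needs the "logging" mutex.
#include <chrono>
#include <mutex>
#include <thread>

std::recursive_mutex logging_mutex;  // stands in for rclpy::LoggingGuard::logging_mutex_
std::mutex ctrl_mutex;               // stands in for Zenoh's ctrl_lock

int main() {
  std::thread thread_a([] {
    std::lock_guard<std::recursive_mutex> log_lk(logging_mutex);  // taken around node init
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::lock_guard<std::mutex> ctrl_lk(ctrl_mutex);              // blocks: held by thread_b
  });
  std::thread thread_b([] {
    std::lock_guard<std::mutex> ctrl_lk(ctrl_mutex);              // taken while routing a message
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::lock_guard<std::recursive_mutex> log_lk(logging_mutex);  // blocks: held by thread_a
  });
  thread_a.join();  // never returns once both threads are blocked
  thread_b.join();
  return 0;
}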

Hopefully I have read the situation correctly here and, at the very least, others are able to reproduce the problem. I will continue this investigation after the US holiday weekend.

ROS Distribution: Iron
OS: Ubuntu Jammy
rmw_zenohd: commit 1ac6bd2ae1d3a9f741167bb2c771e42bea50229f
rust: rustc 1.78.0 (9b00956e5 2024-04-29)

Yadunund commented 1 month ago

I think this is similar to https://github.com/ros2/rmw_zenoh/issues/143, which may be a consequence of #170.

hwoithe commented 1 month ago

I have updated my Iron environment and am still experiencing the same issue where a node loading a component in the bash for loop eventually hangs. I am observing a different issue in Jazzy and will follow up with a more detailed description.

Between the Iron and Jazzy releases, the use of the global logger mutex in rclpy has changed. In Iron we hold it for the entire execution of rcl_node_init(): https://github.com/ros2/rclpy/blob/52f2a69325721e1368ac3176adcc348e1f53f93e/rclpy/src/rclpy/node.cpp#L463

That is not the case for Jazzy: https://github.com/ros2/rclpy/blob/e181737f19d6b8cb335b0326a40ec2075267c555/rclpy/src/rclpy/node.cpp#L503

This was likely changed in rclpy by: https://github.com/ros2/rclpy/pull/1121 https://github.com/ros2/rclpy/commit/827d59729f266d9449d8fcb9752707bbd052527d

That pull request addressed a similar kind of issue, though in our case a different set of RMW-specific locks is involved: https://github.com/ros2/rclcpp/issues/2147
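
To make the difference in lock scope concrete, here is a simplified C++ sketch (hypothetical stub functions, not the actual rclpy code) of the two strategies: in Iron the global logging mutex is held across the whole of rcl_node_init(), while after the change above it is only held for the steps that actually touch shared logging state:

#include <mutex>

std::recursive_mutex logging_mutex;  // global logger mutex (stand-in for rclpy's)

void rcl_node_init_stub() { /* creates the node; with rosout enabled, also a publisher */ }
void configure_rosout_logging_stub() { /* touches shared logging state */ }

// Iron-style: the guard wraps the entire node initialisation, so any other thread
// that tries to log while rcl_node_init_stub() blocks (e.g. inside the RMW) waits here.
void create_node_iron_style() {
  std::lock_guard<std::recursive_mutex> guard(logging_mutex);
  rcl_node_init_stub();
}

// Jazzy-style (after ros2/rclpy#1121): only the logging-related step is guarded,
// so the mutex is not held while the RMW may block on its own internal locks.
void create_node_jazzy_style() {
  {
    std::lock_guard<std::recursive_mutex> guard(logging_mutex);
    configure_rosout_logging_stub();
  }
  rcl_node_init_stub();
}

int main() {
  create_node_iron_style();
  create_node_jazzy_style();
  return 0;
}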

hwoithe commented 1 month ago

As mentioned above, in Iron the node loading a component in the bash for loop eventually hangs. In Jazzy, it is the component_container node that eventually hangs.

$ echo $ROS_DISTRO
jazzy
$ sudo gdb -p `pgrep -f 'rclcpp_components/component_container'`
...
(gdb) info threads
  Id   Target Id                                             Frame 
* 1    Thread 0x7f557abe18c0 (LWP 3525805) "component_conta" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7f5575f746c0 (LWP 3525848) "rx-2"            futex_wait (private=0, expected=2, futex_word=0x55e65761d250) at ../sysdeps/nptl/futex-internal.h:146
  3    Thread 0x7f557696b6c0 (LWP 3525819) "component_conta" 0x00007f557b10cd61 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55e65761eb78) at ./nptl/futex-internal.c:57
  4    Thread 0x7f557716c6c0 (LWP 3525818) "component_conta" 0x00007f557b10cd61 in __futex_abstimed_wait_common64 (private=1997974368, cancel=true, abstime=0x7f557716acd0, op=137, expected=0, futex_word=0x55e65761d4c0) at ./nptl/futex-internal.c:57
  5    Thread 0x7f557736d6c0 (LWP 3525817) "rx-1"            syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  6    Thread 0x7f557756e6c0 (LWP 3525816) "rx-0"            syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7f557776f6c0 (LWP 3525815) "tx-0"            0x00007f557b19e042 in epoll_wait (epfd=16, events=0x7f556800a140, maxevents=1024, timeout=1894) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  8    Thread 0x7f5577b716c0 (LWP 3525810) "acc-0"           syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  9    Thread 0x7f5577d726c0 (LWP 3525809) "net-0"           0x00007f557b19e042 in epoll_wait (epfd=8, events=0x55e65760fb10, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  10   Thread 0x7f5577f736c0 (LWP 3525808) "app-0"           0x00007f557b19e042 in epoll_wait (epfd=3, events=0x55e65760a580, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  11   Thread 0x7f557a3b56c0 (LWP 3525807) "component_c-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  12   Thread 0x7f557abb66c0 (LWP 3525806) "component_c-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f557abe18c0 (LWP 3525805))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38  in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f557816ba20 in std::sys::pal::unix::futex::futex_wait () at library/std/src/sys/pal/unix/futex.rs:62
#2  std::sys::sync::rwlock::futex::RwLock::read_contended () at library/std/src/sys/sync/rwlock/futex.rs:137
#3  0x00007f557838adc9 in std::sys::sync::rwlock::futex::RwLock::read (self=0x55e65760ee10) at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys/sync/rwlock/futex.rs:86
#4  0x00007f5578590115 in std::sync::rwlock::RwLock<zenoh::net::routing::dispatcher::tables::Tables>::read<zenoh::net::routing::dispatcher::tables::Tables> (self=0x55e65760ee10) at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sync/rwlock.rs:273
#5  0x00007f5578628e4b in zenoh::net::routing::dispatcher::pubsub::full_reentrant_route_data (tables_ref=0x55e65760a410, face=0x55e65760a2d0, expr=0x7ffedb5ce938, ext_qos=..., ext_tstamp=..., payload=..., routing_context=0) at src/net/routing/dispatcher/pubsub.rs:436
#6  0x00007f557848a3a8 in zenoh::net::routing::dispatcher::face::{impl#4}::send_push (self=0x55e65760a410, msg=...) at src/net/routing/dispatcher/face.rs:246
#7  0x00007f557853da7d in zenoh::publication::resolve_put (publisher=0x55e657839110, value=..., kind=zenoh_protocol::core::SampleKind::Put, attachment=...) at src/publication.rs:841
#8  0x00007f557853bcb6 in zenoh::publication::{impl#13}::res_sync (self=...) at src/publication.rs:653
#9  0x00007f55781e8ef7 in zenohcd::publisher::z_publisher_put (publisher=..., payload=0x55e657bf72b0, len=200, options=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/publisher.rs:274
#10 0x00007f5579b5fbe9 in rmw_publish (publisher=0x55e6577e8d30, ros_message=0x7ffedb5d0240, allocation=0x0) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:945
#11 0x00007f557b025a15 in rcl_publish (publisher=0x7ffedb5d01e8, ros_message=0x7ffedb5d0240, allocation=0x0) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/publisher.c:283
#12 0x00007f557b01de82 in rcl_logging_rosout_output_handler (location=0x7f55760bb7b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x55e6577d6280 "talker7.talker", timestamp=1717591806607772690, format=0x7f557609a644 "Publishing: '%s'", args=0x7ffedb5d2050) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/logging_rosout.c:354
#13 0x00007f557b01f0fe in rcl_logging_multiple_output_handler (location=0x7f55760bb7b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x55e6577d6280 "talker7.talker", timestamp=1717591806607772690, format=0x7f557609a644 "Publishing: '%s'", args=0x7ffedb5d2050) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/logging.c:162
#14 0x00007f557c04fc30 in rclcpp_logging_output_handler (location=0x7f55760bb7b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x55e6577d6280 "talker7.talker", timestamp=1717591806607772690, format=0x7f557609a644 "Publishing: '%s'", args=0x7ffedb5d2050) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/context.cpp:134
#15 0x00007f557b55d427 in vrcutils_log_internal (location=0x7f55760bb7b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x55e6577d6280 "talker7.talker", format=0x7f557609a644 "Publishing: '%s'", args=0x7ffedb5d2050) at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1212
#16 0x00007f557b55d607 in rcutils_log_internal (location=0x7f55760bb7b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x55e6577d6280 "talker7.talker", format=0x7f557609a644 "Publishing: '%s'") at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1236
#17 0x00007f557606c6ac in composition::Talker::on_timer (this=0x55e657838bb0) at <***removed***>/ws_rmw_zenoh/src/demos/composition/src/talker_component.cpp:47
#18 0x00007f557606c02f in operator() (__closure=0x55e65787b768) at <***removed***>/ws_rmw_zenoh/src/demos/composition/src/talker_component.cpp:40
#19 0x00007f557606e0f4 in rclcpp::GenericTimer<composition::Talker::Talker(const rclcpp::NodeOptions&)::<lambda()>, 0>::execute_callback_delegate<>(const rcl_timer_call_info_t &) (this=0x55e65787b6f0) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/include/rclcpp/timer.hpp:306
#20 0x00007f557606e077 in rclcpp::GenericTimer<composition::Talker::Talker(const rclcpp::NodeOptions&)::<lambda()>, 0>::execute_callback(const std::shared_ptr<void> &) (this=0x55e65787b6f0, data=std::shared_ptr<void> (use count 1, weak count 0) = {...}) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/include/rclcpp/timer.hpp:292
#21 0x00007f557c072044 in rclcpp::Executor::execute_timer (timer=warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<rclcpp::WallTimer<composition::Talker::Talker(rclcpp::NodeOptions const&)::{lambda()#1}, (void*)0>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<rclcpp::WallTimer<composition::Talker::Talker(rclcpp::NodeOptions const&)::{lambda()#1}, (void*)0>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr<rclcpp::TimerBase> (use count 4, weak count 3) = {...}, data_ptr=std::shared_ptr<void> (use count 1, weak count 0) = {...}) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:629
#22 0x00007f557c07111e in rclcpp::Executor::execute_any_executable (this=0x55e65761ed50, any_exec=...) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:465
#23 0x00007f557c0c9042 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x55e65761ed50) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:42
#24 0x000055e655edd636 in main (argc=1, argv=0x7ffedb5d2d88) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp_components/src/component_container.cpp:28

Thread 1 wants to acquire a read lock on tables_ref.tables in frame 5. https://github.com/eclipse-zenoh/zenoh/blob/763a05f0d61a894cf3cbeb07223e2ea726777325/zenoh/src/net/routing/dispatcher/pubsub.rs#L436

Another thread must be holding a write lock on tables_ref.tables:

(gdb) thread 2
[Switching to thread 2 (Thread 0x7f5575f746c0 (LWP 3525848))]
#0  futex_wait (private=0, expected=2, futex_word=0x55e65761d250) at ../sysdeps/nptl/futex-internal.h:146
warning: 146    ../sysdeps/nptl/futex-internal.h: No such file or directory
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x55e65761d250) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x55e65761d250, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f557b114147 in lll_mutex_lock_optimized (mutex=0x55e65761d250) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x55e65761d250) at ./nptl/pthread_mutex_lock.c:128
#4  0x00007f557c641162 in __gthread_mutex_lock (__mutex=0x55e65761d250) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#5  0x00007f557c6411b2 in __gthread_recursive_mutex_lock (__mutex=0x55e65761d250) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:811
#6  0x00007f557c641822 in std::recursive_mutex::lock (this=0x55e65761d250) at /usr/include/c++/13/mutex:120
#7  0x00007f557c644b3c in std::lock_guard<std::recursive_mutex>::lock_guard (this=0x7f5575f66980, __m=...) at /usr/include/c++/13/bits/std_mutex.h:249
#8  0x00007f557c04fc0b in rclcpp_logging_output_handler (location=0x7f5579bb3ef0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f5579b796ed "rmw_zenoh_cpp", timestamp=1717591806608511791, 
    format=0x7f5579b79e60 "Received liveliness token to remove unknown node /%s from the graph. Ignoring...", args=0x7f5575f66a40) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/context.cpp:133
#9  0x00007f557b55d427 in vrcutils_log_internal (location=0x7f5579bb3ef0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f5579b796ed "rmw_zenoh_cpp", format=0x7f5579b79e60 "Received liveliness token to remove unknown node /%s from the graph. Ignoring...", args=0x7f5575f66a40)
    at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1212
#10 0x00007f557b55d607 in rcutils_log_internal (location=0x7f5579bb3ef0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f5579b796ed "rmw_zenoh_cpp", format=0x7f5579b79e60 "Received liveliness token to remove unknown node /%s from the graph. Ignoring...")
    at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1236
#11 0x00007f5579b093f6 in rmw_zenoh_cpp::GraphCache::parse_del (this=0x55e65763ec80, keyexpr="@ros2_lv/0/e6d7b9ee7d7b70fc9b3d883b9dd6e1/0/3/SS/%/_ros2cli_3526286/%_ros2cli_3526286%get_type_description/type_description_interfaces::srv::dds_::GetTypeDescription_/1:2:1,10", ignore_from_current_session=false)
    at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/detail/graph_cache.cpp:584
#12 0x00007f5579b5667a in graph_sub_data_handler (sample=0x7f5575f68000, data=0x55e657608e30) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_init.cpp:73
#13 0x00007f5578171ead in zenohcd::closures::sample_closure::z_closure_sample_call (closure=0x55e65761c5e0, sample=0x7f5575f68000) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/closures/sample_closure.rs:52
#14 0x00007f557824abe1 in zenohcd::liveliness::zc_liveliness_declare_subscriber::{closure#0} (sample=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:186
#15 0x00007f55785076c8 in zenoh::session::Session::handle_data (self=0x55e657616090, local=false, key_expr=0x7f5575f6b860, info=..., payload=..., attachment=...) at src/session.rs:1712
#16 0x00007f5578511e44 in zenoh::session::{impl#13}::send_declare (self=0x55e657616090, msg=...) at src/session.rs:2195
#17 0x00007f557851d414 in zenoh::session::{impl#16}::send_declare (self=0x55e657616090, ctx=...) at src/session.rs:2701
#18 0x00007f55784defdf in zenoh::net::routing::hat::p2p_peer::pubsub::propagate_forget_simple_subscription (tables=0x55e65760ee20, res=0x7f5575f6dba0) at src/net/routing/hat/p2p_peer/pubsub.rs:175
#19 0x00007f55784df8b1 in zenoh::net::routing::hat::p2p_peer::pubsub::undeclare_client_subscription (tables=0x55e65760ee20, face=0x7f5575f6d978, res=0x7f5575f6dba0) at src/net/routing/hat/p2p_peer/pubsub.rs:205
#20 0x00007f55785ea8fc in zenoh::net::routing::hat::p2p_peer::{impl#1}::close_face (self=0x1, tables=0x55e65760ee10, face=0x7f5575f6e5f0) at src/net/routing/hat/p2p_peer/mod.rs:188
#21 0x00007f55785e36e2 in zenoh::net::routing::dispatcher::tables::close_face (tables=0x55e65760ee10, face=0x7f5575f6eb38) at src/net/routing/dispatcher/tables.rs:177
#22 0x00007f557848aee2 in zenoh::net::routing::dispatcher::face::{impl#4}::send_close (self=0x7f556c1518b0) at src/net/routing/dispatcher/face.rs:297
#23 0x00007f5578487e9a in zenoh::net::primitives::demux::{impl#1}::closing (self=0x7f556c1518b0) at src/net/primitives/demux.rs:90
#24 0x00007f55785f7a11 in zenoh::net::runtime::{impl#6}::closing (self=0x7f556c00e0d0) at src/net/runtime/mod.rs:403
#25 0x00007f5578804128 in zenoh_transport::unicast::universal::transport::{impl#0}::delete::{async_fn#0} () at src/unicast/universal/transport.rs:138
#26 0x00007f5578805a5b in zenoh_transport::unicast::universal::transport::{impl#0}::del_link::{async_fn#0} () at src/unicast/universal/transport.rs:202
#27 0x00007f55787aa68c in zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block#0}::{async_block#0} () at src/unicast/universal/link.rs:134
#28 0x00007f557882d5cb in tokio::runtime::task::core::{impl#6}::poll::{closure#0}<zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block#0}::{async_block_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>> (ptr=0x7f555003cf30) at <***removed***>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:328
...
(gdb) up 3
#3  ___pthread_mutex_lock (mutex=0x55e65761d250) at ./nptl/pthread_mutex_lock.c:128
warning: 128    ./nptl/pthread_mutex_lock.c: No such file or directory
(gdb) print mutex.__data.__owner
$1 = 3525805
(gdb) thread find 3525805
Thread 1 has target id 'Thread 0x7f557abe18c0 (LWP 3525805)'

We can see that thread 2, in frame 20, acquires the write lock on tables.tables early in close_face(), before that function calls undeclare_client_subscription() (frame 19). https://github.com/eclipse-zenoh/zenoh/blob/763a05f0d61a894cf3cbeb07223e2ea726777325/zenoh/src/net/routing/hat/p2p_peer/mod.rs#L165

Thread 2 is itself blocked, so it cannot release the tables.tables write lock: in frame 8 it is waiting on logging_mutex in rclcpp_logging_output_handler(). https://github.com/ros2/rclcpp/blob/0beba9715675e06bf63f041fd1046b2f23d3b3ce/rclcpp/src/rclcpp/context.cpp#L133

We can see that thread 1 is also in rclcpp_logging_output_handler() (frame 14) and has acquired logging_mutex before calling rcl_logging_multiple_output_handler() (frame 13). https://github.com/ros2/rclcpp/blob/0beba9715675e06bf63f041fd1046b2f23d3b3ce/rclcpp/src/rclcpp/context.cpp#L134

rmw_zenoh: rolling (66ec842030fda0cd381835f8ed2577f331fb5729)
rclcpp: jazzy (0beba9715675e06bf63f041fd1046b2f23d3b3ce)
zenoh: 763a05f0d61a894cf3cbeb07223e2ea726777325

JEnoch commented 1 month ago

I don't think the problem is related to #170 or to Python. In both of the provided stacks, the pattern is the same:

So the root of the problem occurs when Zenoh control messages (e.g. liveliness events) lead to a user callback being called: Zenoh keeps an internal lock that blocks any other thread trying to send a message with Zenoh. We'll investigate a solution to this issue in https://github.com/eclipse-zenoh/zenoh/issues/1092

@Yadunund in the meantime, a workaround could be to make sure no mutex is taken in liveliness callbacks. Do you think deactivating the logging in the graph cache would be possible and sufficient?
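
For illustration, one possible shape of such a workaround (a hypothetical sketch, not the project's actual fix) is to never call the logging macros from inside the liveliness callback and instead queue the messages for a separate thread that does not run under Zenoh's internal locks:

// Hypothetical sketch: defer log output out of the Zenoh liveliness callback so that
// no ROS logging mutex is taken while Zenoh holds its internal routing locks.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>

class DeferredLogger {
public:
  DeferredLogger() : worker_([this] { run(); }) {}
  ~DeferredLogger() {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Called from the liveliness callback: touches only this object's own mutex,
  // never the global logging mutex.
  void log(std::string msg) {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      queue_.push_back(std::move(msg));
    }
    cv_.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lk(mtx_);
    while (true) {
      cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
      while (!queue_.empty()) {
        std::string msg = std::move(queue_.front());
        queue_.pop_front();
        lk.unlock();
        // The real code could call RCUTILS_LOG_WARN_NAMED(...) with msg here, since
        // this thread is not running inside a Zenoh callback.
        lk.lock();
      }
      if (done_) {
        break;
      }
    }
  }

  std::mutex mtx_;
  std::condition_variable cv_;
  std::deque<std::string> queue_;
  bool done_ = false;
  std::thread worker_;
};

int main() {
  DeferredLogger logger;
  // What graph_sub_data_handler / GraphCache::parse_del could do instead of logging directly:
  logger.log("Received liveliness token to remove unknown node from the graph. Ignoring...");
  return 0;
}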

Yadunund commented 1 month ago

@Yadunund in the meantime, a workaround could be to make sure no mutex is taken in liveliness callbacks. Do you think deactivating the logging in the graph cache would be possible and sufficient?

As discussed yesterday, it is a bit strange that logging from the rmw layer is publishing to rosout. I will look into this.

JEnoch commented 1 month ago

As discussed yesterday, it is a bit strange that logging from the rmw layer is publishing to rosout. I will look into this.

Actually I was wrong, sorry for the confusion... I was confused by frame #13 in the first stack:

#13 0x00007f13aa25e0bb in rcl_logging_rosout_init_publisher_for_node (node=0x556946f65120) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/logging_rosout.c:276

This is probably just the initialisation of the rosout publisher, which leads to the creation of a Zenoh publisher and hence to waiting on a lock in the routing layer. Still, this thread is holding the logging_mutex, and that is the common point with the other stacks, even when no publication to rosout is occurring.

Yadunund commented 4 weeks ago

That's good to know. Shall wait for the fix for https://github.com/eclipse-zenoh/zenoh/issues/1092.

JEnoch commented 3 weeks ago

We have a fix candidate in a branch: https://github.com/eclipse-zenoh/zenoh/commits/dev/reentrant_declares/

#200 is tracking this branch.

@hwoithe I reproduced the same issue with your scenario, and switching to #200 fixes the issue for me. Could you please test and confirm?

hwoithe commented 3 weeks ago

I am a bit short on time today, but I was able to take a quick look, and I am still experiencing the same problem under Jazzy where the component_container eventually hangs. Overall, it appears quite similar to the Jazzy test above: https://github.com/ros2/rmw_zenoh/issues/182#issuecomment-2150633477

I have not yet retested Iron.

A quick debug session:

$ sudo gdb -p `pgrep -f 'rclcpp_components/component_container'`
...
(gdb) info threads
  Id   Target Id                                             Frame 
* 1    Thread 0x7fe64b65b8c0 (LWP 3416556) "component_conta" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7fe63a2056c0 (LWP 3500878) "net-2"           syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7fe63affa6c0 (LWP 3416696) "component_conta" 0x00007fe64bb86d61 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x559ddd95d368) at ./nptl/futex-internal.c:57
  4    Thread 0x7fe63b7fb6c0 (LWP 3416695) "component_conta" 0x00007fe64bb86d61 in __futex_abstimed_wait_common64 (private=998218592, cancel=true, abstime=0x7fe63b7f9cd0, op=137, expected=0, futex_word=0x559ddd95cef0) at ./nptl/futex-internal.c:57
  5    Thread 0x7fe63b9fc6c0 (LWP 3416584) "rx-1"            futex_wait (private=0, expected=2, futex_word=0x559ddd95b6f0) at ../sysdeps/nptl/futex-internal.h:146
  6    Thread 0x7fe63bbfd6c0 (LWP 3416583) "rx-0"            syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7fe63bdfe6c0 (LWP 3416582) "tx-0"            0x00007fe64bc18042 in epoll_wait (epfd=16, events=0x7fe63c00a220, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  8    Thread 0x7fe64839e6c0 (LWP 3416579) "acc-0"           syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  9    Thread 0x7fe6487a06c0 (LWP 3416577) "net-0"           syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  10   Thread 0x7fe6489a16c0 (LWP 3416576) "app-0"           0x00007fe64bc18042 in epoll_wait (epfd=3, events=0x559ddd948200, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  11   Thread 0x7fe64ae2f6c0 (LWP 3416575) "component_c-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  12   Thread 0x7fe64b6306c0 (LWP 3416574) "component_c-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
...
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fe64b65b8c0 (LWP 3416556))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38  in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fe648b9ca80 in std::sys::pal::unix::futex::futex_wait () at library/std/src/sys/pal/unix/futex.rs:62
#2  std::sys::pal::unix::locks::futex_rwlock::RwLock::read_contended () at library/std/src/sys/pal/unix/locks/futex_rwlock.rs:137
#3  0x00007fe649062a17 in std::sys::pal::unix::locks::futex_rwlock::RwLock::read (self=0x559ddd94cb50) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys/pal/unix/locks/futex_rwlock.rs:86
#4  0x00007fe648dc8805 in std::sync::rwlock::RwLock<zenoh::net::routing::dispatcher::tables::Tables>::read<zenoh::net::routing::dispatcher::tables::Tables> (self=0x559ddd94cb50) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sync/rwlock.rs:210
#5  0x00007fe648f313ab in zenoh::net::routing::dispatcher::pubsub::full_reentrant_route_data (tables_ref=0x559ddd947a30, face=0x559ddd953d40, expr=0x7ffc82822868, ext_qos=..., ext_tstamp=..., payload=..., routing_context=0) at src/net/routing/dispatcher/pubsub.rs:451
#6  0x00007fe648e574e8 in zenoh::net::routing::dispatcher::face::{impl#4}::send_push (self=0x559ddd947a30, msg=...) at src/net/routing/dispatcher/face.rs:269
#7  0x00007fe6490b9f13 in zenoh::publication::resolve_put (publisher=0x559dddc42d90, value=..., kind=zenoh_protocol::core::SampleKind::Put, attachment=...) at src/publication.rs:841
#8  0x00007fe6490b8136 in zenoh::publication::{impl#13}::res_sync (self=...) at src/publication.rs:653
#9  0x00007fe648c28247 in zenohcd::publisher::z_publisher_put (publisher=..., payload=0x559dde1a98a0, len=208, options=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/publisher.rs:274
#10 0x00007fe64a5d9be9 in rmw_publish (publisher=0x559dddf46d90, ros_message=0x7ffc82824170, allocation=0x0) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:945
#11 0x00007fe64ba9fa15 in rcl_publish (publisher=0x7ffc82824118, ros_message=0x7ffc82824170, allocation=0x0) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/publisher.c:283
#12 0x00007fe64ba97e82 in rcl_logging_rosout_output_handler (location=0x7fe6480e87b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x559dddf3dc70 "talker19.talker", timestamp=1718042153752681058, format=0x7fe6480c7651 "Publishing: '%s'", args=0x7ffc82825f80) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/logging_rosout.c:354
#13 0x00007fe64ba990fe in rcl_logging_multiple_output_handler (location=0x7fe6480e87b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x559dddf3dc70 "talker19.talker", timestamp=1718042153752681058, format=0x7fe6480c7651 "Publishing: '%s'", args=0x7ffc82825f80) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/logging.c:162
#14 0x00007fe64cac1c30 in rclcpp_logging_output_handler (location=0x7fe6480e87b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x559dddf3dc70 "talker19.talker", timestamp=1718042153752681058, format=0x7fe6480c7651 "Publishing: '%s'", args=0x7ffc82825f80) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/context.cpp:134
#15 0x00007fe64bfcf427 in vrcutils_log_internal (location=0x7fe6480e87b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x559dddf3dc70 "talker19.talker", format=0x7fe6480c7651 "Publishing: '%s'", args=0x7ffc82825f80) at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1212
#16 0x00007fe64bfcf607 in rcutils_log_internal (location=0x7fe6480e87b0 <composition::Talker::on_timer()::__rcutils_logging_location>, severity=20, name=0x559dddf3dc70 "talker19.talker", format=0x7fe6480c7651 "Publishing: '%s'") at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1236
#17 0x00007fe6480996ac in composition::Talker::on_timer (this=0x559dddf553b0) at <***removed***>/ws_rmw_zenoh/src/demos/composition/src/talker_component.cpp:47
#18 0x00007fe64809902f in operator() (__closure=0x559dddfa2248) at <***removed***>/ws_rmw_zenoh/src/demos/composition/src/talker_component.cpp:40
#19 0x00007fe64809b0f4 in rclcpp::GenericTimer<composition::Talker::Talker(const rclcpp::NodeOptions&)::<lambda()>, 0>::execute_callback_delegate<>(const rcl_timer_call_info_t &) (this=0x559dddfa21d0) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/include/rclcpp/timer.hpp:306
#20 0x00007fe64809b077 in rclcpp::GenericTimer<composition::Talker::Talker(const rclcpp::NodeOptions&)::<lambda()>, 0>::execute_callback(const std::shared_ptr<void> &) (this=0x559dddfa21d0, data=std::shared_ptr<void> (use count 1, weak count 0) = {...}) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/include/rclcpp/timer.hpp:292
#21 0x00007fe64cae4044 in rclcpp::Executor::execute_timer (timer=warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<rclcpp::WallTimer<composition::Talker::Talker(rclcpp::NodeOptions const&)::{lambda()#1}, (void*)0>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<rclcpp::WallTimer<composition::Talker::Talker(rclcpp::NodeOptions const&)::{lambda()#1}, (void*)0>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr<rclcpp::TimerBase> (use count 4, weak count 3) = {...}, data_ptr=std::shared_ptr<void> (use count 1, weak count 0) = {...}) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:629
#22 0x00007fe64cae311e in rclcpp::Executor::execute_any_executable (this=0x559ddd95f2a0, any_exec=...) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:465
#23 0x00007fe64cb3b042 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x559ddd95f2a0) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:42
#24 0x0000559ddc0d5636 in main (argc=1, argv=0x7ffc82826cb8) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp_components/src/component_container.cpp:28
...
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fe63b9fc6c0 (LWP 3416584))]
#0  futex_wait (private=0, expected=2, futex_word=0x559ddd95b6f0) at ../sysdeps/nptl/futex-internal.h:146
warning: 146    ../sysdeps/nptl/futex-internal.h: No such file or directory
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x559ddd95b6f0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x559ddd95b6f0, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007fe64bb8e147 in lll_mutex_lock_optimized (mutex=0x559ddd95b6f0) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x559ddd95b6f0) at ./nptl/pthread_mutex_lock.c:128
#4  0x00007fe64d0b3162 in __gthread_mutex_lock (__mutex=0x559ddd95b6f0) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#5  0x00007fe64d0b31b2 in __gthread_recursive_mutex_lock (__mutex=0x559ddd95b6f0) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:811
#6  0x00007fe64d0b3822 in std::recursive_mutex::lock (this=0x559ddd95b6f0) at /usr/include/c++/13/mutex:120
#7  0x00007fe64d0b6b3c in std::lock_guard<std::recursive_mutex>::lock_guard (this=0x7fe63b9ee770, __m=...) at /usr/include/c++/13/bits/std_mutex.h:249
#8  0x00007fe64cac1c0b in rclcpp_logging_output_handler (location=0x7fe64a62def0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7fe64a5f36fa "rmw_zenoh_cpp", timestamp=1718042153754446017, format=0x7fe64a5f3e68 "Received liveliness token to remove unknown node /%s from the graph. Ignoring...", args=0x7fe63b9ee830)
    at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/context.cpp:133
#9  0x00007fe64bfcf427 in vrcutils_log_internal (location=0x7fe64a62def0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7fe64a5f36fa "rmw_zenoh_cpp", format=0x7fe64a5f3e68 "Received liveliness token to remove unknown node /%s from the graph. Ignoring...", args=0x7fe63b9ee830)
    at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1212
#10 0x00007fe64bfcf607 in rcutils_log_internal (location=0x7fe64a62def0 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7fe64a5f36fa "rmw_zenoh_cpp", format=0x7fe64a5f3e68 "Received liveliness token to remove unknown node /%s from the graph. Ignoring...")
    at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1236
#11 0x00007fe64a5833f6 in rmw_zenoh_cpp::GraphCache::parse_del (this=0x559ddd97c8b0, keyexpr="@ros2_lv/0/5fd5376fb274d9ef1288575539b7c2/0/3/SS/%/_ros2cli_3420537/%_ros2cli_3420537%get_type_description/type_description_interfaces::srv::dds_::GetTypeDescription_/1:2:1,10", ignore_from_current_session=false) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/detail/graph_cache.cpp:584
#12 0x00007fe64a5d067a in graph_sub_data_handler (sample=0x7fe63b9efdf0, data=0x559ddd946e60) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_init.cpp:73
#13 0x00007fe648c4c070 in zenohcd::closures::sample_closure::z_closure_sample_call (closure=0x559ddd95ec50, sample=0x7fe63b9efdf0) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/closures/sample_closure.rs:52
#14 0x00007fe648c50bf1 in zenohcd::liveliness::zc_liveliness_declare_subscriber::{closure#0} (sample=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:186
#15 0x00007fe649079616 in zenoh::session::Session::handle_data (self=0x559ddd953c40, local=false, key_expr=0x7fe63b9f3650, info=..., payload=..., attachment=...) at src/session.rs:1712
#16 0x00007fe649083e71 in zenoh::session::{impl#13}::send_declare (self=0x559ddd953c40, msg=...) at src/session.rs:2195
#17 0x00007fe64908f4a4 in zenoh::session::{impl#16}::send_declare (self=0x559ddd953c40, ctx=...) at src/session.rs:2701
#18 0x00007fe648ecc458 in zenoh::net::routing::dispatcher::tables::close_face::{closure#0} (p=0x559ddd953d40, m=...) at src/net/routing/dispatcher/tables.rs:177
#19 0x00007fe648f52cf0 in zenoh::net::routing::hat::p2p_peer::pubsub::propagate_forget_simple_subscription (tables=0x559ddd94cb60, res=0x7fe63b9f5bd0, send_declare=...) at src/net/routing/hat/p2p_peer/pubsub.rs:192
#20 0x00007fe648f535f7 in zenoh::net::routing::hat::p2p_peer::pubsub::undeclare_client_subscription (tables=0x559ddd94cb60, face=0x7fe63b9f59a8, res=0x7fe63b9f5bd0, send_declare=...) at src/net/routing/hat/p2p_peer/pubsub.rs:226
#21 0x00007fe648ed379c in zenoh::net::routing::hat::p2p_peer::{impl#1}::close_face (self=0x1, tables=0x559ddd94cb50, face=0x7fe63b9f6630, send_declare=...) at src/net/routing/hat/p2p_peer/mod.rs:195
#22 0x00007fe648ecc371 in zenoh::net::routing::dispatcher::tables::close_face (tables=0x559ddd94cb50, face=0x7fe63b9f6b78) at src/net/routing/dispatcher/tables.rs:177
#23 0x00007fe648e58022 in zenoh::net::routing::dispatcher::face::{impl#4}::send_close (self=0x7fe6342a7de0) at src/net/routing/dispatcher/face.rs:320
#24 0x00007fe648e53eea in zenoh::net::primitives::demux::{impl#1}::closing (self=0x7fe6342a7de0) at src/net/primitives/demux.rs:96
#25 0x00007fe648ee3261 in zenoh::net::runtime::{impl#6}::closing (self=0x7fe6342958d0) at src/net/runtime/mod.rs:403
#26 0x00007fe64926785e in zenoh_transport::unicast::universal::transport::{impl#0}::delete::{async_fn#0} () at src/unicast/universal/transport.rs:138
#27 0x00007fe64926919b in zenoh_transport::unicast::universal::transport::{impl#0}::del_link::{async_fn#0} () at src/unicast/universal/transport.rs:202
#28 0x00007fe649214a2c in zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block#0}::{async_block#0} () at src/unicast/universal/link.rs:134
...
(gdb) up 3
#3  ___pthread_mutex_lock (mutex=0x559ddd95b6f0) at ./nptl/pthread_mutex_lock.c:128
warning: 128    ./nptl/pthread_mutex_lock.c: No such file or directory
(gdb) print mutex.__data.__owner
$1 = 3416556
(gdb) thread find 3416556
Thread 1 has target id 'Thread 0x7fe64b65b8c0 (LWP 3416556)'

Thread 1 wants the read lock in frame 5. https://github.com/eclipse-zenoh/zenoh/blob/d070e7ce777222692d34f466c0f1a95f0de2f068/zenoh/src/net/routing/dispatcher/pubsub.rs#L451

Thread 1 has already acquired the logging_mutex in rclcpp_logging_output_handler() (frame 14), since it has progressed to calling rcl_logging_multiple_output_handler() (frame 13): https://github.com/ros2/rclcpp/blob/8ab10aabc0304fe2a26f54a67274cdbc51b65602/rclcpp/src/rclcpp/context.cpp#L134

Thread 5 wants to acquire the logging_mutex in rclcpp_logging_output_handler() (frame 8): https://github.com/ros2/rclcpp/blob/8ab10aabc0304fe2a26f54a67274cdbc51b65602/rclcpp/src/rclcpp/context.cpp#L133

Thread 5 holds the write lock taken in close_face() (frame 21) before that function calls undeclare_client_subscription() (frame 20): https://github.com/eclipse-zenoh/zenoh/blob/d070e7ce777222692d34f466c0f1a95f0de2f068/zenoh/src/net/routing/hat/p2p_peer/mod.rs#L172

I don't know the codebase well enough yet, and I haven't found whether thread 5 is supposed to drop the write lock and/or ctrl_lock further along the call chain.

Other threads:

(gdb) thread 6
[Switching to thread 6 (Thread 0x7fe63bbfd6c0 (LWP 3416583))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
warning: 38 ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fe648b9c954 in std::sys::pal::unix::futex::futex_wait () at library/std/src/sys/pal/unix/futex.rs:62
#2  std::sys::pal::unix::locks::futex_mutex::Mutex::lock_contended () at library/std/src/sys/pal/unix/locks/futex_mutex.rs:56
#3  0x00007fe648e05058 in std::sys::pal::unix::locks::futex_mutex::Mutex::lock (self=0x559ddd94cc40) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys/pal/unix/locks/futex_mutex.rs:28
#4  0x00007fe649036425 in std::sync::mutex::Mutex<alloc::boxed::Box<(dyn zenoh::net::routing::hat::HatTrait + core::marker::Send + core::marker::Sync), alloc::alloc::Global>>::lock<alloc::boxed::Box<(dyn zenoh::net::routing::hat::HatTrait + core::marker::Send + core::marker::Sync), alloc::alloc::Global>> (self=0x559ddd94cc40) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sync/mutex.rs:273
#5  0x00007fe648e55699 in zenoh::net::routing::dispatcher::face::{impl#4}::send_declare (self=0x7fe634006a00, msg=...) at src/net/routing/dispatcher/face.rs:189
#6  0x00007fe648e5355e in zenoh::net::primitives::demux::{impl#1}::handle_message (self=0x7fe634006a00, msg=...) at src/net/primitives/demux.rs:69
#7  0x00007fe648ee2ece in zenoh::net::runtime::{impl#6}::handle_message (self=0x7fe634009370, msg=...) at src/net/runtime/mod.rs:385
#8  0x00007fe64926a1b9 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::trigger_callback (self=0x7fe63400a840, callback=..., msg=...) at src/unicast/universal/rx.rs:49
#9  0x00007fe64926b226 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::handle_frame (self=0x7fe63400a840, frame=...) at src/unicast/universal/rx.rs:100
#10 0x00007fe64926e27c in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::read_messages (self=0x7fe63400a840, batch=..., link=0x7fe63400a908) at src/unicast/universal/rx.rs:204
#11 0x00007fe64921769e in zenoh_transport::unicast::universal::link::rx_task::{async_fn#0} () at src/unicast/universal/link.rs:266
#12 0x00007fe649213f35 in zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block#0} () at src/unicast/universal/link.rs:124
#13 0x00007fe6491d8d74 in tokio_util::task::task_tracker::{impl#8}::poll<zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block_env#0}> (self=..., cx=0x7fe63bbfa4f0) at <***removed***>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-util-0.7.10/src/task/task_tracker.rs:669
...
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fe64839e6c0 (LWP 3416579))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
warning: 38 ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fe648b9c954 in std::sys::pal::unix::futex::futex_wait () at library/std/src/sys/pal/unix/futex.rs:62
#2  std::sys::pal::unix::locks::futex_mutex::Mutex::lock_contended () at library/std/src/sys/pal/unix/locks/futex_mutex.rs:56
#3  0x00007fe648e05058 in std::sys::pal::unix::locks::futex_mutex::Mutex::lock (self=0x559ddd94cc40) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys/pal/unix/locks/futex_mutex.rs:28
#4  0x00007fe649036425 in std::sync::mutex::Mutex<alloc::boxed::Box<(dyn zenoh::net::routing::hat::HatTrait + core::marker::Send + core::marker::Sync), alloc::alloc::Global>>::lock<alloc::boxed::Box<(dyn zenoh::net::routing::hat::HatTrait + core::marker::Send + core::marker::Sync), alloc::alloc::Global>> (self=0x559ddd94cc40) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sync/mutex.rs:273
#5  0x00007fe648ed74fe in zenoh::net::routing::router::Router::new_transport_unicast (self=0x559ddd97f030, transport=...) at src/net/routing/router.rs:119
#6  0x00007fe648ee2244 in zenoh::net::runtime::{impl#5}::new_unicast (self=0x559ddd97cf50, peer=..., transport=...) at src/net/runtime/mod.rs:338
#7  0x00007fe6491bdbd0 in zenoh_transport::manager::TransportManager::notify_new_transport_unicast (self=0x7fe63c3d0478, transport=0x7fe63c3d0040) at src/unicast/manager.rs:490
#8  0x00007fe6491c056b in zenoh_transport::unicast::manager::{impl#2}::init_new_transport_unicast::{async_fn#0} () at src/unicast/manager.rs:580
#9  0x00007fe6491c20e5 in zenoh_transport::unicast::manager::{impl#2}::init_transport_unicast::{async_fn#0} () at src/unicast/manager.rs:654
#10 0x00007fe6492843a3 in zenoh_transport::unicast::establishment::accept::accept_link::{async_fn#0} () at src/unicast/establishment/accept.rs:732
#11 0x00007fe6491f4f69 in tokio::time::timeout::{impl#1}::poll<zenoh_transport::unicast::establishment::accept::accept_link::{async_fn_env#0}> (self=..., cx=0x7fe64839b4f0) at <***removed***>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/time/timeout.rs:202
#12 0x00007fe6491c4acf in zenoh_transport::unicast::manager::{impl#2}::handle_new_link_unicast::{async_fn#0}::{async_block#0} () at src/unicast/manager.rs:753
#13 0x00007fe64921a203 in futures_util::future::future::map::{impl#2}::poll<zenoh_transport::unicast::manager::{impl#2}::handle_new_link_unicast::{async_fn#0}::{async_block_env#0}, zenoh_task::{impl#1}::spawn_with_rt::{closure_env#0}<zenoh_transport::unicast::manager::{impl#2}::handle_new_link_unicast::{async_fn#0}::{async_block_env#0}, ()>, ()> (self=..., cx=0x7fe64839b4f0)
...
(gdb) thread 9
[Switching to thread 9 (Thread 0x7fe6487a06c0 (LWP 3416577))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38  in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fe648b9c954 in std::sys::pal::unix::futex::futex_wait () at library/std/src/sys/pal/unix/futex.rs:62
#2  std::sys::pal::unix::locks::futex_mutex::Mutex::lock_contended () at library/std/src/sys/pal/unix/locks/futex_mutex.rs:56
#3  0x00007fe648e05058 in std::sys::pal::unix::locks::futex_mutex::Mutex::lock (self=0x559ddd94cc40) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys/pal/unix/locks/futex_mutex.rs:28
#4  0x00007fe649036425 in std::sync::mutex::Mutex<alloc::boxed::Box<(dyn zenoh::net::routing::hat::HatTrait + core::marker::Send + core::marker::Sync), alloc::alloc::Global>>::lock<alloc::boxed::Box<(dyn zenoh::net::routing::hat::HatTrait + core::marker::Send + core::marker::Sync), alloc::alloc::Global>> (self=0x559ddd94cc40) at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sync/mutex.rs:273
#5  0x00007fe648ecc20a in zenoh::net::routing::dispatcher::tables::close_face (tables=0x559ddd94cb50, face=0x7fe64879ab78) at src/net/routing/dispatcher/tables.rs:177
#6  0x00007fe648e58022 in zenoh::net::routing::dispatcher::face::{impl#4}::send_close (self=0x7fe63c006e80) at src/net/routing/dispatcher/face.rs:320
#7  0x00007fe648e53eea in zenoh::net::primitives::demux::{impl#1}::closing (self=0x7fe63c006e80) at src/net/primitives/demux.rs:96
#8  0x00007fe648ee3261 in zenoh::net::runtime::{impl#6}::closing (self=0x7fe63c0099d0) at src/net/runtime/mod.rs:403
#9  0x00007fe64926785e in zenoh_transport::unicast::universal::transport::{impl#0}::delete::{async_fn#0} () at src/unicast/universal/transport.rs:138
#10 0x00007fe64926919b in zenoh_transport::unicast::universal::transport::{impl#0}::del_link::{async_fn#0} () at src/unicast/universal/transport.rs:202
#11 0x00007fe649213801 in zenoh_transport::unicast::universal::link::{impl#0}::start_tx::{async_block#0}::{async_block#0} () at src/unicast/universal/link.rs:106
#12 0x00007fe64922e71b in tokio::runtime::task::core::{impl#6}::poll::{closure#0}<zenoh_transport::unicast::universal::link::{impl#0}::start_tx::{async_block#0}::{async_block_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>> (ptr=0x7fe628002230) at <***removed***>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:328
...

Thread 6 in frame 5: https://github.com/eclipse-zenoh/zenoh/blob/d070e7ce777222692d34f466c0f1a95f0de2f068/zenoh/src/net/routing/dispatcher/face.rs#L189

Thread 8 in frame 5: https://github.com/eclipse-zenoh/zenoh/blob/d070e7ce777222692d34f466c0f1a95f0de2f068/zenoh/src/net/routing/router.rs#L119

Thread 9 in frame 5: https://github.com/eclipse-zenoh/zenoh/blob/d070e7ce777222692d34f466c0f1a95f0de2f068/zenoh/src/net/routing/dispatcher/tables.rs#L177

Also, is this referencing the correct issue number? https://github.com/ros2/rmw_zenoh/pull/200/commits/614954c248bdb7e528dca116512279edf7937a12#diff-7d5bcea24badbde5330943e2f5ff4c2234863cb59a9f75c049d1b659620d0ef9L24

My environment:

$ echo $ROS_DISTRO
jazzy
$ ls -1 ~/.cargo/git/checkouts/
zenoh-cc237f2570fab813
$ ls -1 ~/.cargo/git/checkouts/zenoh-cc237f2570fab813/
d070e7c
$ rustc -V
rustc 1.77.2 (25ef9e3d8 2024-04-09)
$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 24.04 LTS
Release:    24.04
Codename:   noble

rclcpp: 8ab10aabc0304fe2a26f54a67274cdbc51b65602

JEnoch commented 3 weeks ago

Thanks for your testing and quick feedback. We'll analyse your backtraces.

I don't know the codebase well enough yet and haven't found if Thread 5 is supposed to drop the write lock and/or ctrl_lock further in the call chain.

The purpose of the patch is precisely to release the ctrl_lock before propagating the liveliness message to the callbacks in rmw_zenoh. So either the patch is not complete, or another lock is the cause. @OlivierHecart , isn't the problem in this line, which still takes the ctrl_lock while calling send_declare(): https://github.com/eclipse-zenoh/zenoh/blob/d070e7ce777222692d34f466c0f1a95f0de2f068/zenoh/src/net/routing/dispatcher/tables.rs#L177
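The intended pattern is roughly the following (a generic C++ sketch with hypothetical names, not the actual zenoh or rmw_zenoh code): snapshot whatever the callbacks need while the lock is held, drop the lock, and only then call out to user code.

#include <functional>
#include <mutex>
#include <string>
#include <vector>

struct LivelinessState {
  std::mutex ctrl_lock;  // stands in for zenoh's ctrl_lock
  std::vector<std::function<void(const std::string &)>> callbacks;
  std::vector<std::string> tokens;

  // Deadlock-prone: user callbacks run while ctrl_lock is held, so any
  // re-entrant call that needs ctrl_lock again blocks forever.
  void notify_holding_lock(const std::string & token) {
    std::lock_guard<std::mutex> lock(ctrl_lock);
    tokens.push_back(token);
    for (auto & cb : callbacks) {
      cb(token);  // user code runs under the lock
    }
  }

  // Intended shape: copy what the callbacks need under the lock, release it,
  // then invoke them without holding any internal lock.
  void notify_after_unlock(const std::string & token) {
    std::vector<std::function<void(const std::string &)>> to_call;
    {
      std::lock_guard<std::mutex> lock(ctrl_lock);
      tokens.push_back(token);
      to_call = callbacks;  // snapshot under the lock
    }  // ctrl_lock released here
    for (auto & cb : to_call) {
      cb(token);  // user code runs without the lock
    }
  }
};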

Also, is this referencing the correct issue number? https://github.com/ros2/rmw_zenoh/commit/614954c248bdb7e528dca116512279edf7937a12#diff-7d5bcea24badbde5330943e2f5ff4c2234863cb59a9f75c049d1b659620d0ef9L24

It isn't since this patch is not in a pull request yet. I will update the comment once we have a zenoh PR number for it.

hwoithe commented 3 weeks ago

It isn't since this patch is not in a pull request yet. I will update the comment once we have a zenoh PR number for it.

Sounds great! I just wanted to make sure, since https://github.com/ros2/rmw_zenoh/issues/198 refers to a benchmarking issue. Thank you for the quick response.

JEnoch commented 3 weeks ago

There were indeed some oversights in the ctrl_lock dropping. Those are now fixed in https://github.com/eclipse-zenoh/zenoh/commit/5191af258a1908dd2f28601ef4410554e97bb7e9. I updated #200 to use https://github.com/eclipse-zenoh/zenoh-c/commit/87b515b4c3dd028f9e9476e3bb60016f7acc3489, which depends on https://github.com/eclipse-zenoh/zenoh/commit/5191af258a1908dd2f28601ef4410554e97bb7e9.

I tested your scenario several times with this and I still don't experience any deadlock. Hopefully it will work for you now! 🤞 😅

hwoithe commented 3 weeks ago

@JEnoch I am now able to make more progress with https://github.com/ros2/rmw_zenoh/pull/200. Thank you! Unfortunately, I'm still experiencing two scenarios which I will describe in separate comments.

hwoithe commented 3 weeks ago

This is in Jazzy with rmw_zenoh f697984f66373916bdc856f1acd213de16f53689 (https://github.com/ros2/rmw_zenoh/pull/200). The deadlock occurs in a node loading a component in the bash for loop.

Debugging the node:

$ sudo gdb attach -p 370800
...
(gdb) info threads
  Id   Target Id                                     Frame 
* 1    Thread 0x7f5d575d5b80 (LWP 370800) "ros2"     futex_wait (private=0, expected=2, futex_word=0x1ce2598) at ../sysdeps/nptl/futex-internal.h:146
  2    Thread 0x7f5d46bf86c0 (LWP 370812) "ros2"     0x00007f5d5766fd61 in __futex_abstimed_wait_common64 (private=1186954624, cancel=true, abstime=0x7f5d46bf7e50, op=137, expected=0, futex_word=0x2628d00) at ./nptl/futex-internal.c:57
  3    Thread 0x7f5d46df96c0 (LWP 370811) "rx-1"     futex_wait (private=0, expected=2, futex_word=0x1ce2598) at ../sysdeps/nptl/futex-internal.h:146
  4    Thread 0x7f5d46ffa6c0 (LWP 370810) "rx-0"     futex_wait (private=0, expected=2, futex_word=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at ../sysdeps/nptl/futex-internal.h:146
  5    Thread 0x7f5d471fb6c0 (LWP 370809) "tx-0"     0x00007f5d57701042 in epoll_wait (epfd=16, events=0x7f5d3800a410, maxevents=1024, timeout=2049) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  6    Thread 0x7f5d475fd6c0 (LWP 370807) "acc-0"    0x00007f5d57701042 in epoll_wait (epfd=12, events=0x2396020, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  7    Thread 0x7f5d4c3056c0 (LWP 370805) "net-0"    0x00007f5d57701042 in epoll_wait (epfd=8, events=0x2386360, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  8    Thread 0x7f5d4c5066c0 (LWP 370804) "app-0"    0x00007f5d57701042 in epoll_wait (epfd=3, events=0x2383350, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  9    Thread 0x7f5d47fff6c0 (LWP 370803) "ros2"     0x00007f5d5766fd61 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f5d567ad580 <(anonymous namespace)::g_signal_handler_sem>) at ./nptl/futex-internal.c:57
  10   Thread 0x7f5d54d846c0 (LWP 370802) "ros2-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  11   Thread 0x7f5d555856c0 (LWP 370801) "ros2-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f5d575d5b80 (LWP 370800))]
#0  futex_wait (private=0, expected=2, futex_word=0x1ce2598) at ../sysdeps/nptl/futex-internal.h:146
146 in ../sysdeps/nptl/futex-internal.h
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x1ce2598) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x1ce2598, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f5d576770f1 in lll_mutex_lock_optimized (mutex=0x1ce2598) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x1ce2598) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007f5d544cfc52 in __gthread_mutex_lock (__mutex=0x1ce2598) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#5  0x00007f5d544cfca6 in std::mutex::lock (this=0x1ce2598) at /usr/include/c++/13/bits/std_mutex.h:113
#6  0x00007f5d544cfe88 in std::lock_guard<std::mutex>::lock_guard (this=0x7fff9cb6c1d8, __m=...) at /usr/include/c++/13/bits/std_mutex.h:249
#7  0x00007f5d544d5eb5 in rmw_zenoh_cpp::GraphCache::parse_put (this=0x1ce2430, keyexpr="@ros2_lv/0/46f1237e44d9ea87e3aa8bf7ca9ccc/0/1/MP/%/_ros2cli_370800/%rosout/rcl_interfaces::msg::dds_::Log_/1:1:1,1000", ignore_from_current_session=false) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/detail/graph_cache.cpp:347
#8  0x00007f5d545255fc in graph_sub_data_handler (sample=0x7fff9cb6d6a0, data=0x2363240) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_init.cpp:70
#9  0x00007f5d4e76a91d in zenohcd::closures::sample_closure::z_closure_sample_call (closure=0x23bf0f0, sample=0x7fff9cb6d6a0) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/closures/sample_closure.rs:52
#10 0x00007f5d4e723d71 in zenohcd::liveliness::zc_liveliness_declare_subscriber::{closure#0} (sample=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:186
#11 0x00007f5d4e921558 in zenoh::session::Session::handle_data (self=0x1e16380, local=false, key_expr=0x7fff9cb70b00, info=..., payload=..., attachment=...) at src/session.rs:1712
#12 0x00007f5d4e92ab20 in zenoh::session::{impl#13}::send_declare (self=0x1e16380, msg=...) at src/session.rs:2158
#13 0x00007f5d4e9372a4 in zenoh::session::{impl#16}::send_declare (self=0x1e16380, ctx=...) at src/session.rs:2701
#14 0x00007f5d4e9b9af6 in zenoh::net::routing::dispatcher::face::{impl#4}::send_declare (self=0x1d65740, msg=...) at src/net/routing/dispatcher/face.rs:210
#15 0x00007f5d4e961020 in zenoh::session::Session::declare_liveliness_inner (self=0x222fc40, key_expr=0x7fff9cb73760) at src/session.rs:1385
#16 0x00007f5d4e7422aa in zenoh::liveliness::{impl#10}::res_sync (self=...) at <***removed***>/.cargo/git/checkouts/zenoh-cc237f2570fab813/5191af2/zenoh/src/liveliness.rs:241
#17 0x00007f5d4e6ed6ed in zenoh_core::SyncResolve::res<zenoh::liveliness::LivelinessTokenBuilder> (self=<error reading variable: Cannot access memory at address 0x80>) at <***removed***>/.cargo/git/checkouts/zenoh-cc237f2570fab813/5191af2/commons/zenoh-core/src/lib.rs:57
#18 0x00007f5d4e6e43c7 in zenohcd::liveliness::zc_liveliness_declare_token (session=..., key=..., _options=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:102
#19 0x00007f5d5452d1c1 in rmw_create_publisher (node=0x26281b0, type_supports=0x7f5d557a4b60, topic_name=0x2633010 "/rosout", qos_profile=0x7fff9cb77870, publisher_options=0x7fff9cb778f0) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:643
#20 0x00007f5d563ef97a in rcl_publisher_init (publisher=0x7fff9cb77868, node=0x23bf790, type_support=0x7f5d557a4b60, topic_name=0x7f5d5640ef75 "/rosout", options=0x7fff9cb77870) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/publisher.c:111
#21 0x00007f5d563e8907 in rcl_logging_rosout_init_publisher_for_node (node=0x23bf790) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/logging_rosout.c:252
#22 0x00007f5d565e2f09 in rclpy::Node::Node (this=0x23b37b0, node_name=0x7fff9cb785e0 "_ros2cli_370800", namespace_=0x7fff9cb785b8 "", context=..., pycli_args=..., use_global_arguments=true, enable_rosout=true) at <***removed***>/ws_rmw_zenoh/src/rclpy/rclpy/src/rclpy/node.cpp:522
...
(gdb) up 3
#3  ___pthread_mutex_lock (mutex=0x1ce2598) at ./nptl/pthread_mutex_lock.c:93
warning: 93 ./nptl/pthread_mutex_lock.c: No such file or directory
(gdb) print mutex.__data.__owner
$4 = 370810
(gdb) thread find 370810
Thread 4 has target id 'Thread 0x7f5d46ffa6c0 (LWP 370810)'
...
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f5d46ffa6c0 (LWP 370810))]
#0  futex_wait (private=0, expected=2, futex_word=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at ../sysdeps/nptl/futex-internal.h:146
warning: 146    ../sysdeps/nptl/futex-internal.h: No such file or directory
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f5d57677147 in lll_mutex_lock_optimized (mutex=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at ./nptl/pthread_mutex_lock.c:128
#4  0x00007f5d565dbff4 in __gthread_mutex_lock (__mutex=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#5  0x00007f5d565dc015 in __gthread_recursive_mutex_lock (__mutex=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:811
#6  0x00007f5d565dc034 in std::recursive_mutex::lock (this=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at /usr/include/c++/13/mutex:120
#7  0x00007f5d565dc078 in std::lock_guard<std::recursive_mutex>::lock_guard (this=0x7f5d46fecee8, __m=...) at /usr/include/c++/13/bits/std_mutex.h:249
#8  0x00007f5d565db584 in rclpy::LoggingGuard::LoggingGuard (this=0x7f5d46fecee8) at <***removed***>/ws_rmw_zenoh/src/rclpy/rclpy/src/rclpy/logging.cpp:33
#9  0x00007f5d565db5c8 in rclpy::rclpy_thread_safe_logging_output_handler (location=0x7f5d54582f10 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f5d545486ed "rmw_zenoh_cpp", timestamp=1718146187547300664, 
    format=0x7f5d54548f58 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...", args=0x7f5d46fecfa0) at <***removed***>/ws_rmw_zenoh/src/rclpy/rclpy/src/rclpy/logging.cpp:51
#10 0x00007f5d55ac5427 in vrcutils_log_internal (location=0x7f5d54582f10 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f5d545486ed "rmw_zenoh_cpp", 
    format=0x7f5d54548f58 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...", args=0x7f5d46fecfa0) at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1212
#11 0x00007f5d55ac5607 in rcutils_log_internal (location=0x7f5d54582f10 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f5d545486ed "rmw_zenoh_cpp", 
    format=0x7f5d54548f58 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...") at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1236
#12 0x00007f5d544d869b in rmw_zenoh_cpp::GraphCache::parse_del (this=0x1ce2430, keyexpr="@ros2_lv/0/ea706a43734f33c80261c91af7451a/595/595/NN/%talker51/talker", ignore_from_current_session=false) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/detail/graph_cache.cpp:603
#13 0x00007f5d5452567a in graph_sub_data_handler (sample=0x7f5d46fee560, data=0x2363240) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_init.cpp:73
#14 0x00007f5d4e76a91d in zenohcd::closures::sample_closure::z_closure_sample_call (closure=0x23bf0f0, sample=0x7f5d46fee560) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/closures/sample_closure.rs:52
#15 0x00007f5d4e723d71 in zenohcd::liveliness::zc_liveliness_declare_subscriber::{closure#0} (sample=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:186
#16 0x00007f5d4e921558 in zenoh::session::Session::handle_data (self=0x1e16380, local=false, key_expr=0x7f5d46ff1dc0, info=..., payload=..., attachment=...) at src/session.rs:1712
#17 0x00007f5d4e92bcd4 in zenoh::session::{impl#13}::send_declare (self=0x1e16380, msg=...) at src/session.rs:2195
#18 0x00007f5d4e9372a4 in zenoh::session::{impl#16}::send_declare (self=0x1e16380, ctx=...) at src/session.rs:2701
#19 0x00007f5d4e9b9f10 in zenoh::net::routing::dispatcher::face::{impl#4}::send_declare (self=0x7f5d38006d90, msg=...) at src/net/routing/dispatcher/face.rs:225
#20 0x00007f5d4eb53c0e in zenoh::net::primitives::demux::{impl#1}::handle_message (self=0x7f5d38006d90, msg=...) at src/net/primitives/demux.rs:69
#21 0x00007f5d4ea9a73e in zenoh::net::runtime::{impl#6}::handle_message (self=0x7f5d38009970, msg=...) at src/net/runtime/mod.rs:385
#22 0x00007f5d4ec62539 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::trigger_callback (self=0x7f5d38136a40, callback=..., msg=...) at src/unicast/universal/rx.rs:49
#23 0x00007f5d4ec635a3 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::handle_frame (self=0x7f5d38136a40, frame=...) at src/unicast/universal/rx.rs:100
#24 0x00007f5d4ec665e1 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::read_messages (self=0x7f5d38136a40, batch=..., link=0x7f5d38136b08) at src/unicast/universal/rx.rs:204
#25 0x00007f5d4ed46e5e in zenoh_transport::unicast::universal::link::rx_task::{async_fn#0} () at src/unicast/universal/link.rs:266
#26 0x00007f5d4ed43795 in zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block#0} () at src/unicast/universal/link.rs:124
#27 0x00007f5d4eca8704 in tokio_util::task::task_tracker::{impl#8}::poll<zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block_env#0}> (self=..., cx=0x7f5d46ff8618) at <***removed***>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-util-0.7.10/src/task/task_tracker.rs:669
...
(gdb) up 3
#3  ___pthread_mutex_lock (mutex=0x7f5d567acde0 <rclpy::LoggingGuard::logging_mutex_>) at ./nptl/pthread_mutex_lock.c:128
warning: 128    ./nptl/pthread_mutex_lock.c: No such file or directory
(gdb) print mutex.__data.__owner
$3 = 370800
(gdb) thread find 370800
Thread 1 has target id 'Thread 0x7f5d575d5b80 (LWP 370800)'
...

Thread 1 wants to acquire graph_mutex_ in frame 7: https://github.com/ros2/rmw_zenoh/blob/f697984f66373916bdc856f1acd213de16f53689/rmw_zenoh_cpp/src/detail/graph_cache.cpp#L347

Thread 4 wants to acquire scoped_logging_guard in frame 9: https://github.com/ros2/rclpy/blob/e181737f19d6b8cb335b0326a40ec2075267c555/rclpy/src/rclpy/logging.cpp#L51

Thread 4 has acquired graph_mutex_ in frame 12 early in the parse_del() function: https://github.com/ros2/rmw_zenoh/blob/f697984f66373916bdc856f1acd213de16f53689/rmw_zenoh_cpp/src/detail/graph_cache.cpp#L564

Thread 1 has acquired scoped_logging_guard in frame 22: https://github.com/ros2/rclpy/blob/e181737f19d6b8cb335b0326a40ec2075267c555/rclpy/src/rclpy/node.cpp#L521
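In other words, this looks like a classic AB/BA lock-order inversion between graph_mutex_ and the logging mutex. A minimal C++ sketch of the shape (hypothetical names, not the actual rmw_zenoh/rclpy code):

#include <mutex>
#include <thread>

std::mutex graph_mutex;              // stands in for GraphCache::graph_mutex_
std::recursive_mutex logging_mutex;  // stands in for the rclpy/rclcpp logging mutex

// Thread 1's shape: the logging guard is already held (e.g. while creating the
// /rosout publisher), then the graph cache update needs the graph mutex.
void create_publisher_path() {
  std::lock_guard<std::recursive_mutex> log_guard(logging_mutex);
  std::lock_guard<std::mutex> graph_guard(graph_mutex);  // blocks if the other thread holds it
  // ... update the graph cache ...
}

// Thread 4's shape: the liveliness callback takes the graph mutex first,
// then tries to log a warning, which needs the logging mutex.
void liveliness_callback_path() {
  std::lock_guard<std::mutex> graph_guard(graph_mutex);
  std::lock_guard<std::recursive_mutex> log_guard(logging_mutex);  // blocks if the other thread holds it
  // ... log "Received liveliness token to remove node ..." ...
}

int main() {
  // With enough concurrent runs the two paths interleave so that each thread
  // holds the lock the other one needs, and both block forever.
  std::thread t1(create_publisher_path);
  std::thread t2(liveliness_callback_path);
  t1.join();
  t2.join();
}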

rmw_zenoh: f697984f66373916bdc856f1acd213de16f53689 (https://github.com/ros2/rmw_zenoh/pull/200) rclpy: e181737f19d6b8cb335b0326a40ec2075267c555

hwoithe commented 3 weeks ago

This is the second scenario in Jazzy with rmw_zenoh f697984f66373916bdc856f1acd213de16f53689 (https://github.com/ros2/rmw_zenoh/pull/200), where a node loading a component in the bash for loop hangs. It gets stuck in rmw_wait(). I'll have to check this against the other rmw_wait() related issues and pull requests that are currently open to see how closely they match this one.
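I haven't confirmed the cause yet; as a reference point only, a wait with an infinite timeout typically hangs like this when the condition is never signalled or the wakeup is missed. A minimal C++ sketch of the classic lost-wakeup shape (hypothetical code, not rmw_zenoh's rmw_wait):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool data_ready = false;  // the predicate that makes the wait safe

void notifier() {
  {
    std::lock_guard<std::mutex> lock(m);
    data_ready = true;
  }
  cv.notify_one();  // if the waiter isn't waiting yet, this signal is "lost"
}

void waiter_bad() {
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock);  // no predicate: sleeps forever if notify_one() already ran
}

void waiter_good() {
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [] { return data_ready; });  // re-checks the condition, so a missed signal is harmless
}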

$ sudo gdb attach -p `pgrep -f talker80`
...
(gdb) info threads
  Id   Target Id                                     Frame 
* 1    Thread 0x7f8e1a5afb80 (LWP 966966) "ros2"     0x00007f8e1a649d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x3279be8) at ./nptl/futex-internal.c:57
  2    Thread 0x7f8e0db106c0 (LWP 966978) "ros2"     0x00007f8e1a649d61 in __futex_abstimed_wait_common64 (private=229703040, cancel=true, abstime=0x7f8e0db0fe50, op=137, expected=0, futex_word=0x31462f0) at ./nptl/futex-internal.c:57
  3    Thread 0x7f8e0de9a6c0 (LWP 966977) "rx-1"     0x00007f8e1a6db042 in epoll_wait (epfd=19, events=0x7f8dfc131e10, maxevents=1024, timeout=6753) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  4    Thread 0x7f8e0e09b6c0 (LWP 966976) "rx-0"     syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  5    Thread 0x7f8e0e29c6c0 (LWP 966975) "tx-0"     0x00007f8e1a6db042 in epoll_wait (epfd=16, events=0x7f8dfc00a290, maxevents=1024, timeout=2009) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  6    Thread 0x7f8e0e69e6c0 (LWP 966973) "acc-0"    0x00007f8e1a6db042 in epoll_wait (epfd=12, events=0x30f1650, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  7    Thread 0x7f8e0eaa06c0 (LWP 966971) "net-0"    0x00007f8e1a6db042 in epoll_wait (epfd=8, events=0x3103180, maxevents=1024, timeout=2147483647) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  8    Thread 0x7f8e0eca16c0 (LWP 966970) "app-0"    0x00007f8e1a6db042 in epoll_wait (epfd=3, events=0x3100170, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  9    Thread 0x7f8e0f4a26c0 (LWP 966969) "ros2"     0x00007f8e1a649d61 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f8e19789580 <(anonymous namespace)::g_signal_handler_sem>) at ./nptl/futex-internal.c:57
  10   Thread 0x7f8e17d616c0 (LWP 966968) "ros2-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  11   Thread 0x7f8e185626c0 (LWP 966967) "ros2-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
...
(gdb) bt
#0  0x00007f8e1a649d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x3279be8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x3279be8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x3279be8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f8e1a64c7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x3279bf0, cond=0x3279bc0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x3279bc0, mutex=0x3279bf0) at ./nptl/pthread_cond_wait.c:627
#5  0x00007f8e175194c5 in rmw_wait (subscriptions=0x32c0308, guard_conditions=0x32c0320, services=0x32c0350, clients=0x32c0338, events=0x32c0368, wait_set=0x32ce6a0, wait_timeout=0x0) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:3334
#6  0x00007f8e193e1b1b in rcl_wait (wait_set=0x3298600, timeout=-1) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/wait.c:670
#7  0x00007f8e1966999d in rclpy::WaitSet::wait (this=0x325d020, timeout=-1) at <***removed***>/ws_rmw_zenoh/src/rclpy/rclpy/src/rclpy/wait_set.cpp:247
#8  0x00007f8e1966e08d in pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}::operator()(rclpy::WaitSet*, long) const (__closure=0x2b227c8, c=0x325d020, args#0=-1) at /usr/include/pybind11/pybind11.h:111
#9  0x00007f8e1967387b in pybind11::detail::argument_loader<rclpy::WaitSet*, long>::call_impl<void, pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&, 0ul, 1ul, pybind11::detail::void_type>(pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul>, pybind11::detail::void_type&&) && (this=0x7ffd239b50b0, f=...) at /usr/include/pybind11/cast.h:1480
#10 0x00007f8e19671da4 in pybind11::detail::argument_loader<rclpy::WaitSet*, long>::call<void, pybind11::detail::void_type, pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&>(pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&) && (this=0x7ffd239b50b0, f=...) at /usr/include/pybind11/cast.h:1454
#11 0x00007f8e19671004 in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}, void, rclpy::WaitSet*, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(pybind11::cpp_function::initialize<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&&, void (*)(rclpy::WaitSet*, long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const (__closure=0x0, call=...) at /usr/include/pybind11/pybind11.h:254
#12 0x00007f8e196710dc in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}, void, rclpy::WaitSet*, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(pybind11::cpp_function::initialize<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&&, void (*)(rclpy::WaitSet*, long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(pybind11::detail::function_call&)#1}::_FUN(pybind11::detail::function_call&) () at /usr/include/pybind11/pybind11.h:224
#13 0x00007f8e194daffe in pybind11::cpp_function::dispatcher (self=<PyCapsule at remote 0x7f8e198a64c0>, args_in=(<rclpy._rclpy_pybind11.WaitSet at remote 0x7f8e0d0e2f70>, -1), kwargs_in=0x0) at /usr/include/pybind11/pybind11.h:946
...
(gdb) py-bt
Traceback (most recent call first):
  <built-in method wait of PyCapsule object at remote 0x7f8e198a64c0>
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line ?, in _wait_for_ready_callbacks
    (failed to get frame line number)
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 781, in wait_for_ready_callbacks
    return next(self._cb_iter)
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 811, in _spin_once_impl
    handler, entity, node = self.wait_for_ready_callbacks(
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 835, in spin_once_until_future_complete
    self._spin_once_impl(timeout_sec, future.done)
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 321, in spin_until_future_complete
    self.spin_once_until_future_complete(future, timeout_sec)
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/__init__.py", line 266, in spin_until_future_complete
    executor.spin_until_future_complete(future, timeout_sec)
  File "/opt/ros/jazzy/lib/python3.12/site-packages/ros2component/api/__init__.py", line 247, in load_component_into_container
    rclpy.spin_until_future_complete(node, future)
  File "/opt/ros/jazzy/lib/python3.12/site-packages/ros2component/verb/load.py", line 49, in main
    component_uid, component_name = load_component_into_container(
  File "/opt/ros/jazzy/lib/python3.12/site-packages/ros2component/command/component.py", line 37, in main
    return extension.main(args=args)
  File "/opt/ros/jazzy/lib/python3.12/site-packages/ros2cli/cli.py", line 91, in main
    rc = extension.main(parser=parser, args=args)
  File "/opt/ros/jazzy/bin/ros2", line 33, in <module>
    sys.exit(load_entry_point('ros2cli==0.32.1', 'console_scripts', 'ros2')())

rcl: 00c29cd6e72caae36e1e749a8c02558ee709a275 rclpy: e181737f19d6b8cb335b0326a40ec2075267c555 rmw_zenoh: f697984f66373916bdc856f1acd213de16f53689 (https://github.com/ros2/rmw_zenoh/pull/200)

Yadunund commented 3 weeks ago

I've also tested the scenario several times with #200 + Rolling and I can't reproduce the problem anymore.

It could be that something changed between Rolling and Jazzy too? I'm looking into that now.

@JEnoch could we merge the relevant upstream zenoh PRs and update the hash in #200 accordingly so that we can get that in?

JEnoch commented 2 weeks ago

@Yadunund done! #200 is ready for review and merge.

hwoithe commented 2 weeks ago

I have not yet retested in Iron. In Jazzy, I am still experiencing deadlocks between the graph mutex and a logging mutex, similar to https://github.com/ros2/rmw_zenoh/issues/182#issuecomment-2161919040. It seems to be easier to trigger when using GNU parallel. In the debugging session below, the component_container node deadlocks, which means the logging mutex is in rclcpp instead of rclpy.

@Yadunund or @clalancette, could you please retest this with the parallel method in Jazzy?

Terminal 1:

ros2 run rmw_zenoh_cpp rmw_zenohd

Terminal 2:

export RMW_IMPLEMENTATION=rmw_zenoh_cpp
ros2 run rclcpp_components component_container

Terminal 3 (now uses parallel):

export RMW_IMPLEMENTATION=rmw_zenoh_cpp
for i in {1..100}; do echo "ros2 component load --node-namespace /talker${i} /ComponentManager composition composition::Talker"; done | parallel

Debugging session:

(gdb) info threads
  Id   Target Id                                            Frame 
* 1    Thread 0x7f90e07c18c0 (LWP 610724) "component_conta" futex_wait (private=0, expected=2, futex_word=0x55d660c278f8) at ../sysdeps/nptl/futex-internal.h:146
  2    Thread 0x7f90bf7fe6c0 (LWP 611779) "rx-2"            futex_wait (private=0, expected=2, futex_word=0x55d660c3bd00) at ../sysdeps/nptl/futex-internal.h:146
  3    Thread 0x7f90bffff6c0 (LWP 610736) "component_conta" 0x00007f90e0cecd61 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55d660c3d908) at ./nptl/futex-internal.c:57
  4    Thread 0x7f90dcaea6c0 (LWP 610735) "component_conta" 0x00007f90e0cecd61 in __futex_abstimed_wait_common64 (private=-592540832, cancel=true, abstime=0x7f90dcae8cd0, op=137, expected=0, futex_word=0x55d660c3d3e0) at ./nptl/futex-internal.c:57
  5    Thread 0x7f90dcceb6c0 (LWP 610734) "rx-1"            futex_wait (private=0, expected=2, futex_word=0x55d660c278f8) at ../sysdeps/nptl/futex-internal.h:146
  6    Thread 0x7f90dd0ed6c0 (LWP 610732) "tx-0"            0x00007f90e0d7e042 in epoll_wait (epfd=16, events=0x7f90d0009880, maxevents=1024, timeout=505) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  7    Thread 0x7f90dd4ef6c0 (LWP 610730) "acc-0"           0x00007f90e0d7e042 in epoll_wait (epfd=12, events=0x55d660c348b0, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  8    Thread 0x7f90dd8f16c0 (LWP 610728) "net-0"           futex_wait (private=0, expected=2, futex_word=0x55d660c278f8) at ../sysdeps/nptl/futex-internal.h:146
  9    Thread 0x7f90ddaf26c0 (LWP 610727) "app-0"           0x00007f90e0d7e042 in epoll_wait (epfd=3, events=0x55d660c27ec0, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  10   Thread 0x7f90dff956c0 (LWP 610726) "component_c-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  11   Thread 0x7f90e07966c0 (LWP 610725) "component_c-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f90e07c18c0 (LWP 610724))]
#0  futex_wait (private=0, expected=2, futex_word=0x55d660c278f8) at ../sysdeps/nptl/futex-internal.h:146
warning: 146    ../sysdeps/nptl/futex-internal.h: No such file or directory
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x55d660c278f8) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x55d660c278f8, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f90e0cf40f1 in lll_mutex_lock_optimized (mutex=0x55d660c278f8) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x55d660c278f8) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007f90df6e0c52 in __gthread_mutex_lock (__mutex=0x55d660c278f8) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#5  0x00007f90df6e0ca6 in std::mutex::lock (this=0x55d660c278f8) at /usr/include/c++/13/bits/std_mutex.h:113
#6  0x00007f90df6e0e88 in std::lock_guard<std::mutex>::lock_guard (this=0x7ffdab924e08, __m=...) at /usr/include/c++/13/bits/std_mutex.h:249
#7  0x00007f90df6e6eb5 in rmw_zenoh_cpp::GraphCache::parse_put (this=0x55d660c27790, keyexpr="@ros2_lv/0/70eb82fae64d6e3095520311f8d33d/271/272/MP/%talker24/talker/%rosout/rcl_interfaces::msg::dds_::Log_/1:1:1,1000", ignore_from_current_session=false) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/detail/graph_cache.cpp:347
#8  0x00007f90df7365fc in graph_sub_data_handler (sample=0x7ffdab9262d0, data=0x55d660c25e30) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_init.cpp:70
#9  0x00007f90ddd06dcd in zenohcd::closures::sample_closure::z_closure_sample_call (closure=0x55d660c3eda0, sample=0x7ffdab9262d0) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/closures/sample_closure.rs:52
#10 0x00007f90ddd80e71 in zenohcd::liveliness::zc_liveliness_declare_subscriber::{closure#0} (sample=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:186
#11 0x00007f90ddf721c8 in zenoh::session::Session::handle_data (self=0x55d660c338c0, local=false, key_expr=0x7ffdab929730, info=..., payload=..., attachment=...) at src/session.rs:1712
#12 0x00007f90ddf7b790 in zenoh::session::{impl#13}::send_declare (self=0x55d660c338c0, msg=...) at src/session.rs:2158
#13 0x00007f90ddf87f14 in zenoh::session::{impl#16}::send_declare (self=0x55d660c338c0, ctx=...) at src/session.rs:2701
#14 0x00007f90de00a616 in zenoh::net::routing::dispatcher::face::{impl#4}::send_declare (self=0x55d660c27d20, msg=...) at src/net/routing/dispatcher/face.rs:210
#15 0x00007f90ddfb1c30 in zenoh::session::Session::declare_liveliness_inner (self=0x55d660c3a240, key_expr=0x7ffdab92c390) at src/session.rs:1385
#16 0x00007f90ddde1b2a in zenoh::liveliness::{impl#10}::res_sync (self=...) at <***removed***>/.cargo/git/checkouts/zenoh-cc237f2570fab813/93f93d2/zenoh/src/liveliness.rs:241
#17 0x00007f90ddd834cd in zenoh_core::SyncResolve::res<zenoh::liveliness::LivelinessTokenBuilder> (self=<error reading variable: Cannot access memory at address 0x80>) at <***removed***>/.cargo/git/checkouts/zenoh-cc237f2570fab813/93f93d2/commons/zenoh-core/src/lib.rs:57
#18 0x00007f90ddd369c7 in zenohcd::liveliness::zc_liveliness_declare_token (session=..., key=..., _options=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:102
#19 0x00007f90df73e1c1 in rmw_create_publisher (node=0x55d661422910, type_supports=0x7f90e09e7b60, topic_name=0x55d6611f9ba0 "/rosout", qos_profile=0x7ffdab9304a0, publisher_options=0x7ffdab930520) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:643
#20 0x00007f90e0c0497a in rcl_publisher_init (publisher=0x7ffdab930498, node=0x55d6612accf0, type_support=0x7f90e09e7b60, topic_name=0x7f90e0c23f75 "/rosout", options=0x7ffdab9304a0) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/publisher.c:111
#21 0x00007f90e0bfd907 in rcl_logging_rosout_init_publisher_for_node (node=0x55d6612accf0) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/logging_rosout.c:252
#22 0x00007f90e1cfb9d5 in rclcpp::node_interfaces::NodeBase::NodeBase (this=0x55d66140fc80, node_name="talker", namespace_="", context=std::shared_ptr<rclcpp::Context> (use count 52, weak count 26) = {...}, rcl_node_options=..., use_intra_process_default=false, enable_topic_statistics_default=false, default_callback_group=std::shared_ptr<rclcpp::CallbackGroup> (empty) = {...})
    at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/node_interfaces/node_base.cpp:118
#23 0x00007f90e1ced8f5 in rclcpp::Node::Node (this=0x55d6613d1a00, node_name="talker", namespace_="", options=...) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/node.cpp:186
#24 0x00007f90e1ced0ea in rclcpp::Node::Node (this=0x55d6613d1a00, node_name="talker", options=...) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/node.cpp:133
#25 0x00007f90dc1e50ed in composition::Talker::Talker (this=0x55d6613d1a00, options=...) at <***removed***>/ws_rmw_zenoh/src/demos/composition/src/talker_component.cpp:34
#26 0x00007f90dc211c3a in std::_Construct<composition::Talker, rclcpp::NodeOptions const&> (__p=0x55d6613d1a00) at /usr/include/c++/13/bits/stl_construct.h:119
#27 0x00007f90dc2115e4 in std::allocator_traits<std::allocator<void> >::construct<composition::Talker, rclcpp::NodeOptions const&> (__p=0x55d6613d1a00) at /usr/include/c++/13/bits/alloc_traits.h:661
#28 std::_Sp_counted_ptr_inplace<composition::Talker, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<rclcpp::NodeOptions const&> (this=0x55d6613d19f0, __a=...) at /usr/include/c++/13/bits/shared_ptr_base.h:604
#29 0x00007f90dc2111b6 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<composition::Talker, std::allocator<void>, rclcpp::NodeOptions const&> (this=0x7ffdab931648, __p=@0x7ffdab931640: 0x0, __a=...) at /usr/include/c++/13/bits/shared_ptr_base.h:971
#30 0x00007f90dc210e8a in std::__shared_ptr<composition::Talker, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<void>, rclcpp::NodeOptions const&> (this=0x7ffdab931640, __tag=...) at /usr/include/c++/13/bits/shared_ptr_base.h:1712
#31 0x00007f90dc210c23 in std::shared_ptr<composition::Talker>::shared_ptr<std::allocator<void>, rclcpp::NodeOptions const&> (this=0x7ffdab931640, __tag=...) at /usr/include/c++/13/bits/shared_ptr.h:464
#32 0x00007f90dc210a7e in std::make_shared<composition::Talker, rclcpp::NodeOptions const&> () at /usr/include/c++/13/bits/shared_ptr.h:1010
#33 0x00007f90dc2108e5 in rclcpp_components::NodeFactoryTemplate<composition::Talker>::create_node_instance (this=0x55d6613cebc0, options=...) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp_components/include/rclcpp_components/node_factory_template.hpp:45
#34 0x00007f90e2218f04 in rclcpp_components::ComponentManager::on_load_node (this=0x55d660c3f990, request_header=std::shared_ptr<rmw_request_id_s> (use count 3, weak count 0) = {...}, request=std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> >> (use count 4, weak count 0) = {...}, 
    response=std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> >> (use count 2, weak count 0) = {...}) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp_components/src/component_manager.cpp:253
#35 0x00007f90e224ab3d in std::__invoke_impl<void, void (rclcpp_components::ComponentManager::*&)(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >), rclcpp_components::ComponentManager*&, std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > > > (
    __f=@0x55d660c5ebb0: &virtual rclcpp_components::ComponentManager::on_load_node(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >), __t=@0x55d660c5ebc0: 0x55d660c3f990) at /usr/include/c++/13/bits/invoke.h:74
#36 0x00007f90e2249d83 in std::__invoke<void (rclcpp_components::ComponentManager::*&)(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >), rclcpp_components::ComponentManager*&, std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::s--Type <RET> for more, q to quit, c to continue without paging--
rv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > > > (
    __fn=@0x55d660c5ebb0: &virtual rclcpp_components::ComponentManager::on_load_node(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)) at /usr/include/c++/13/bits/invoke.h:96
#37 0x00007f90e22487ad in std::_Bind<void (rclcpp_components::ComponentManager::*(rclcpp_components::ComponentManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)>::__call<void, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >&&, 0ul, 1ul, 2ul, 3ul>(std::tuple<std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >&&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul>) (this=0x55d660c5ebb0, __args=...) at /usr/include/c++/13/functional:506
#38 0x00007f90e2247738 in std::_Bind<void (rclcpp_components::ComponentManager::*(rclcpp_components::ComponentManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)>::operator()<std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >, void>(std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >&&) (this=0x55d660c5ebb0) at /usr/include/c++/13/functional:591
#39 0x00007f90e224622f in std::__invoke_impl<void, std::_Bind<void (rclcpp_components::ComponentManager::*(rclcpp_components::ComponentManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)>&, std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > > >(std::__invoke_other, std::_Bind<void (rclcpp_components::ComponentManager::*(rclcpp_components::ComponentManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)>&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >&&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#40 0x00007f90e22444ee in std::__invoke_r<void, std::_Bind<void (rclcpp_components::ComponentManager::*(rclcpp_components::ComponentManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)>&, std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > > >(std::_Bind<void (rclcpp_components::ComponentManager::*(rclcpp_components::ComponentManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)>&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >&&) (__fn=...)
    at /usr/include/c++/13/bits/invoke.h:111
#41 0x00007f90e22420da in std::_Function_handler<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >), std::_Bind<void (rclcpp_components::ComponentManager::*(rclcpp_components::ComponentManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >&&, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >&&) (__functor=..., __args#0=..., __args#1=..., __args#2=...) at /usr/include/c++/13/bits/std_function.h:290
#42 0x00007f90e2251054 in std::function<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >)>::operator()(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> > >, std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> > >) const (this=0x55d660c62df0, __args#0=std::shared_ptr<rmw_request_id_s> (empty) = {...}, __args#1=std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> >> (empty) = {...}, __args#2=std::shared_ptr<composition_interfaces::srv::LoadNode_Response_<std::allocator<void> >> (empty) = {...})
    at /usr/include/c++/13/bits/std_function.h:591
#43 0x00007f90e224f417 in rclcpp::AnyServiceCallback<composition_interfaces::srv::LoadNode>::dispatch (this=0x55d660c62df0, service_handle=std::shared_ptr<rclcpp::Service<composition_interfaces::srv::LoadNode>> (use count 5, weak count 4) = {...}, request_header=std::shared_ptr<rmw_request_id_s> (use count 3, weak count 0) = {...}, 
    request=std::shared_ptr<composition_interfaces::srv::LoadNode_Request_<std::allocator<void> >> (empty) = {...}) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/include/rclcpp/any_service_callback.hpp:183
#44 0x00007f90e224d447 in rclcpp::Service<composition_interfaces::srv::LoadNode>::handle_request (this=0x55d660c62d40, request_header=std::shared_ptr<rmw_request_id_s> (use count 3, weak count 0) = {...}, request=std::shared_ptr<void> (use count 4, weak count 0) = {...}) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/include/rclcpp/service.hpp:478
#45 0x00007f90e1c4b11d in operator() (__closure=0x7ffdab932dc8) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:641
#46 0x00007f90e1c4f3b8 in take_and_do_error_handling<rclcpp::Executor::execute_service(rclcpp::ServiceBase::SharedPtr)::<lambda()>, rclcpp::Executor::execute_service(rclcpp::ServiceBase::SharedPtr)::<lambda()> >(const char *, const char *, struct {...}, struct {...}) (action_description=0x7f90e1f0dd10 "taking a service server request from service", 
    topic_or_service_name=0x55d660c63bc0 "/ComponentManager/_container/load_node", take_action=..., handle_action=...) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:509
#47 0x00007f90e1c4b28a in rclcpp::Executor::execute_service (service=std::shared_ptr<rclcpp::ServiceBase> (use count 5, weak count 4) = {...}) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:637
#48 0x00007f90e1c4a1e6 in rclcpp::Executor::execute_any_executable (this=0x55d660c3f140, any_exec=...) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executor.cpp:474
#49 0x00007f90e1ca2042 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x55d660c3f140) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:42
#50 0x000055d660583636 in main (argc=1, argv=0x7ffdab9331a8) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp_components/src/component_container.cpp:28
(gdb) up 3
#3  ___pthread_mutex_lock (mutex=0x55d660c278f8) at ./nptl/pthread_mutex_lock.c:93
warning: 93 ./nptl/pthread_mutex_lock.c: No such file or directory
(gdb) print mutex.__data.__owner
$3 = 611779
(gdb) thread find 611779
Thread 2 has target id 'Thread 0x7f90bf7fe6c0 (LWP 611779)'

(gdb) thread 2
[Switching to thread 2 (Thread 0x7f90bf7fe6c0 (LWP 611779))]
#0  futex_wait (private=0, expected=2, futex_word=0x55d660c3bd00) at ../sysdeps/nptl/futex-internal.h:146
warning: 146    ../sysdeps/nptl/futex-internal.h: No such file or directory
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x55d660c3bd00) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x55d660c3bd00, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f90e0cf4147 in lll_mutex_lock_optimized (mutex=0x55d660c3bd00) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x55d660c3bd00) at ./nptl/pthread_mutex_lock.c:128
#4  0x00007f90e221a162 in __gthread_mutex_lock (__mutex=0x55d660c3bd00) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#5  0x00007f90e221a1b2 in __gthread_recursive_mutex_lock (__mutex=0x55d660c3bd00) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:811
#6  0x00007f90e221a822 in std::recursive_mutex::lock (this=0x55d660c3bd00) at /usr/include/c++/13/mutex:120
#7  0x00007f90e221db3c in std::lock_guard<std::recursive_mutex>::lock_guard (this=0x7f90bf7efd60, __m=...) at /usr/include/c++/13/bits/std_mutex.h:249
#8  0x00007f90e1c28c0b in rclcpp_logging_output_handler (location=0x7f90df793f10 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f90df7596ed "rmw_zenoh_cpp", timestamp=1718793033330027250, 
    format=0x7f90df759f58 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...", args=0x7f90bf7efe20) at <***removed***>/ws_rmw_zenoh/src/rclcpp/rclcpp/src/rclcpp/context.cpp:133
#9  0x00007f90e1136427 in vrcutils_log_internal (location=0x7f90df793f10 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f90df7596ed "rmw_zenoh_cpp", 
    format=0x7f90df759f58 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...", args=0x7f90bf7efe20) at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1212
#10 0x00007f90e1136607 in rcutils_log_internal (location=0x7f90df793f10 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f90df7596ed "rmw_zenoh_cpp", 
    format=0x7f90df759f58 "Received liveliness token to remove node /%s from the graph before all pub/subs/clients/services for this node have been removed. Removing all entities first...") at <***removed***>/ws_rmw_zenoh/src/rcutils/src/logging.c:1236
#11 0x00007f90df6e969b in rmw_zenoh_cpp::GraphCache::parse_del (this=0x55d660c27790, keyexpr="@ros2_lv/0/976c6187fe83f57542c68ba674db992/0/0/NN/%/_ros2cli_611470", ignore_from_current_session=false) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/detail/graph_cache.cpp:603
#12 0x00007f90df73667a in graph_sub_data_handler (sample=0x7f90bf7f13e0, data=0x55d660c25e30) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_init.cpp:73
#13 0x00007f90ddd06dcd in zenohcd::closures::sample_closure::z_closure_sample_call (closure=0x55d660c3eda0, sample=0x7f90bf7f13e0) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/closures/sample_closure.rs:52
#14 0x00007f90ddd80e71 in zenohcd::liveliness::zc_liveliness_declare_subscriber::{closure#0} (sample=...) at <***removed***>/ws_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/liveliness.rs:186
#15 0x00007f90ddf721c8 in zenoh::session::Session::handle_data (self=0x55d660c338c0, local=false, key_expr=0x7f90bf7f4c40, info=..., payload=..., attachment=...) at src/session.rs:1712
#16 0x00007f90ddf7c944 in zenoh::session::{impl#13}::send_declare (self=0x55d660c338c0, msg=...) at src/session.rs:2195
#17 0x00007f90ddf87f14 in zenoh::session::{impl#16}::send_declare (self=0x55d660c338c0, ctx=...) at src/session.rs:2701
#18 0x00007f90de00aa30 in zenoh::net::routing::dispatcher::face::{impl#4}::send_declare (self=0x7f90c9043420, msg=...) at src/net/routing/dispatcher/face.rs:225
#19 0x00007f90de1a3bde in zenoh::net::primitives::demux::{impl#1}::handle_message (self=0x7f90c9043420, msg=...) at src/net/primitives/demux.rs:69
#20 0x00007f90de0ead8e in zenoh::net::runtime::{impl#6}::handle_message (self=0x7f90c8f7e3d0, msg=...) at src/net/runtime/mod.rs:385
#21 0x00007f90de300709 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::trigger_callback (self=0x7f90c90454c0, callback=..., msg=...) at src/unicast/universal/rx.rs:49
#22 0x00007f90de301773 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::handle_frame (self=0x7f90c90454c0, frame=...) at src/unicast/universal/rx.rs:100
#23 0x00007f90de3047b1 in zenoh_transport::unicast::universal::transport::TransportUnicastUniversal::read_messages (self=0x7f90c90454c0, batch=..., link=0x7f90c9045588) at src/unicast/universal/rx.rs:204
#24 0x00007f90de3b977e in zenoh_transport::unicast::universal::link::rx_task::{async_fn#0} () at src/unicast/universal/link.rs:266
#25 0x00007f90de3b60b5 in zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block#0} () at src/unicast/universal/link.rs:124
#26 0x00007f90de335a94 in tokio_util::task::task_tracker::{impl#8}::poll<zenoh_transport::unicast::universal::link::{impl#0}::start_rx::{async_block_env#0}> (self=..., cx=0x7f90bf7fb498) at <***removed***>/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-util-0.7.10/src/task/task_tracker.rs:669
...
(gdb) up 3
#3  ___pthread_mutex_lock (mutex=0x55d660c3bd00) at ./nptl/pthread_mutex_lock.c:128
warning: 128    ./nptl/pthread_mutex_lock.c: No such file or directory
(gdb) print mutex.__data.__owner
$4 = 610724
(gdb) thread find 610724
Thread 1 has target id 'Thread 0x7f90e07c18c0 (LWP 610724)'

Thread 1 in frame 7 wants to acquire graph_mutex_: https://github.com/ros2/rmw_zenoh/blob/d3396a59f7b005e152244c212178c43951b2b113/rmw_zenoh_cpp/src/detail/graph_cache.cpp#L347

Thread 1 has acquired logging_mutex in frame 22 before calling rcl_logging_rosout_init_publisher_for_node(): https://github.com/ros2/rclcpp/blob/7f5b44ebe22d269d329d85b678f82ae96d66ad20/rclcpp/src/rclcpp/node_interfaces/node_base.cpp#L117

Thread 2 in frame 8 wants to acquire logging_mutex: https://github.com/ros2/rclcpp/blob/7f5b44ebe22d269d329d85b678f82ae96d66ad20/rclcpp/src/rclcpp/context.cpp#L133

Thread 2 has acquired graph_mutex_ in frame 11 earlier in parse_del(): https://github.com/ros2/rmw_zenoh/blob/d3396a59f7b005e152244c212178c43951b2b113/rmw_zenoh_cpp/src/detail/graph_cache.cpp#L564
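
In other words, the two threads take the same pair of locks in opposite orders. A minimal C++ sketch of that pattern (hypothetical mutex and function names standing in for graph_mutex_ and rclcpp's logging mutex, not the actual rmw_zenoh/rclcpp code):

#include <mutex>
#include <thread>

// Stand-ins for GraphCache::graph_mutex_ and rclcpp's global logging mutex.
std::mutex graph_mutex;
std::recursive_mutex logging_mutex;

// Thread 1 (node creation path): takes the logging mutex to set up the
// /rosout publisher, then needs the graph mutex to register the publisher.
void create_node_path() {
  std::lock_guard<std::recursive_mutex> logging_lock(logging_mutex);
  std::lock_guard<std::mutex> graph_lock(graph_mutex);  // blocks if thread 2 holds it
}

// Thread 2 (zenoh rx task): takes the graph mutex in parse_del(), then logs a
// warning, and the overridden output handler needs the logging mutex.
void parse_del_path() {
  std::lock_guard<std::mutex> graph_lock(graph_mutex);
  std::lock_guard<std::recursive_mutex> logging_lock(logging_mutex);  // blocks if thread 1 holds it
}

int main() {
  std::thread t1(create_node_path);
  std::thread t2(parse_del_path);
  t1.join();
  t2.join();  // with unlucky timing neither thread ever finishes
}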

rmw_zenoh: d3396a59f7b005e152244c212178c43951b2b113 rclcpp: 7f5b44ebe22d269d329d85b678f82ae96d66ad20 rcl: 00c29cd6e72caae36e1e749a8c02558ee709a275

Yadunund commented 2 weeks ago

@hwoithe I was able to successfully run the following on Jazzy with the #203 branch (it takes a while to complete):

for i in {1..100}; do echo "ros2 component load --node-namespace /talker${i} /ComponentManager composition composition::Talker"; done | parallel

Could you please retry with this branch?

Yadunund commented 2 weeks ago

Ah I am able to reproduce it in subsequent tries.

However, if I add a sleep in between, it succeeds each time.

for i in {1..100}; do echo "ros2 component load --node-namespace /talker${i} /ComponentManager composition composition::Talker"; sleep 1; done | parallel

Nonetheless, there is still a chance for a deadlock in graph_cache. Investigating...

hwoithe commented 2 weeks ago

In general, I now run the parallel version to reproduce the graph lock issue and the non-parallel version to reproduce the rmw_wait issue. I am still experiencing a hang in rmw_wait from a node in the for loop, even with https://github.com/ros2/rmw_zenoh/pull/203.

With https://github.com/ros2/rmw_zenoh/pull/203:

(gdb) info threads
  Id   Target Id                                     Frame 
* 1    Thread 0x7f3b6c04ab80 (LWP 954695) "ros2"     0x00007f3b6c0e4d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x2f4cd70) at ./nptl/futex-internal.c:57
  2    Thread 0x7f3b4f3886c0 (LWP 954708) "ros2"     0x00007f3b6c0e4d61 in __futex_abstimed_wait_common64 (private=1329102208, cancel=true, abstime=0x7f3b4f387e50, op=137, expected=0, futex_word=0x31c1110) at ./nptl/futex-internal.c:57
  3    Thread 0x7f3b4f7fb6c0 (LWP 954707) "rx-1"     syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7f3b4f9fc6c0 (LWP 954706) "rx-0"     0x00007f3b6c176042 in epoll_wait (epfd=19, events=0x7f3b50131d10, maxevents=1024, timeout=3732) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7f3b4fbfd6c0 (LWP 954705) "tx-0"     0x00007f3b6c176042 in epoll_wait (epfd=16, events=0x7f3b50009f10, maxevents=1024, timeout=2387) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  6    Thread 0x7f3b4ffff6c0 (LWP 954703) "acc-0"    0x00007f3b6c176042 in epoll_wait (epfd=12, events=0x309c2d0, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  7    Thread 0x7f3b6055a6c0 (LWP 954701) "net-0"    0x00007f3b6c176042 in epoll_wait (epfd=8, events=0x2fbd190, maxevents=1024, timeout=2147483647) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  8    Thread 0x7f3b6075b6c0 (LWP 954700) "app-0"    0x00007f3b6c176042 in epoll_wait (epfd=3, events=0x2fec5f0, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  9    Thread 0x7f3b60f5c6c0 (LWP 954698) "ros2"     0x00007f3b6c0e4d61 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f3b6b222580 <(anonymous namespace)::g_signal_handler_sem>) at ./nptl/futex-internal.c:57
  10   Thread 0x7f3b697f96c0 (LWP 954697) "ros2-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  11   Thread 0x7f3b69ffa6c0 (LWP 954696) "ros2-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
...
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f3b6c04ab80 (LWP 954695))]
#0  0x00007f3b6c0e4d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x2f4cd70) at ./nptl/futex-internal.c:57
57  in ./nptl/futex-internal.c
(gdb) bt
#0  0x00007f3b6c0e4d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x2f4cd70) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x2f4cd70) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x2f4cd70, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f3b6c0e77dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2f4cd78, cond=0x2f4cd48) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x2f4cd48, mutex=0x2f4cd78) at ./nptl/pthread_cond_wait.c:627
#5  0x00007f3b68fb2276 in rmw_wait (subscriptions=0x31f7308, guard_conditions=0x31f7320, services=0x31f7350, clients=0x31f7338, events=0x31f7368, wait_set=0x32ba850, wait_timeout=0x0) at <***removed***>/ws_rmw_zenoh/src/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:3277
#6  0x00007f3b6ae7ab1b in rcl_wait (wait_set=0x32c26f0, timeout=-1) at <***removed***>/ws_rmw_zenoh/src/rcl/rcl/src/rcl/wait.c:670
#7  0x00007f3b6b10299d in rclpy::WaitSet::wait (this=0x31ed840, timeout=-1) at <***removed***>/ws_rmw_zenoh/src/rclpy/rclpy/src/rclpy/wait_set.cpp:247
#8  0x00007f3b6b10708d in pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}::operator()(rclpy::WaitSet*, long) const (__closure=0x2a048f8, c=0x31ed840, args#0=-1)
    at /usr/include/pybind11/pybind11.h:111
#9  0x00007f3b6b10c87b in pybind11::detail::argument_loader<rclpy::WaitSet*, long>::call_impl<void, pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&, 0ul, 1ul, pybind11::detail::void_type>(pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul>, pybind11::detail::void_type&&) && (
    this=0x7fff5aa59cd0, f=...) at /usr/include/pybind11/cast.h:1480
#10 0x00007f3b6b10ada4 in pybind11::detail::argument_loader<rclpy::WaitSet*, long>::call<void, pybind11::detail::void_type, pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&>(pybind11::cpp_function::cpp_function<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&) && (this=0x7fff5aa59cd0, f=...) at /usr/include/pybind11/cast.h:1454
#11 0x00007f3b6b10a004 in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}, void, rclpy::WaitSet*, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(pybind11::cpp_function::initialize<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&&, void (*)(rclpy::WaitSet*, long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const (__closure=0x0, call=...) at /usr/include/pybind11/pybind11.h:254
#12 0x00007f3b6b10a0dc in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}, void, rclpy::WaitSet*, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(pybind11::cpp_function::initialize<void, rclpy::WaitSet, long, pybind11::name, pybind11::is_method, pybind11::sibling, char [48]>(void (rclpy::WaitSet::*)(long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(rclpy::WaitSet*, long)#1}&&, void (*)(rclpy::WaitSet*, long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, char const (&) [48])::{lambda(pybind11::detail::function_call&)#1}::_FUN(pybind11::detail::function_call&) () at /usr/include/pybind11/pybind11.h:224
#13 0x00007f3b6af73ffe in pybind11::cpp_function::dispatcher (self=<PyCapsule at remote 0x7f3b6b336430>, args_in=(<rclpy._rclpy_pybind11.WaitSet at remote 0x7f3b4ea2ee70>, -1), kwargs_in=0x0) at /usr/include/pybind11/pybind11.h:946
#14 0x000000000058201f in cfunction_call (func=<built-in method wait of PyCapsule object at remote 0x7f3b6b336430>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/methodobject.c:537
#15 0x0000000000548f55 in _PyObject_MakeTpCall (tstate=0xba5048 <_PyRuntime+459656>, callable=<built-in method wait of PyCapsule object at remote 0x7f3b6b336430>, args=<optimized out>, nargs=2, keywords=0x0) at ../Objects/call.c:240
...
(gdb) py-bt
Traceback (most recent call first):
  <built-in method wait of PyCapsule object at remote 0x7f3b6b336430>
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line ?, in _wait_for_ready_callbacks
    (failed to get frame line number)
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 781, in wait_for_ready_callbacks
    return next(self._cb_iter)
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 811, in _spin_once_impl
    handler, entity, node = self.wait_for_ready_callbacks(
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 835, in spin_once_until_future_complete
    self._spin_once_impl(timeout_sec, future.done)
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 321, in spin_until_future_complete
    self.spin_once_until_future_complete(future, timeout_sec)
  File "<***removed***>/ws_rmw_zenoh/install/rclpy/lib/python3.12/site-packages/rclpy/__init__.py", line 266, in spin_until_future_complete
    executor.spin_until_future_complete(future, timeout_sec)
  File "/opt/ros/jazzy/lib/python3.12/site-packages/ros2component/api/__init__.py", line 247, in load_component_into_container
    rclpy.spin_until_future_complete(node, future)
  File "/opt/ros/jazzy/lib/python3.12/site-packages/ros2component/verb/load.py", line 49, in main
    component_uid, component_name = load_component_into_container(
  File "/opt/ros/jazzy/lib/python3.12/site-packages/ros2component/command/component.py", line 37, in main
    return extension.main(args=args)
  File "/opt/ros/jazzy/lib/python3.12/site-packages/ros2cli/cli.py", line 91, in main
    rc = extension.main(parser=parser, args=args)
  File "/opt/ros/jazzy/bin/ros2", line 33, in <module>
    sys.exit(load_entry_point('ros2cli==0.32.1', 'console_scripts', 'ros2')())

The node is waiting in frame 5: https://github.com/ros2/rmw_zenoh/blob/45391cc378c9b27f00eeac8a32df743f386a81b9/rmw_zenoh_cpp/src/rmw_zenoh.cpp#L3277

I have not verified whether the appropriate events are being correctly sent or received in the first place.
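
For what it's worth, the shape of the blocked call in frame 5 is an unbounded condition-variable wait (wait_timeout is null), so if the matching notification is never delivered the CLI node blocks forever. A minimal sketch of that pattern (hypothetical names, not the rmw_zenoh implementation):

#include <condition_variable>
#include <mutex>

std::mutex wait_set_mutex;
std::condition_variable wait_set_cv;
bool entity_ready = false;  // would be set by the rx thread when the service response arrives

// Simplified shape of an rmw_wait-style call with an infinite timeout.
void wait_for_ready_entities() {
  std::unique_lock<std::mutex> lock(wait_set_mutex);
  // With no timeout, this only returns once someone sets entity_ready and
  // notifies the condition variable; a lost or never-sent notification means
  // the caller hangs here indefinitely.
  wait_set_cv.wait(lock, [] { return entity_ready; });
}

int main() {
  wait_for_ready_entities();  // hangs, since nothing ever notifies
}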

rmw_zenoh: 45391cc378c9b27f00eeac8a32df743f386a81b9 rclpy: e181737f19d6b8cb335b0326a40ec2075267c555

Yadunund commented 2 weeks ago

Thanks @hwoithe. Will keep digging...

Yadunund commented 2 weeks ago
#8  0x00007f90e1c28c0b in rclcpp_logging_output_handler (location=0x7f90df793f10 <rmw_zenoh_cpp::GraphCache::parse_del(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)::__rcutils_logging_location>, severity=30, name=0x7f90df7596ed "rmw_zenoh_cpp", timestamp=1718793033330027250, 

@clalancette and I took a closer look at the stack trace from https://github.com/ros2/rmw_zenoh/issues/182#issuecomment-2178911481 and discovered that it is indeed an ABBA deadlock resulting from the two threads trying to acquire both graph_mutex_ and the scoped_logging_guard. 🤦🏼

The problem arises from the RCUTILS_LOG_*_NAMED macros. By default vrcutils_log_internal logs to stdout/stderr via g_rcutils_logging_output_handler, but this handler is overridden via rcutils_logging_set_output_handler in rclcpp and rclpy to also publish to /rosout.
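
Conceptually, the chain looks like the sketch below (heavily simplified: the real rcutils handler takes location, timestamp and va_list arguments, and the client-library handler shown here is only illustrative):

// Simplified sketch of the handler override described above; the names and
// signatures are abbreviated stand-ins, not the real rcutils/rclcpp API.
#include <cstdio>
#include <mutex>

using output_handler_t = void (*)(int severity, const char * name, const char * msg);

// Default behaviour: print to the console only.
void console_output_handler(int /*severity*/, const char * name, const char * msg) {
  std::fprintf(stderr, "[%s] %s\n", name, msg);
}

output_handler_t g_output_handler = console_output_handler;
std::recursive_mutex logging_mutex;  // stands in for rclcpp's global logging mutex

// rclcpp/rclpy install a handler that also publishes to /rosout, and that
// handler takes the logging mutex before doing anything else.
void client_library_output_handler(int severity, const char * name, const char * msg) {
  std::lock_guard<std::recursive_mutex> lock(logging_mutex);
  console_output_handler(severity, name, msg);
  // ... publish the same message to /rosout ...
}

// What an RCUTILS_LOG_*_NAMED call boils down to: it always goes through the
// single global handler, whichever one is currently installed.
void log_named(int severity, const char * name, const char * msg) {
  g_output_handler(severity, name, msg);
}

int main() {
  // rclcpp installs its handler at init time, so a later log call from inside
  // the RMW (e.g. GraphCache::parse_del) re-enters the client library and
  // takes the logging mutex while graph_mutex_ is still held.
  g_output_handler = client_library_output_handler;
  log_named(30, "rmw_zenoh_cpp", "example warning");
}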

I'm working on a fix where we rely on our own logging functions in rmw_zenoh that print to stdout/stderr.
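
For illustration, a console-only helper of that kind could look roughly like the following (a hypothetical sketch, not the actual patch):

#include <cstdarg>
#include <cstdio>

// Hypothetical console-only logging helper for rmw_zenoh_cpp. It never goes
// through the global rcutils output handler, so it can safely be called while
// graph_mutex_ is held without re-entering rclcpp's logging mutex.
void rmw_zenoh_log_warn(const char * fmt, ...) {
  va_list args;
  va_start(args, fmt);
  std::fprintf(stderr, "[WARN] [rmw_zenoh_cpp]: ");
  std::vfprintf(stderr, fmt, args);
  std::fprintf(stderr, "\n");
  va_end(args);
}

int main() {
  rmw_zenoh_log_warn("Removing all entities for node /%s before deleting it", "talker1");
}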

The deadlock in rmw_wait will be addressed separately.

hwoithe commented 2 weeks ago

I'm working on a fix where we rely on our own logging functions in rmw_zenoh that print to stdout/stderr.

Would you suggest that all RMW implementations migrate to their own logging mechanism to avoid potential logger-related deadlocks?

rmw_fastrtps has encountered this as well: https://github.com/ros2/rmw_fastrtps/blob/1417a44b5022264c00c5cf748eea00e3dac698e9/rmw_fastrtps_shared_cpp/include/rmw_fastrtps_shared_cpp/custom_participant_info.hpp#L216

I can see the value of an RMW-safe replacement for (or fix to) the RCUTILS_LOG_*_NAMED macros.

clalancette commented 2 weeks ago

I can see the value of an RMW-safe replacement for (or fix to) the RCUTILS_LOG_*_NAMED macros.

As it stands, that is probably not feasible. The RCUTILS macros work through a single global logging callback. Even if we had some way to switch back and forth between the "console-only" callback and the one installed by the client library (which we currently don't), I don't see a good way to do it in a thread-safe manner.

That said, we could add a new set of logging APIs to rcutils, specifically for the RMW layer, that would only ever use the console. That would have the benefit of reusing a bunch of the code behind the "full-featured" rcutils logging.

For now, I think the easiest thing to do (and backport) is to fix this locally in rmw_zenoh_cpp, and then consider one of these more generic solutions later.

hwoithe commented 2 weeks ago

...we could add a new set of logging APIs to rcutils specifically for the RMW layer, that would only ever use the console.

I was thinking this as well; it would be nice to have in the long term.

Thanks to all for the help on this issue.