ros2 / rmw_fastrtps

Implementation of the ROS Middleware (rmw) Interface using eProsima's Fast RTPS.
Apache License 2.0

Node randomly stops responding due to possible deadlock in fastrtps middleware #695

Open ksuszka opened 1 year ago

ksuszka commented 1 year ago

Bug report

Required Info:

Steps to reproduce issue

Unfortunately, the issue occurs randomly; it has happened three times so far. We have a system with a few dozen nodes. During testing on a vehicle, ekf_node from the robot_localization package stopped responding. The process was still running, but it wasn't discoverable by ros2 tooling and it sent no messages. We had already seen some messages not being delivered after we switched to FastDDS, but until now we blamed the SHM transport. This complete hang was something new.

We used strace to check what was going on, but it didn't show any activity in the ekf_node process. So we attached gdb to it and checked the stack traces of all threads, and it seemed like all threads were waiting somewhere inside the FastDDS middleware.

Here is the gdb output:

(gdb) t
[Current thread is 1 (Thread 0x7fd27fa18380 (LWP 759))]
(gdb) info threads
  Id   Target Id                                  Frame 
* 1    Thread 0x7fd27fa18380 (LWP 759) "ekf_node" 0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  2    Thread 0x7fd27ec41640 (LWP 781) "ekf_node" 0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  3    Thread 0x7fd27e440640 (LWP 788) "ekf_node" 0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  4    Thread 0x7fd27dbb8640 (LWP 789) "ekf_node" 0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  5    Thread 0x7fd27d3b7640 (LWP 790) "ekf_node" 0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7fd27cbb6640 (LWP 791) "ekf_node" 0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7fd27c3b5640 (LWP 798) "ekf_node" 0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7fd27bbb4640 (LWP 799) "ekf_node" 0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  9    Thread 0x7fd27b3b3640 (LWP 802) "ekf_node" 0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  10   Thread 0x7fd27aad2640 (LWP 803) "ekf_node" 0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  11   Thread 0x7fd27a2c9640 (LWP 809) "ekf_node" 0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
  12   Thread 0x7fd2799d7640 (LWP 827) "ekf_node" 0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) thread apply all bt

Thread 12 (Thread 0x7fd2799d7640 (LWP 827) "ekf_node"):
#0  0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe4eac1 in pthread_cond_wait () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f567d82 in eprosima::fastdds::dds::detail::WaitSetImpl::wait(std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >&, eprosima::fastrtps::Time_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f567dfa in eprosima::fastdds::dds::WaitSet::wait(std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >&, eprosima::fastrtps::Time_t) const () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f999504 in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) () from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#5  0x00007fd27f9f0707 in rmw_wait () from /opt/ros/humble/lib/librmw_fastrtps_cpp.so
#6  0x00007fd27fce6718 in rcl_wait () from /opt/ros/humble/lib/librcl.so
#7  0x00007fd2802f0dbc in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/humble/lib/librclcpp.so
#8  0x00007fd2802f3ab3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/humble/lib/librclcpp.so
#9  0x00007fd2802faf31 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/humble/lib/librclcpp.so
#10 0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#12 0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 11 (Thread 0x7fd27a2c9640 (LWP 809) "ekf_node"):
#0  0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe4eac1 in pthread_cond_wait () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f567d82 in eprosima::fastdds::dds::detail::WaitSetImpl::wait(std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >&, eprosima::fastrtps::Time_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f567dfa in eprosima::fastdds::dds::WaitSet::wait(std::vector<eprosima::fastdds::dds::Condition*, std::allocator<eprosima::fastdds::dds::Condition*> >&, eprosima::fastrtps::Time_t) const () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f999504 in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) () from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#5  0x00007fd27f9850e2 in ?? () from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#6  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#8  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 10 (Thread 0x7fd27aad2640 (LWP 803) "ekf_node"):
#0  0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe530dd in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f3c0b34 in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f40cc3e in eprosima::fastrtps::rtps::RTPSMessageGroup::send() () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f40cf0d in eprosima::fastrtps::rtps::RTPSMessageGroup::flush() () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007fd27f40cf2d in eprosima::fastrtps::rtps::RTPSMessageGroup::flush_and_reset() () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007fd27f5bc34d in eprosima::fastdds::rtps::FlowControllerImpl<eprosima::fastdds::rtps::FlowControllerSyncPublishMode, eprosima::fastdds::rtps::FlowControllerFifoSchedule>::run() () from /opt/ros/humble/lib/libfastrtps.so.2.6
#7  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#9  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 9 (Thread 0x7fd27b3b3640 (LWP 802) "ekf_node"):
#0  0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe57ad3 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f64ac4a in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f64b44c in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f649487 in eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007fd27f64b96b in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#8  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 8 (Thread 0x7fd27bbb4640 (LWP 799) "ekf_node"):
#0  0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe530dd in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f57022d in eprosima::fastrtps::rtps::PDP::assert_remote_participant_liveliness(eprosima::fastrtps::rtps::GuidPrefix_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f4145ed in eprosima::fastrtps::rtps::MessageReceiver::processCDRMsg(eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::CDRMessage_t*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f41a09f in eprosima::fastrtps::rtps::ReceiverResource::OnDataReceived(unsigned char const*, unsigned int, eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::Locator_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007fd27f4a07d6 in eprosima::fastdds::rtps::UDPChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007fd27f49b08b in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#7  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#9  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 7 (Thread 0x7fd27c3b5640 (LWP 798) "ekf_node"):
#0  0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe57ad3 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f64ac4a in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f64b44c in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f649487 in eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007fd27f64b96b in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#8  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 6 (Thread 0x7fd27cbb6640 (LWP 791) "ekf_node"):
#0  0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe53082 in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f5b08ca in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f3d6285 in eprosima::fastrtps::rtps::StatefulWriter::change_removed_by_history(eprosima::fastrtps::rtps::CacheChange_t*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f3f1147 in eprosima::fastrtps::rtps::WriterHistory::remove_change_nts(__gnu_cxx::__normal_iterator<eprosima::fastrtps::rtps::CacheChange_t* const*, std::vector<eprosima::fastrtps::rtps::CacheChange_t*, std::allocator<eprosima::fastrtps::rtps::CacheChange_t*> > >, bool) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007fd27f3e2732 in eprosima::fastrtps::rtps::History::remove_change(eprosima::fastrtps::rtps::CacheChange_t*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007fd27f45cc2b in eprosima::fastdds::dds::DataWriterHistory::remove_change_pub(eprosima::fastrtps::rtps::CacheChange_t*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#7  0x00007fd27f3d6cde in eprosima::fastrtps::rtps::StatefulWriter::check_acked_status() () from /opt/ros/humble/lib/libfastrtps.so.2.6
#8  0x00007fd27f3dad23 in eprosima::fastrtps::rtps::StatefulWriter::matched_reader_remove(eprosima::fastrtps::rtps::GUID_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#9  0x00007fd27f57cc2e in eprosima::fastrtps::rtps::EDP::unpairReaderProxy(eprosima::fastrtps::rtps::GUID_t const&, eprosima::fastrtps::rtps::GUID_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#10 0x00007fd27f56f605 in eprosima::fastrtps::rtps::PDP::removeReaderProxyData(eprosima::fastrtps::rtps::GUID_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#11 0x00007fd27f5847e5 in eprosima::fastrtps::rtps::EDPSimpleSUBListener::onNewCacheChangeAdded(eprosima::fastrtps::rtps::RTPSReader*, eprosima::fastrtps::rtps::CacheChange_t const*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#12 0x00007fd27f3fac04 in eprosima::fastrtps::rtps::StatefulReader::NotifyChanges(eprosima::fastrtps::rtps::WriterProxy*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#13 0x00007fd27f3fb37b in eprosima::fastrtps::rtps::StatefulReader::change_received(eprosima::fastrtps::rtps::CacheChange_t*, eprosima::fastrtps::rtps::WriterProxy*, unsigned long) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#14 0x00007fd27f3fb8a1 in eprosima::fastrtps::rtps::StatefulReader::processDataMsg(eprosima::fastrtps::rtps::CacheChange_t*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#15 0x00007fd27f408ff0 in eprosima::fastrtps::rtps::MessageReceiver::process_data_message_without_security(eprosima::fastrtps::rtps::EntityId_t const&, eprosima::fastrtps::rtps::CacheChange_t&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#16 0x00007fd27f412a2b in eprosima::fastrtps::rtps::MessageReceiver::proc_Submsg_Data(eprosima::fastrtps::rtps::CDRMessage_t*, eprosima::fastrtps::rtps::SubmessageHeader_t*) const () from /opt/ros/humble/lib/libfastrtps.so.2.6
#17 0x00007fd27f414678 in eprosima::fastrtps::rtps::MessageReceiver::processCDRMsg(eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::CDRMessage_t*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#18 0x00007fd27f41a09f in eprosima::fastrtps::rtps::ReceiverResource::OnDataReceived(unsigned char const*, unsigned int, eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::Locator_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#19 0x00007fd27f4a07d6 in eprosima::fastdds::rtps::UDPChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#20 0x00007fd27f49b08b in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#21 0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#23 0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 5 (Thread 0x7fd27d3b7640 (LWP 790) "ekf_node"):
#0  0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe530dd in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f57022d in eprosima::fastrtps::rtps::PDP::assert_remote_participant_liveliness(eprosima::fastrtps::rtps::GuidPrefix_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f4145ed in eprosima::fastrtps::rtps::MessageReceiver::processCDRMsg(eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::CDRMessage_t*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f41a09f in eprosima::fastrtps::rtps::ReceiverResource::OnDataReceived(unsigned char const*, unsigned int, eprosima::fastrtps::rtps::Locator_t const&, eprosima::fastrtps::rtps::Locator_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007fd27f4a07d6 in eprosima::fastdds::rtps::UDPChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007fd27f49b08b in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#7  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#9  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 4 (Thread 0x7fd27dbb8640 (LWP 789) "ekf_node"):
#0  0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe530dd in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f3d6e45 in eprosima::fastrtps::rtps::StatefulWriter::perform_nack_supression(eprosima::fastrtps::rtps::GUID_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f3d6f0b in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f3bd69c in eprosima::fastrtps::rtps::TimedEventImpl::trigger(std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007fd27f3bd8ca in eprosima::fastrtps::rtps::ResourceEvent::do_timer_actions() () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007fd27f3bdbb7 in eprosima::fastrtps::rtps::ResourceEvent::event_service() () from /opt/ros/humble/lib/libfastrtps.so.2.6
#7  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#9  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7fd27e440640 (LWP 788) "ekf_node"):
#0  0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe4f35d in pthread_cond_clockwait () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f63caa1 in eprosima::fastdds::rtps::SharedMemWatchdog::run() () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7fd27ec41640 (LWP 781) "ekf_node"):
#0  0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe57cf8 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd2803689b2 in rclcpp::SignalHandler::wait_for_signal() () from /opt/ros/humble/lib/librclcpp.so
#3  0x00007fd280369a2e in rclcpp::SignalHandler::deferred_signal_handler() () from /opt/ros/humble/lib/librclcpp.so
#4  0x00007fd2800bf2b3 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fd27fe4fb43 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#6  0x00007fd27fee1a00 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7fd27fa18380 (LWP 759) "ekf_node"):
#0  0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe530dd in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd27f452042 in eprosima::fastdds::dds::DataWriterImpl::perform_create_new_change(eprosima::fastrtps::rtps::ChangeKind_t, void*, eprosima::fastrtps::rtps::WriteParams&, eprosima::fastrtps::rtps::InstanceHandle_t const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007fd27f453653 in eprosima::fastdds::dds::DataWriterImpl::create_new_change_with_params(eprosima::fastrtps::rtps::ChangeKind_t, void*, eprosima::fastrtps::rtps::WriteParams&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007fd27f453730 in eprosima::fastdds::dds::DataWriterImpl::create_new_change(eprosima::fastrtps::rtps::ChangeKind_t, void*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007fd27f453769 in eprosima::fastdds::dds::DataWriterImpl::write(void*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007fd27f98b854 in rmw_fastrtps_shared_cpp::__rmw_publish(char const*, rmw_publisher_s const*, void const*, rmw_publisher_allocation_s*) () from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#7  0x00007fd27fcdeade in ?? () from /opt/ros/humble/lib/librcl.so
#8  0x00007fd27fd9c515 in ?? () from /opt/ros/humble/lib/libtf2_ros.so
#9  0x00007fd27fd9c7fa in tf2_ros::TransformBroadcaster::sendTransform(std::vector<geometry_msgs::msg::TransformStamped_<std::allocator<void> >, std::allocator<geometry_msgs::msg::TransformStamped_<std::allocator<void> > > > const&) () from /opt/ros/humble/lib/libtf2_ros.so
#10 0x00007fd27fd9c904 in tf2_ros::TransformBroadcaster::sendTransform(geometry_msgs::msg::TransformStamped_<std::allocator<void> > const&) () from /opt/ros/humble/lib/libtf2_ros.so
#11 0x00007fd28075583d in robot_localization::RosFilter<robot_localization::Ekf>::periodicUpdate() () from /as_drive/ws/install/robot_localization/lib/librl_lib.so
#12 0x00007fd2806c4738 in rclcpp::GenericTimer<std::function<void ()>, (void*)0>::execute_callback() () from /as_drive/ws/install/robot_localization/lib/librl_lib.so
#13 0x00007fd2802f3651 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) () from /opt/ros/humble/lib/librclcpp.so
#14 0x00007fd2802faf40 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/humble/lib/librclcpp.so
#15 0x00007fd2802fb155 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) () from /opt/ros/humble/lib/librclcpp.so
#16 0x000055fcf2e045f6 in main ()

I'm not 100% sure that this is caused by the FastDDS middleware or its SHM transport. However, we had been using the same ekf_node with CycloneDDS for the past two years without issues, the hang appeared only after we switched to FastDDS, and all threads are stuck inside the DDS middleware, so this is our best guess at the moment.
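For readers working through a trace like the one above, a small script can summarize which threads are parked in which pthread call. The following is a minimal sketch, not part of the original report: the function name `summarize_blocking` and the sample text are hypothetical, and the parser assumes the frame layout shown in this issue (`#N  0xADDR in SYMBOL ...`). In the trace above, threads 1, 4, 5, 6, 8 and 10 are inside `pthread_mutex_lock`, which is consistent with a lock-ordering problem but does not prove one by itself.

```python
import re
from typing import Dict, List

# Matches headers such as: Thread 12 (Thread 0x7fd2799d7640 (LWP 827) "ekf_node"):
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Matches frames such as: #1  0x00007fd27fe530dd in pthread_mutex_lock () from ...
FRAME_RE = re.compile(r"^#\d+\s+0x[0-9a-f]+ in (\S+)")

# Shortened, made-up sample in the same layout as the gdb output in this issue.
SAMPLE = """\
Thread 2 (Thread 0x1 (LWP 2) "node"):
#0  0x00007fd27fe4c197 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe4eac1 in pthread_cond_wait () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x2 (LWP 1) "node"):
#0  0x00007fd27fe4c340 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd27fe530dd in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x000055fcf2e045f6 in main ()
"""


def summarize_blocking(bt_text: str) -> Dict[str, List[int]]:
    """Group gdb thread ids by the first pthread_* frame in each backtrace."""
    frames: Dict[int, List[str]] = {}
    current = None
    for raw in bt_text.splitlines():
        line = raw.strip()
        thread = THREAD_RE.match(line)
        if thread:
            current = int(thread.group(1))
            frames[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            frames[current].append(frame.group(1))

    groups: Dict[str, List[int]] = {}
    for tid, symbols in frames.items():
        # Threads with no pthread_* frame (e.g. only unresolved "??") go to "other".
        key = next((s for s in symbols if s.startswith("pthread_")), "other")
        groups.setdefault(key, []).append(tid)
    return groups


print(summarize_blocking(SAMPLE))
```

Threads grouped under `pthread_cond_wait` are usually idle waiters, so the `pthread_mutex_lock` group is the more interesting one when looking for a deadlock.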

Expected behavior

ROS node should work without hangs.

Actual behavior

ROS node stopped responding.

Additional information

Unfortunately, we don't have an easy way to reliably reproduce this issue. We tested it in the same environment multiple times and it has only happened three times so far. This gdb analysis is from the third occurrence.

Additionally, it would be great to know which actions or diagnostic tools we should use if we encounter this issue again, to make it easier to diagnose and fix.
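One low-effort option for a future occurrence is to capture the same information as above non-interactively, so it can be saved before the node is restarted. The sketch below is an assumption-laden example, not an official ROS 2 or Fast DDS tool: the helper names `build_gdb_command` and `dump_backtraces` are hypothetical, while `-p`, `-batch` and `-ex` are standard gdb options. Attaching to a running process may require elevated privileges depending on the system's ptrace settings.

```python
import subprocess
from typing import List


def build_gdb_command(pid: int) -> List[str]:
    """Return a gdb batch invocation that prints a backtrace for every thread."""
    return [
        "gdb",
        "-p", str(pid),              # attach to the already running process
        "-batch",                    # run the commands below, then detach and exit
        "-ex", "set pagination off",
        "-ex", "info threads",
        "-ex", "thread apply all bt",
    ]


def dump_backtraces(pid: int, out_path: str) -> int:
    """Write all thread backtraces of `pid` to `out_path`; return gdb's exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            build_gdb_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode


# Example (hypothetical PID and output path):
#   dump_backtraces(759, "/tmp/ekf_node_backtrace.txt")
```

Many frames in the trace above are shown as `??`; having debug symbols for the involved libraries installed before capturing would likely make the output more useful.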

fujitatomoya commented 1 year ago

Sorry, there is not much I can help with without a reproducible environment...

I'm not 100% sure that it is caused by the FastDDS middleware or its SHM transport

To be sure about this, how about disabling the SHM transport via a configuration file? If that works without any problem, the SHM transport could be what produces the issue. See https://fast-dds.docs.eprosima.com/en/latest/fastdds/transport/shared_memory/shared_memory.html and https://fast-dds.docs.eprosima.com/en/latest/fastdds/xml_configuration/transports.html. I do not think you need to change any application code.
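As an illustration of the suggestion above, a Fast DDS XML profile along the following lines restricts a participant to UDPv4 and turns off the builtin transports, which include SHM. This is a sketch under assumptions rather than a verified fix for this issue: the transport id `udp_only_transport` and the profile name `udp_only_participant` are arbitrary names chosen for the example.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <transport_descriptors>
        <!-- A user-defined UDPv4 transport; the id is an arbitrary example name -->
        <transport_descriptor>
            <transport_id>udp_only_transport</transport_id>
            <type>UDPv4</type>
        </transport_descriptor>
    </transport_descriptors>

    <participant profile_name="udp_only_participant" is_default_profile="true">
        <rtps>
            <userTransports>
                <transport_id>udp_only_transport</transport_id>
            </userTransports>
            <!-- Disable the builtin transports so that SHM is not used -->
            <useBuiltinTransports>false</useBuiltinTransports>
        </rtps>
    </participant>
</profiles>
```

The file is typically picked up by pointing the `FASTRTPS_DEFAULT_PROFILES_FILE` environment variable at it before starting the node, so no application code needs to change.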

ksuszka commented 11 months ago

FYI, we had numerous issues with FastDDS and we switched back to CycloneDDS. No more hang-ups, silent errors, or messages that stop being delivered.

CycloneDDS+Iceoryx is far from being perfect, but in our case, after learning a few of its quirks, it seems to be much more reliable.

I'm leaving this issue open, as I think it's still a problem with FastDDS, but at the moment we don't plan to investigate it any further.

fujitatomoya commented 11 months ago

@ksuszka thanks for sharing the experience.

Just FYI, several patches related to the SHM transport and data sharing are staged for the next Humble patch release. (https://github.com/ros2/ros2/pull/1484/files#diff-0b86fcc230a228fb210653f2069d07ee0ab117da02c6471640ae12327835ff4fL37-R37)