Closed Aposhian closed 2 years ago
I also confirmed that the same behavior occurs on:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==630514==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7f0201855638 bp 0x60800000b9a0 sp 0x7fff1a16a9a0 T0)
==630514==The signal is caused by a READ memory access.
==630514==Hint: this fault was caused by a dereference of a high value address (see register values below). Dissassemble the provided pc to learn which register was used.
#0 0x7f0201855638 in lease_renew (/opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0+0x88638)
#1 0x7f02018693f1 (/opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0+0x9c3f1)
#2 0x7f020188c3da in dds_write_impl (/opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0+0xbf3da)
#3 0x7f020188c5f1 in dds_write (/opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0+0xbf5f1)
#4 0x7f0202580b71 in rmw_publish (/opt/ros/rolling/lib/librmw_cyclonedds_cpp.so+0x1fb71)
#5 0x7f02058aa820 in rcl_publish (/opt/ros/rolling/lib/librcl.so+0x1e820)
#6 0x56083b8deb51 in rclcpp::Publisher<std_msgs::msg::Bool_<std::allocator<void> >, std::allocator<void> >::do_inter_process_publish(std_msgs::msg::Bool_<std::allocator<void> > const&) /opt/ros/rolling/include/rclcpp/publisher.hpp:453
#7 0x56083b8deb51 in std::enable_if<rosidl_generator_traits::is_message<std_msgs::msg::Bool_<std::allocator<void> > >::value&&std::is_same<std_msgs::msg::Bool_<std::allocator<void> >, std_msgs::msg::Bool_<std::allocator<void> > >::value, void>::type rclcpp::Publisher<std_msgs::msg::Bool_<std::allocator<void> >, std::allocator<void> >::publish<std_msgs::msg::Bool_<std::allocator<void> > >(std_msgs::msg::Bool_<std::allocator<void> > const&) /opt/ros/rolling/include/rclcpp/publisher.hpp:297
#8 0x56083b8deb51 in main /home/redacted/repro_lease_renew_segfault/lease_renew_segfault/src/main.cpp:12
#9 0x7f020544e0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x240b2)
#10 0x56083b8df52d in _start (/home/redacted/repro_lease_renew_segfault/build/lease_renew_segfault/lease_renew_segfault+0xf52d)
AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV (/opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0+0x88638) in lease_renew ==630514==ABORTING ``
I also confirmed that this behavior does not occur if I use FastRTPS by setting:
export RMW_IMPLEMENTATION=rmw_fastrtps_cpp
Thanks for the easy-to-reproduce example, that really helps.
ros-rolling-cyclonedds 0.8.0
ros-rolling-rmw-cyclone-dds 1.1.2
ros-rolling-rclcpp 15.0.0
I'll note that these are old versions of packages on Rolling; that's because we stopped updating Rolling on Ubuntu 20.04.
That said, I tried out this example with the latest sources on Ubuntu 22.04, and the crash is still there. Given the stack trace from asan, it looks like this is probably somewhere down in CycloneDDS proper, though it could be the interaction between rmw_cyclonedds_cpp and CycloneDDS. Either way, pinging @eboasson for thoughts.
Ouch. Definitely in Cyclone DDS proper. (This is almost always the case if the crash is not related to user data in a read/write-like operation (or more generally pointers in input parameters), because it is pretty well hardened against abuse of entity handles.)
@dpotman I won't have time to look at this the next couple of days, but perhaps you could have a first look at it? Thanks!
@dpotman I won't have time to look at this the next couple of days, but perhaps you could have a first look at it? Thanks!
Sure, I'll look into it
Cyclone assumes that the lease duration is not infinite (the default) in case the manual-by-topic liveliness policy is used, and the problem that causes the segfault is that it does not correctly check this when renewing the writer lease. I'll create a PR to fix this in Cyclone.
A workaround is to add a (non-infinite) lease duration in the QoS object used when creating the publisher:
rmw_time_t duration{10, 0};
auto pub = node->create_publisher<std_msgs::msg::Bool>("test_topic",
rclcpp::QoS(10)
.liveliness(rclcpp::LivelinessPolicy::ManualByTopic)
.liveliness_lease_duration(duration)
);
@clalancette just in case you didn't notice, the fix for this crash is in cyclone 0.9.1 released a few days ago.
If updating to this version still fits with the Humble release schedule you may want to update, otherwise, I don't think there's any harm in leaving it at the 0.9.0 release. The only change that is really relevant to ROS is for this issue's case. I would say it only fixes an accidental misconfiguration as a manual-by-topic liveliness policy is infinite lease is effectively exactly the same as an automatic one with an infinite lease duration. The only observable difference has to do with the QoS matching rules, but basing a system on QoS mismatching is quite evil!
If updating to this version still fits with the Humble release schedule you may want to update, otherwise, I don't think there's any harm in leaving it at the 0.9.0 release.
With the release being just a few days away, I think we'll stick with 0.9.0 for the initial release.
However, it does seem like this is a bugfix we want, so what I'm going to do here is to open an issue for releasing 0.9.1 into Humble for Patch Release 1.
Bug report
Required Info:
Steps to reproduce issue
With this small example ROS2 program I get inconsistent segfaults coming from
lease_renew
Here is this program packaged into a full package for easy building and running: https://github.com/Aposhian/repro_lease_renew_segfault
Expected behavior
Publish 10
Bool
messages and then exit.Actual behavior
Exits with -11 sometimes. When run with
I get
Additional information
In this example, I am not calling
assert_liveliness
, but this behavior comes up either way