ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0
108 stars 89 forks source link

segfault in lease_renew on publish with manual by topic liveliness qos #390

Closed Aposhian closed 2 years ago

Aposhian commented 2 years ago

Bug report

Required Info:

Steps to reproduce issue

With this small example ROS2 program I get inconsistent segfaults coming from lease_renew

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/bool.hpp>

int main(int argc, char const * argv[]) {
    rclcpp::init(argc, argv);
    auto node = std::make_shared<rclcpp::Node>("node");
    auto pub = node->create_publisher<std_msgs::msg::Bool>("test_topic",
        rclcpp::QoS(10)
        .liveliness(rclcpp::LivelinessPolicy::ManualByTopic)
    );
    for (size_t i = 0u; i < 10u; ++i) {
        pub->publish(std::move(std_msgs::msg::Bool()));
        rclcpp::spin_some(node);
    }
    rclcpp::shutdown();
    return 0;
}

Here is this program packaged into a full package for easy building and running: https://github.com/Aposhian/repro_lease_renew_segfault

Expected behavior

Publish 10 Bool messages and then exit.

Actual behavior

Exits with -11 sometimes. When run with

LD_PRELOAD=/lib/x86_64-linux-gnu/libasan.so.6

I get

AddressSanitizer:DEADLYSIGNAL
=================================================================
==604209==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x7ff2ce655638 bp 0x60800000bfa0 sp 0x7ffda68739f0 T0)
==604209==The signal is caused by a READ memory access.
==604209==Hint: this fault was caused by a dereference of a high value address (see register values below).  Dissassemble the provided pc to learn which register was used.
    #0 0x7ff2ce655638 in lease_renew (/opt/ros/galactic/lib/x86_64-linux-gnu/libddsc.so.0+0x88638)
    #1 0x7ff2ce6693f1  (/opt/ros/galactic/lib/x86_64-linux-gnu/libddsc.so.0+0x9c3f1)
    #2 0x7ff2ce68c3da in dds_write_impl (/opt/ros/galactic/lib/x86_64-linux-gnu/libddsc.so.0+0xbf3da)
    #3 0x7ff2ce68c5f1 in dds_write (/opt/ros/galactic/lib/x86_64-linux-gnu/libddsc.so.0+0xbf5f1)
    #4 0x7ff2cf326da3 in rmw_publish (/opt/ros/galactic/lib/librmw_cyclonedds_cpp.so+0x1eda3)
    #5 0x7ff2d264b7d0 in rcl_publish (/opt/ros/galactic/lib/librcl.so+0x1e7d0)
    #6 0x5574f4b9f990 in rclcpp::Publisher<std_msgs::msg::Bool_<std::allocator<void> >, std::allocator<void> >::do_inter_process_publish(std_msgs::msg::Bool_<std::allocator<void> > const&) /opt/ros/galactic/include/rclcpp/publisher.hpp:288
    #7 0x5574f4b9f990 in rclcpp::Publisher<std_msgs::msg::Bool_<std::allocator<void> >, std::allocator<void> >::publish(std_msgs::msg::Bool_<std::allocator<void> > const&) /opt/ros/galactic/include/rclcpp/publisher.hpp:216
    #8 0x5574f4b9f990 in main /home/redacted/sandbox/lease_renew_segfault/lease_renew_segfault/src/main.cpp:12
    #9 0x7ff2d21ef0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x240b2)
    #10 0x5574f4b9ff3d in _start (/home/redacted/sandbox/lease_renew_segfault/build/lease_renew_segfault/lease_renew_segfault+0xdf3d)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/opt/ros/galactic/lib/x86_64-linux-gnu/libddsc.so.0+0x88638) in lease_renew
==604209==ABORTING

Additional information

In this example, I am not calling assert_liveliness, but this behavior comes up either way

Aposhian commented 2 years ago

I also confirmed that the same behavior occurs on:

AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV (/opt/ros/rolling/lib/x86_64-linux-gnu/libddsc.so.0+0x88638) in lease_renew ==630514==ABORTING ``

Aposhian commented 2 years ago

I also confirmed that this behavior does not occur if I use FastRTPS by setting:

export RMW_IMPLEMENTATION=rmw_fastrtps_cpp
clalancette commented 2 years ago

Thanks for the easy-to-reproduce example, that really helps.

  • ros-rolling-cyclonedds 0.8.0

    • ros-rolling-rmw-cyclone-dds 1.1.2

    • ros-rolling-rclcpp 15.0.0

I'll note that these are old versions of packages on Rolling; that's because we stopped updating Rolling on Ubuntu 20.04.

That said, I tried out this example with the latest sources on Ubuntu 22.04, and the crash is still there. Given the stack trace from asan, it looks like this is probably somewhere down in CycloneDDS proper, though it could be the interaction between rmw_cyclonedds_cpp and CycloneDDS. Either way, pinging @eboasson for thoughts.

eboasson commented 2 years ago

Ouch. Definitely in Cyclone DDS proper. (This is almost always the case if the crash is not related to user data in a read/write-like operation (or more generally pointers in input parameters), because it is pretty well hardened against abuse of entity handles.)

@dpotman I won't have time to look at this the next couple of days, but perhaps you could have a first look at it? Thanks!

dpotman commented 2 years ago

@dpotman I won't have time to look at this the next couple of days, but perhaps you could have a first look at it? Thanks!

Sure, I'll look into it

dpotman commented 2 years ago

Cyclone assumes that the lease duration is not infinite (the default) in case the manual-by-topic liveliness policy is used, and the problem that causes the segfault is that it does not correctly check this when renewing the writer lease. I'll create a PR to fix this in Cyclone.

A workaround is to add a (non-infinite) lease duration in the QoS object used when creating the publisher:

    rmw_time_t duration{10, 0};
    auto pub = node->create_publisher<std_msgs::msg::Bool>("test_topic",
        rclcpp::QoS(10)
        .liveliness(rclcpp::LivelinessPolicy::ManualByTopic)
        .liveliness_lease_duration(duration)
    );
eboasson commented 2 years ago

@clalancette just in case you didn't notice, the fix for this crash is in cyclone 0.9.1 released a few days ago.

If updating to this version still fits with the Humble release schedule you may want to update, otherwise, I don't think there's any harm in leaving it at the 0.9.0 release. The only change that is really relevant to ROS is for this issue's case. I would say it only fixes an accidental misconfiguration as a manual-by-topic liveliness policy is infinite lease is effectively exactly the same as an automatic one with an infinite lease duration. The only observable difference has to do with the QoS matching rules, but basing a system on QoS mismatching is quite evil!

clalancette commented 2 years ago

If updating to this version still fits with the Humble release schedule you may want to update, otherwise, I don't think there's any harm in leaving it at the 0.9.0 release.

With the release being just a few days away, I think we'll stick with 0.9.0 for the initial release.

However, it does seem like this is a bugfix we want, so what I'm going to do here is to open an issue for releasing 0.9.1 into Humble for Patch Release 1.