kas-lab / suave

An Exemplar for Self-Adaptive Underwater Vehicles performing pipeline inspection
https://kas-lab.github.io/suave/
Apache License 2.0

ROS2 Daemon Segfault #136

Closed azdle closed 1 year ago

azdle commented 1 year ago

Hi,

I work for a company called Auxon that makes trace-based testing and verification tools, primarily for cyber-physical systems. We're trying to put together a demonstration based on suave. However, we're seeing some segfaults that our ROS tooling seems to make worse, but that seem to already exist without it. I'm wondering if you already know anything about these problems.

For reference, this is suave run straight from this repo at commit beb4712, without any of our tooling, inside Docker. I build and start the container with ./build_docker_images.sh && docker run -it --shm-size=512m -p 6901:6901 -e VNC_PW=password --security-opt seccomp=unconfined -v ~/suave_results:/home/kasm-user/suave/results suave:dev and then run rr ./example_run.sh inside the container. The only change I've made is installing gdb and rr so that I can debug the system.

When I do a run, it seems to work as intended: the vehicle finds the pipe and follows it until the mission times out. However, I believe the ROS daemon itself has segfaulted:

PID PPID    EXIT    CMD
562472  562467  -11 /usr/bin/python3 -c from ros2cli.daemon.daemonize import main; main() --name ros2-daemon --ros-domain-id 0 --rmw-implementation rmw_fastrtps_cpp
Thread 1 received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(rr) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ff0147297d9 in _ZNSt13random_device7_M_initERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE (this=0x7ffc53864000, token=0x7ffc53863fe0) at ./src/preload/overrides.c:233
#2  0x00007ff010276c66 in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#3  0x00007ff00ff10531 in ?? () from /opt/ros/humble/lib/libfastrtps.so.2.6
#4  0x00007ff00ff11b2c in eprosima::fastrtps::rtps::RTPSDomainImpl::create_participant_guid(int&, eprosima::fastrtps::rtps::GUID_t&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#5  0x00007ff00ff77e2d in eprosima::fastdds::dds::DomainParticipantImpl::DomainParticipantImpl(eprosima::fastdds::dds::DomainParticipant*, unsigned int, eprosima::fastdds::dds::DomainParticipantQos const&, eprosima::fastdds::dds::DomainParticipantListener*) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#6  0x00007ff00ff732c0 in eprosima::fastdds::dds::DomainParticipantFactory::create_participant(unsigned int, eprosima::fastdds::dds::DomainParticipantQos const&, eprosima::fastdds::dds::DomainParticipantListener*, eprosima::fastdds::dds::StatusMask const&) () from /opt/ros/humble/lib/libfastrtps.so.2.6
#7  0x00007ff0104719f3 in ?? () from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#8  0x00007ff01047da7d in rmw_fastrtps_shared_cpp::create_participant(char const*, unsigned long, rmw_security_options_s const*, bool, char const*, rmw_dds_common::Context*) ()
   from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#9  0x00007ff0104cfaa3 in ?? () from /opt/ros/humble/lib/librmw_fastrtps_cpp.so
#10 0x00007ff0104d8120 in rmw_create_node () from /opt/ros/humble/lib/librmw_fastrtps_cpp.so
#11 0x00007ff01331684a in rcl_node_init () from /opt/ros/humble/lib/librcl.so
#12 0x00007ff013999117 in ?? () from /opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/_rclpy_pybind11.cpython-310-x86_64-linux-gnu.so
<snip>
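
As a side note on reading the process listing above: the -11 in the EXIT column is consistent with the SIGSEGV that the debugger reports, assuming the listing follows the common convention of showing "killed by signal N" as -N (the same convention Python's subprocess module uses for returncode). A minimal sketch of that decoding, where describe_exit is a hypothetical helper name and not part of rr, ROS, or suave:

```python
import signal


def describe_exit(code: int) -> str:
    """Describe an exit status, assuming negative values mean 'killed by signal -code'."""
    if code < 0:
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {-code}"
    return f"exited with status {code}"


# The daemon's EXIT value from the listing above.
print(describe_exit(-11))
```

On Linux, signal 11 is SIGSEGV, so this matches the "Thread 1 received signal SIGSEGV" line in the debugger output.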

When we add our instrumentation for tracing ROS topic publishers and subscribers, we see much more variable segfaults inside the instrumentation, in many different places and processes. I haven't been able to nail down an exact cause for those yet. For now I'm assuming that this segfault may be related, and since it is much more consistent, I'm trying to diagnose it first.

I'm going to continue trying to get more information out of this system about what is going on, but I'm pretty new to ROS so I just wanted to see if there's anything that stands out to you as obvious for what could be wrong.

Also, if you happen to know off hand:

Those are things that are in my list of things that I'm going to try to figure out at some point that have stumped me in the brief attempts I've made so far.

Thanks for your time, I'd love to hear any thoughts you have on this.

azdle commented 1 year ago

This is a bug in rr and seems to have nothing to do with suave or ROS. Sorry for the noise.

Frame #1 of the backtrace, 0x00007ff0147297d9 in _ZNSt13random_device7_M_initERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE (this=0x7ffc53864000, token=0x7ffc53863fe0) at ./src/preload/overrides.c:233, is the mangled name of std::random_device::_M_init, and the source location is in rr's own code rather than in suave or ROS: https://github.com/rr-debugger/rr/blob/5.5.0/src/preload/overrides.c#L233