ros / diagnostics

Packages related to gathering, viewing, and analyzing diagnostics data from robots.
https://index.ros.org/p/diagnostics/
Other
102 stars 178 forks source link

Updater raises std::bad_alloc sporadically on arm64 with O3 optimizations #119

Open Schwo0ps opened 5 years ago

Schwo0ps commented 5 years ago

The issue is a complicated one, but here goes.

I first noticed this issue when I saw that diagnostics messages from arm64 machines sometimes arrive, but only infrequently (between 0% and 10% of the time), and eventually the following message comes from the Diagnostic Aggregator running on amd64 and it appears all messages are dropped.

[ERROR] [1564529964.093868659 /diag_agg] [/tmp/binarydeb/ros-kinetic-roscpp-1.12.14/src/libros/transport_publisher_link.cpp:TransportPublisherLink::onMessageLength:175]: a message of over a gigabyte was predicted in tcpros. that seems highly unlikely, so I'll assume protocol synchronization is lost.

At first, I though it was Endianness, but all machines are Little Endian. There are also C++ nodes which are able to communicate with each other properly on all machines. It also appears as though this only happens with the Diagnostic Updater, not any other topic.

After this, I started running a test: just running roscore and a single test node with the following code.

#include <ros/ros.h>
#include <diagnostic_updater/diagnostic_updater.h>
#include <diagnostic_updater/update_functions.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "updater_node");
  double rate = 10.;
  diagnostic_updater::Updater updater;
  diagnostic_updater::FrequencyStatus frequency_status(
    diagnostic_updater::FrequencyStatusParam(&rate, &rate)
  );
  updater.setHardwareID("none");
  updater.add(frequency_status);

  ros::Rate r(rate);
  while (ros::ok())
  {
    frequency_status.tick();
    updater.update();
    ros::spinOnce();
    r.sleep();
  }
}

Everything works fine with amd64, but on arm64, the above issues happen. Additionally, the node just eventually crashes with std::bad_alloc. Here is the backtrace and relevant message that was published (as it looks like the error was with serialization)

#13 0x0000000000419084 in diagnostic_updater::Updater::publish (this=this@entry=0x7fffffecc0, status_vec=std::vector of length 1, capacity 1 = {...})
    at /opt/ros/kinetic/include/diagnostic_updater/diagnostic_updater.h:547
547         publisher_.publish(msg);
(gdb) list
542             node_name_.substr(1) + std::string(": ") + iter->name;
543         }
544         diagnostic_msgs::DiagnosticArray msg;
545         msg.status = status_vec;
546         msg.header.stamp = ros::Time::now(); // Add timestamp for ROS 0.10
547         publisher_.publish(msg);
548       }
549 
550       /**
551        * Publishes on /diagnostics and reads the diagnostic_period parameter.
(gdb) p msg
$7 = {header = {seq = 0, stamp = {<ros::TimeBase<ros::Time, ros::Duration>> = {sec = 1564612119, nsec = 323701618}, <No data fields>}, frame_id = ""}, status = std::vector of length 1, capacity 1 = {{level = 0 '\000', name = "updater_node_1564612117153179612: Frequency Status", message = "Desired frequency met", hardware_id = "none", 
      values = std::vector of length 7, capacity 7 = {{key = "Events in window", value = "22"}, {key = "Events since startup", value = "22"}, {key = "Duration of window (s)", value = "2.100290"}, {key = "Actual frequency (Hz)", value = "10.474742"}, {key = "Target frequency (Hz)", value = "10.000000"}, {
          key = "Minimum acceptable frequency (Hz)", value = "9.000000"}, {key = "Maximum acceptable frequency (Hz)", value = "11.000000"}}}}}
(gdb) bt
#0  0x0000007fb7a04528 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x0000007fb7a059e0 in __GI_abort () at abort.c:89
#2  0x0000007fb7bde254 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#3  0x0000007fb7bdbdc4 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#4  0x0000007fb7bdbe10 in std::terminate() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#5  0x0000007fb7bdc0d4 in __cxa_throw () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#6  0x0000007fb7bdc6d8 in operator new(unsigned long) () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#7  0x000000000040f840 in ros::serialization::serializeMessage<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > > (message=...) at /opt/ros/kinetic/include/ros/serialization.h:795
#8  0x000000000040d60c in boost::_bi::list1<boost::reference_wrapper<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const> >::operator()<ros::SerializedMessage, ros::SerializedMessage (*)(diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const&), boost::_bi::list0> (f=<optimized out>, a=<synthetic pointer>..., 
    this=<optimized out>) at /usr/include/boost/function/function_template.hpp:129
#9  boost::_bi::bind_t<ros::SerializedMessage, ros::SerializedMessage (*)(diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const&), boost::_bi::list1<boost::reference_wrapper<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const> > >::operator() (this=<optimized out>) at /usr/include/boost/bind/bind.hpp:893
#10 boost::detail::function::function_obj_invoker0<boost::_bi::bind_t<ros::SerializedMessage, ros::SerializedMessage (*)(diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const&), boost::_bi::list1<boost::reference_wrapper<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const> > >, ros::SerializedMessage>::invoke (
    function_obj_ptr=...) at /usr/include/boost/function/function_template.hpp:138
#11 0x0000007fb7efc8b0 in ros::TopicManager::publish(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<ros::SerializedMessage ()> const&, ros::SerializedMessage&) () from /opt/ros/kinetic/lib/libroscpp.so
#12 0x000000000040fe14 in ros::Publisher::publish<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > > (this=this@entry=0x7fffffee88, message=...) at /usr/include/c++/8/new:169
#13 0x0000000000419084 in diagnostic_updater::Updater::publish (this=this@entry=0x7fffffecc0, status_vec=std::vector of length 1, capacity 1 = {...}) at /opt/ros/kinetic/include/diagnostic_updater/diagnostic_updater.h:547
#14 0x000000000040c5ac in diagnostic_updater::Updater::force_update (this=0x7fffffecc0) at /opt/ros/kinetic/include/diagnostic_updater/diagnostic_updater.h:440
#15 diagnostic_updater::Updater::update (this=0x7fffffecc0) at /opt/ros/kinetic/include/diagnostic_updater/diagnostic_updater.h:390
#16 main (argc=<optimized out>, argv=<optimized out>) at /ws/src/test/src/updater_node.cpp:20

It crashes after ~30 seconds but seems to do so more quickly if multiple of the nodes are running.

As specified in the title, this only happens with GCC optimization level 3 (compiling with -O3 or CMAKE_BUILD_TYPE=Release). I tried with -O2 and it appears to work fine.

I doubt this can be easily fixed and I'm not 100% sure if the issue is within this repo or ros_comm, but any ideas would be greatly appreciated. We really would like to use -O3 throughout our code for improved performance.

Ecophagy commented 2 years ago

Bit of a necro, but this looks like a gcc bug, the same as https://github.com/ros/ros_comm/issues/2197 & https://github.com/ros/roscpp_core/issues/130

If you need to use -O3 then I think your only option is to update your version of gcc.

peci1 commented 1 year ago

PR https://github.com/ros/roscpp_core/pull/136 should fix this. I'm looking for someone who could verify. Just please notice that Focal now has GCC 9.4 by default where I could not reproduce the issue. So the test would need to be done with GCC 9.3 installed explicitly.