ros / ros_comm

ROS communications-related packages, including core client libraries (roscpp, rospy, roslisp) and graph introspection tools (rostopic, rosnode, rosservice, rosparam).
http://wiki.ros.org/ros_comm
752 stars 911 forks source link

Topic statistics do not handle well ROS time jumping back - denial of service #2249

Open peci1 opened 2 years ago

peci1 commented 2 years ago

How I found it

I have a bag file which I replay from the beginning a few times in succession, and one of my nodes processing IMU messages stopped being able to process messages for no obvious reason. In the end, it was processing e.g. at 0.1 Hz while the messages are coming at 100 Hz. Restarting the node fixes the issue temporarily.

Analysis

I accidentaly enabled topic statistics by setting /enable_statistics ROS param to true. With topic statistics enabled, this bug manifested.

https://github.com/ros/ros_comm/blob/dd78ac8af128bb8eb992d6431bb9f994658ea6ab/clients/roscpp/src/libros/statistics.cpp#L118

stats.last_publish is updated with the last time when a message was published to /statistics. If then a bag file starts playing again, the check referenced above will for long not be true (actually never again if you're playing the same bag).

There is a list in stats.arrival_time_list that is cleared only when a stats message is published:

https://github.com/ros/ros_comm/blob/dd78ac8af128bb8eb992d6431bb9f994658ea6ab/clients/roscpp/src/libros/statistics.cpp#L252

So this list is growing without bounds when the bag is replayed. Each received message adds a record to this list. With IMU messages @ 100 Hz, this quickly comes to the order of 1e5 and even higher orders.

Finally, this growing list is being copied twice in each call of the callback() function:

https://github.com/ros/ros_comm/blob/dd78ac8af128bb8eb992d6431bb9f994658ea6ab/clients/roscpp/src/libros/statistics.cpp#L84

and

https://github.com/ros/ros_comm/blob/dd78ac8af128bb8eb992d6431bb9f994658ea6ab/clients/roscpp/src/libros/statistics.cpp#L258

These copies are what is slowing the subscriber down to unacceptable slowness after a few bag replays. Whenever I pause the node in GDB, PollManager is caught copying this list:

std::list<ros::Time>::_M_assign_dispatch<std::_List_const_iterator<ros::Time> > list.tcc:313
std::list<ros::Time>::operator= list.tcc:285
ros::StatisticsLogger::StatData::operator= statistics.h:91
ros::StatisticsLogger::callback 0x00007f98cbce233a
ros::Subscription::handleMessage 0x00007f98cbd38e19
ros::TransportPublisherLink::handleMessage 0x00007f98cbd1881d
ros::TransportPublisherLink::onMessage 0x00007f98cbd1894c
boost::function4<void, const boost::shared_ptr<ros::Connection> &, const boost::shared_array<unsigned char> &, unsigned int, bool>::operator() function_template.hpp:759
ros::Connection::readTransport 0x00007f98cbc9a24c
boost::function1<void, const boost::shared_ptr<ros::Transport> &>::operator() function_template.hpp:759
ros::TransportTCP::socketUpdate 0x00007f98cbd1343d
boost::function1<void, int>::operator() function_template.hpp:759
ros::PollSet::update 0x00007f98cbd49d43
ros::PollManager::threadFunc 0x00007f98cbcd5365
<unknown> 0x00007f98c4411bcd
start_thread 0x00007f98c50076db
clone 0x00007f98c8d2161f

I'll sketch a PR fixing this by detecting ROS time jumping back and resetting the timestamp check.

peci1 commented 2 years ago

Fix proposed in https://github.com/ros/ros_comm/pull/2250 .