ros2 / rmw_dps

Implementation of the ROS Middleware (rmw) Interface using Intel's Distributed Publish & Subscribe.
Apache License 2.0

Memory leakage when publishing big messages at high frequencies #34

Closed mauropasse closed 5 years ago

mauropasse commented 5 years ago

I've noticed that when publishing big messages at high rates, there's a constant increase in RSS (and virtual) memory, even though the subscriber receives all the messages.

To reproduce the situation, clone the ros2 examples repo and apply the attached patch: patch.log

This patch makes publisher_lambda publish high-frequency messages of 64.5 KB (close to the maximum UDP datagram size, the largest message rmw_dps can send), while subscriber_lambda prints the messages and checks whether any are lost (subscriber_lambda should be run before publisher_lambda to properly detect lost messages).
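
For reference, the patched publisher ends up looking roughly like this (a simplified sketch, not the exact contents of patch.log; the 5 ms timer period is just one example of a high publish rate):

#include <chrono>
#include <memory>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

using namespace std::chrono_literals;

class MinimalPublisher : public rclcpp::Node
{
public:
  MinimalPublisher() : Node("minimal_publisher"), count_(0)
  {
    publisher_ = this->create_publisher<std_msgs::msg::String>("topic", 10);
    // Publish a large message at a high rate.
    auto publish_message = [this]() -> void {
      auto message = std_msgs::msg::String();
      // ~64.5 KB payload, just under the maximum UDP datagram size.
      message.data = std::to_string(count_++) + std::string(64500, 'x');
      publisher_->publish(message);
    };
    timer_ = this->create_wall_timer(5ms, publish_message);
  }

private:
  rclcpp::Publisher<std_msgs::msg::String>::SharedPtr publisher_;
  rclcpp::TimerBase::SharedPtr timer_;
  size_t count_;
};

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<MinimalPublisher>());
  rclcpp::shutdown();
  return 0;
}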

mkdir -p ~/mem_test_ws/src && cd $_
git clone https://github.com/ros2/examples.git
cd examples
git apply ~/patch.log
cd ~/mem_test_ws
colcon build

#Terminal 1: 
export RMW_IMPLEMENTATION=rmw_dps_cpp
cd ~/mem_test_ws/install/examples_rclcpp_minimal_subscriber/lib/examples_rclcpp_minimal_subscriber/
./subscriber_lambda

#Terminal 2:
top | grep publisher

#Terminal 3:
export RMW_IMPLEMENTATION=rmw_dps_cpp
cd ~/mem_test_ws/install/examples_rclcpp_minimal_publisher/lib/examples_rclcpp_minimal_publisher/
./publisher_lambda

The output of top in terminal 2 shows an increase in memory:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9032 root      20   0  179440  19792  11696 S  17.6  0.0   0:00.53 publisher_lambd                                   
 9032 root      20   0  197716  38472  11696 S  49.7  0.1   0:02.02 publisher_lambd                                   
 9032 root      20   0  217880  58952  11696 R  48.8  0.1   0:03.49 publisher_lambd                                   
 9032 root      20   0  238300  79688  11696 S  47.7  0.2   0:04.92 publisher_lambd                                   
 9032 root      20   0  257584  99280  11696 R  48.3  0.2   0:06.37 publisher_lambd                                   
 9032 root      20   0  277120 119124  11696 S  46.8  0.2   0:07.78 publisher_lambd                                   
 9032 root      20   0  296276 138592  11696 S  47.7  0.3   0:09.21 publisher_lambd                                   
 9032 root      20   0  315180 157808  11696 S  47.0  0.3   0:10.62 publisher_lambd                                   
 9032 root      20   0  335472 178408  11696 S  48.5  0.4   0:12.08 publisher_lambd                                   
 9032 root      20   0  354128 197368  11696 S  47.3  0.4   0:13.50 publisher_lambd                                   
 9032 root      20   0  373788 217348  11696 S  48.2  0.4   0:14.95 publisher_lambd                                   
 9032 root      20   0  392692 236568  11696 S  50.7  0.5   0:16.47 publisher_lambd                                   
 9032 root      20   0  412480 256680  11696 S  54.3  0.5   0:18.10 publisher_lambd                                   
 9032 root      20   0  432016 276536  11696 S  49.0  0.6   0:19.57 publisher_lambd                                   
 9032 root      20   0  450292 295112  11696 S  49.2  0.6   0:21.05 publisher_lambd                                   
 9032 root      20   0  468568 313692  11696 S  52.7  0.6   0:22.63 publisher_lambd                                   
 9032 root      20   0  486212 331640  11696 S  50.8  0.7   0:24.16 publisher_lambd                                   
 9032 root      20   0  505872 351620  11696 S  52.3  0.7   0:25.73 publisher_lambd                                   
 9032 root      20   0  524524 370584  11696 S  50.0  0.7   0:27.23 publisher_lambd                                   
 9032 root      20   0  544060 390440  11696 S  52.5  0.8   0:28.81 publisher_lambd                                   
 9032 root      20   0  564228 410936  11696 S  52.0  0.8   0:30.37 publisher_lambd                                   
 9032 root      20   0  583068 429896  11696 S  52.3  0.9   0:31.94 publisher_lambd                                   
 9032 root      20   0  602984 450312  11696 S  52.0  0.9   0:33.50 publisher_lambd                                   
 9032 root      20   0  622264 469900  11696 S  50.8  1.0   0:35.03 publisher_lambd                                   
 9032 root      20   0  642180 490132  11696 S  52.0  1.0   0:36.59 publisher_lambd                                   
 9032 root      20   0  660204 508460  11696 S  50.0  1.0   0:38.09 publisher_lambd                                   
 9032 root      20   0  680620 529204  11696 S  50.8  1.1   0:39.62 publisher_lambd                                   
 9032 root      20   0  699276 548172  11696 S  52.0  1.1   0:41.18 publisher_lambd                                   
 9032 root      20   0  718808 568028  11696 S  52.0  1.1   0:42.74 publisher_lambd                                   
 9032 root      20   0  737588 587116  11696 S  50.5  1.2   0:44.26 publisher_lambd                                   
 9032 root      20   0  757376 607216  11696 S  48.3  1.2   0:45.71 publisher_lambd                                   
 9032 root      20   0  776280 626428  11696 S  47.3  1.3   0:47.13 publisher_lambd                                   
 9032 root      20   0  794936 645396  11696 S  47.7  1.3   0:48.56 publisher_lambd                                   
 9032 root      20   0  813336 664100  11696 S  47.2  1.3   0:49.98 publisher_lambd                                   
 9032 root      20   0  831736 682808  11696 S  48.0  1.4   0:51.42 publisher_lambd                                   
 9032 root      20   0  851400 702780  11696 S  47.5  1.4   0:52.85 publisher_lambd                                   
 9032 root      20   0  871688 723280  11696 S  47.7  1.5   0:54.28 publisher_lambd                                   
 9032 root      20   0  890972 743004  11696 S  51.0  1.5   0:55.81 publisher_lambd                                   
 9032 root      20   0  908364 760700  11696 S  49.8  1.5   0:57.31 publisher_lambd                                   
 9032 root      20   0  927900 780560  11696 S  50.7  1.6   0:58.83 publisher_lambd                                   
 9032 root      20   0  948068 801048  11696 S  49.8  1.6   1:00.33 publisher_lambd                                   
 9032 root      20   0  965460 818768  11696 S  46.7  1.7   1:01.73 publisher_lambd                                   
 9032 root      20   0  988776 842568  11696 S  43.7  1.7   1:03.04 publisher_lambd                                   
 9032 root      20   0 1013984 868296  11696 S  44.5  1.8   1:04.38 publisher_lambd                                   
 9032 root      20   0 1036984 891780  11696 S  44.7  1.8   1:05.72 publisher_lambd                                   
 9032 root      20   0 1058788 914044  11696 S  44.2  1.8   1:07.05 publisher_lambd                                   
 9032 root      20   0 1067864 923364  11696 S  46.3  1.9   1:08.44 publisher_lambd                                   

In terminal 1, subscriber_lambda shows all the messages it received from the publisher; none are lost.

When running this test on the RPi2, I can still see the memory increasing at publish frequencies of 62 Hz and above.

Any ideas why this could be happening?

malsbat commented 5 years ago

Thanks for the report, I will investigate.

malsbat commented 5 years ago

@mauropasse, I suspect what is happening is that the DPS node thread is getting starved and the callbacks that free the allocated publish requests are backing up. The follow-up question to answer is why your log shows the publisher using up half the CPU. I'm continuing to investigate...
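
The pattern I suspect, in a stripped-down form (a generic illustration only, not the rmw_dps or DPS source): every publish allocates a request that is only freed when the node thread runs its completion callback, so if that thread cannot keep up the allocations pile up:

#include <chrono>
#include <cstdio>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex mtx;
// Publish requests waiting for their completion ("free") callback to run.
std::queue<std::unique_ptr<std::vector<char>>> completions;
bool done = false;

int main()
{
  // "Node thread": runs the completion callbacks, but slowly (starved of CPU).
  std::thread node_thread([] {
    for (;;) {
      std::unique_ptr<std::vector<char>> req;
      {
        std::lock_guard<std::mutex> lock(mtx);
        if (!completions.empty()) {
          req = std::move(completions.front());
          completions.pop();
        } else if (done) {
          break;
        }
      }
      // req (~64.5 KB) is released here, when the "callback" finishes.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });

  // Publisher: allocates one request per message, far faster than they are freed.
  for (int i = 0; i < 2000; ++i) {
    auto req = std::make_unique<std::vector<char>>(64500);
    std::lock_guard<std::mutex> lock(mtx);
    completions.push(std::move(req));
  }
  {
    std::lock_guard<std::mutex> lock(mtx);
    std::printf("requests still waiting to be freed: %zu\n", completions.size());
    done = true;
  }
  node_thread.join();
  return 0;
}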

malsbat commented 5 years ago

@mauropasse, https://github.com/ros2/rmw_dps/pull/35 should address the memory growth. Please let me know.

mauropasse commented 5 years ago

Thanks for solving it so fast, @malsbat! It works perfectly now. Tested on x86_64 and the RPi, and there are no more memory issues due to high-rate publishing. Closing the issue here.

mauropasse commented 5 years ago

@malsbat, if the node is configured in rmw_node.cpp to only receive multicast messages, the app crashes:

ret = DPS_StartNode(node_impl->node_, DPS_MCAST_PUB_ENABLE_RECV, 0);

./subscriber_lambda 
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  failed to publish message: error not set, at /root/ros2_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:257
Aborted (core dumped)

malsbat commented 5 years ago

@mauropasse, I noticed this late yesterday as well while doing some experiments on the CPU usage of multicast. The issue is in DPS: the publish callback reports an error if there were no destinations to publish to, when it should only report failure if there were destinations and none of the publishes succeeded.
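
Roughly, the status handed to the publish callback should be computed along these lines (an illustrative sketch with made-up names, not the actual DPS source):

#include <cstddef>

enum Status { STATUS_OK, STATUS_FAILURE };

// Illustrative only: fail the publish callback only when sends were attempted
// and none of them succeeded; zero destinations is not an error.
Status PublishCompleteStatus(size_t numDestinations, size_t numSucceeded)
{
  if (numDestinations == 0) {
    // Nothing to send to (e.g. a node started recv-only with no remote
    // subscribers): the publish simply had no destinations.
    return STATUS_OK;
  }
  return (numSucceeded > 0) ? STATUS_OK : STATUS_FAILURE;
}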

I'm going to merge the PR since it addresses the memory issue.

mauropasse commented 5 years ago

Ok, thanks @malsbat

malsbat commented 5 years ago

@mauropasse, a fix for the failing publish is in progress at https://github.com/intel/dps-for-iot/pull/110. Travis seems a bit slow today; I am waiting for the CI builds to pass and then I will merge it.

mauropasse commented 5 years ago

That's great @malsbat! I'll test it soon!

malsbat commented 5 years ago

https://github.com/intel/dps-for-iot/pull/110 has been merged