ros2 / rmw_fastrtps

Implementation of the ROS Middleware (rmw) Interface using eProsima's Fast RTPS.
Apache License 2.0

insufficient performance in the QoS demo using default parameters #202

Closed dirk-thomas closed 6 years ago

dirk-thomas commented 6 years ago

The scope of this ticket is focused on the performance of the QoS demo with default values, between a publisher and subscriber using the same RMW implementation. The cross-vendor results are only mentioned for completeness / context.

To reproduce, run the easiest example from the QoS demo:

Which means:
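
The exact commands aren't captured above, but presumably the default image_tools invocation, along these lines (the `-b` flag publishes the synthetic "burger" image instead of reading a camera):

```bash
# Terminal 1: publish images with the defaults (320x240, 30 Hz)
ros2 run image_tools cam2image -b
# Terminal 2: receive and render the images
ros2 run image_tools showimage
```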

The following results show significant differences in the performance depending on which RMW implementation is chosen on the publisher and subscriber side (collected with the default branches on Ubuntu 16.04 on a Lenovo P50). Only the diagonal highlighted with "quality" colors is of interest for now:

| Sub \ Pub | FastRTPS | Connext | OpenSplice |
|---|---|---|---|
| FastRTPS | (#ffa000) Little stuttering, not smooth | Flawless | Severe increasing lag, multi-seconds within seconds |
| Connext | One second burst, short hang, repeat | (#00ff00) Flawless | One second burst, short hang, repeat |
| OpenSplice | Flawless | Stuttering | (#00ff00) Flawless |

When increasing the image size to -x 640 -y 480 the problems become even more apparent:

| Sub \ Pub | FastRTPS | Connext | OpenSplice |
|---|---|---|---|
| FastRTPS | (#f03c15) Much more stuttering | Hangs for several seconds between bursts | Severe increasing lag, multi-seconds within seconds |
| Connext | One second burst, short hang, repeat | (#00ff00) Flawless | One second burst, short hang, repeat |
| OpenSplice | Flawless | Smooth but significantly reduced framerate (even on publisher side) | (#00ff00) Flawless |

The acceptance criteria to resolve this ticket are:

PS: please don't suggest variations of the QoS parameters in this thread but keep this ticket focused on this very specific case. Other cases can be considered separately.

codebot commented 6 years ago

Another data point: a new ThinkPad X1 (i7-8550U) running Ubuntu 18.04. Qualitative performance is the same as reported by @dirk-thomas :

FastRTPS-FastRTPS

Connext-Connext

mikaelarguedas commented 6 years ago

For people testing this today: if you can afford to rebuild a workspace before testing, please consider using the timestamp_console_logging branch from rcutils (and rclcpp if you build the tests). This will allow us to have timestamps reported in the console output and give more quantitative feedback.

To use it:

Example of resulting output: Publisher

[INFO] [cam2image] [1527186403.826753588]: Publishing image #400
[INFO] [cam2image] [1527186403.860162956]: Publishing image #401
[INFO] [cam2image] [1527186403.893428871]: Publishing image #402
[INFO] [cam2image] [1527186403.926734432]: Publishing image #403
[INFO] [cam2image] [1527186403.960066041]: Publishing image #404
[INFO] [cam2image] [1527186403.993477350]: Publishing image #405

Subscriber:

[INFO] [showimage] [1527186403.829860703]: Received image #400
[INFO] [showimage] [1527186403.862650268]: Received image #401
[INFO] [showimage] [1527186403.895934081]: Received image #402
[INFO] [showimage] [1527186403.929688603]: Received image #403
[INFO] [showimage] [1527186403.963225085]: Received image #404
[INFO] [showimage] [1527186403.996462511]: Received image #405

So I can confidently say that for me Fast-RTPS is smooth and doesn't show latency at 30 Hz (3ms of delay after 400 frames).

codebot commented 6 years ago

Do you see non-trivial jitter? I'm not sure how to quantify that other than gathering a lot of samples and showing the distribution in a log-log plot or something. We could compute first-order statistics (median, std.dev, etc) but those tend to not fully capture the things we worry about, like the worst-case or "fraction of outliers," etc.
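
For illustration, a summary along those lines could look something like this (a sketch in Python, assuming a list of inter-arrival deltas in seconds; `nominal_period` and `outlier_factor` are made-up knobs, not part of any existing tool):

```python
import statistics

def jitter_summary(deltas, nominal_period=1.0 / 30.0, outlier_factor=2.0):
    """Summarize inter-arrival deltas: basic stats plus worst case and outlier fraction."""
    outliers = [d for d in deltas if d > outlier_factor * nominal_period]
    return {
        'median': statistics.median(deltas),
        'stddev': statistics.stdev(deltas),
        'worst_case': max(deltas),
        'outlier_fraction': len(outliers) / len(deltas),
    }
```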

clalancette commented 6 years ago

I tested Fast-RTPS on my Fedora machine as well (all ros2 repos on latest master). The Fast-RTPS -> Fast-RTPS case was also very stuttery, just like what @dirk-thomas mentioned in the initial comment.

I'm having some trouble rebuilding on the timestamp_console_logging branch that @mikaelarguedas mentioned, but I'll try a bit more and report back here once I figure it out.

dhood commented 6 years ago

Seeing the same stuttering with fastrtps on my 16.04 desktop.

These are some screen captures of the performance:

clalancette commented 6 years ago

I finally got the timestamps compiling, and took some data on my machine. First, this is the default of 320x240, and the cam2image process took ~12% of a core and the showimage took ~10% of a core (so I was not CPU bound). Second, once I took the data I post-processed it with a small python script to calculate the delta between successive publish/receives (I can share that script later if needed).
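
(The script itself wasn't posted; a minimal sketch of that kind of post-processing, assuming the log format shown earlier in this thread, could look like this:)

```python
#!/usr/bin/env python3
# Sketch: extract the bracketed epoch timestamps from cam2image/showimage console
# output and print the delta between successive lines, one value per line.
import re
import sys

STAMP_RE = re.compile(r'\[(\d+\.\d+)\]')

stamps = []
with open(sys.argv[1]) as f:
    for line in f:
        m = STAMP_RE.search(line)
        if m:
            stamps.append(float(m.group(1)))

for prev, cur in zip(stamps, stamps[1:]):
    print('%.9f' % (cur - prev))
```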

The cam2image publishing data is here: https://gist.github.com/clalancette/def97b55da9c6c2af52f9a6195b4845d , and it shows about what we expect; we publish an image every ~33ms, for a frame rate of around 30Hz.

The showimage subscribe data is here: https://gist.github.com/clalancette/1d6259b52e6679a96c8f3eafb2ffd9de . It shows that the data is being received in a bursty manner. Looking at line 57-82, for instance, we see that it is receiving data at ~33ms, and then at line 83 there is an 'event' that causes the next frame to take ~5 times longer. That causes a backlog of 5 items which are processed quickly (lines 84-88), at which point things get back to normal.

JaimeMartin commented 6 years ago

We were discussing this issue in our daily meeting. We are now working on some performance improvements, including the rmw layer, so it is perfect timing to analyze this case. We will update you guys soon.

mikaelarguedas commented 6 years ago

Result of the 640x480 test on the 3 rmw implementations with the following setup: Dell XPS15, Ubuntu 16.04, send/receive 400 images, image rendering enabled

ros2 run image_tools cam2image -b -x 640 -y 480

640x480:

| | Min (s) | Max (s) | Mean (s) | Median (s) | Stddev (s) | fps | sender CPU | receiver CPU |
|---|---|---|---|---|---|---|---|---|
| connext | 0.0300738811 | 0.041087389 | 0.0318081552 | 0.0316724777 | 0.0010073057 | 30.2290740409 | 50.00% | 50.00% |
| opensplice | 0.010027647 | 0.051132679 | 0.022188526 | 0.019367218 | 0.0093266182 | 30.2616755364 | 36.00% | 31.00% |
| fastrtps | 0.0020124912 | 0.1774580479 | 0.0352770768 | 0.0078625679 | 0.0464236557 | 30.6302253416 | 22.00% | 33.00% |

Serious stuttering with FastRTPS, very smooth for the other ones. Below is a graph of the latency for each rmw_implementation.

image

wjwwood commented 6 years ago

I experienced the same behavior as @mikaelarguedas on my 2017 macbook pro (macOS 10.13.3).

I noticed a few behaviors of interest to me:

I experienced similar behaviors on my Windows desktop, not exactly the same, but the same trend (enough to convince me that it's not OS dependent at this point). I can provide timestamped data if needed on any platform.

dejanpan commented 6 years ago

@andreaspasternak can you weigh in here please?

richiware commented 6 years ago

In this branch I've decreased the default values for some timers, which improves the responsiveness at the RTPS level. I've noticed it using cam2image and showimage. Can anyone test with this branch?

mikaelarguedas commented 6 years ago

Thanks @richiware for looking into it. The behavior on the hotfix branch is much better indeed!

Could you briefly describe what tradeoffs are being made with the new timer values?

Result on my XPS15 using the new branch:

| | Min (s) | Max (s) | Mean (s) | Median (s) | Stddev (s) | fps |
|---|---|---|---|---|---|---|
| connext | 0.0300700665 | 0.0410900116 | 0.0318081648 | 0.0316700935 | 0.0010073107 | 30.2290620882 |
| opensplice | 0.0100297928 | 0.0511300564 | 0.0221883707 | 0.019359827 | 0.0093266318 | 30.261684248 |
| fastrtps | 0.002010107 | 0.1774599552 | 0.0352770948 | 0.0078701973 | 0.0464236037 | 30.6302342667 |
| fastrtps_hotfix | 0.001853466 | 0.0287704468 | 0.0074550388 | 0.0068049431 | 0.0458932319 | 30.2791350882 |

image_demo_latency_hotfix_comparison

With the hotfix and for this test, Fast-RTPS has less latency than the other rmw implementations using default values: image_demo_latency_hotfix

Note: I haven't tested other frequencies or message sizes.

dirk-thomas commented 6 years ago

The branch definitely shows better performance with the burger demo. Thank you for the quick update :+1:

Could you briefly describe what tradeoffs are being made with the new timer values?

Yeah, any information on what you expect as "side effects" for other use cases would be good to hear.

deeplearningrobotics commented 6 years ago

Hi all,

Regarding this:

Second, once I took the data I post-processed it with a small python script to calculate the delta between successive publish/receives (I can share that script later if needed).

Do you see non-trivial jitter? I'm not sure how to quantify that other than gathering a lot of samples and showing the distribution in a log-log plot or something.

Next week we will release as open source a performance measurement tool for various communication systems. Its core features are:

We were working together with eProsima (FastRTPS) to improve performance under real-time conditions for autonomous driving with great success. This will also improve the performance of the upcoming new RMW implementation for FastRTPS with static type support (https://github.com/eProsima/rmw_fastrtps/tree/static_typesupport)

In the following graphs, note that y-axis is latency in ms and x-axis is the duration of the experiment in seconds.

Using FastRTPS directly on a real-time system with real-time settings shows good results for medium-sized data at 1000 Hz: log_array32k

Large data also works well at 30 Hz after applying the settings in http://docs.eprosima.com/en/latest/advanced.html#finding-out-system-maximum-values: log_array2m
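
(For context, the "maximum values" in that link refer to the kernel's socket buffer limits; on Linux they can be inspected and raised roughly like this, with illustrative values rather than the specific ones recommended there:)

```bash
# Check the current maximum receive/send socket buffer sizes
sysctl net.core.rmem_max net.core.wmem_max
# Raise them for large messages (example values; persist via /etc/sysctl.d/ if needed)
sudo sysctl -w net.core.rmem_max=4194304
sudo sysctl -w net.core.wmem_max=4194304
```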

We found for example that busy waiting (meaning not using a waitset) leads to a very significant performance degradation due to expensive mutex locking/unlocking. log_array32k_no_waitset

Using the current dynamic type support of FastRTPS together with ROS 2 does not provide optimal performance, but this will very likely be resolved shortly by the upcoming new RMW implementation for FastRTPS with static type support. log_array32k_ros2

If you would like to test other RMW implementations, you can do so using the RMW_IMPLEMENTATION environment variable.
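
For example, reusing the image demo from this thread (rmw package names as used at the time):

```bash
# Pick the RMW implementation per process via the environment
RMW_IMPLEMENTATION=rmw_connext_cpp ros2 run image_tools cam2image -b -x 640 -y 480
RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run image_tools showimage
```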

We took care to make the performance test tool real-time capable by:

Here is the latency at 100 Hz with real-time functionality enabled: log_array32k_rt

Here it is without real-time functionality: log_array32k_non_rt

dirk-thomas commented 6 years ago

@richiprosima Can you please respond to the above questions? We would like to get this resolved (merged into master) rather sooner than later. Thanks.

richiware commented 6 years ago

@mikaelarguedas @dirk-thomas I've decreased two QoS values, one for publishers and one for subscribers.

Decreasing these values improves the response to lost data: a lost sample takes less time to be re-sent. This is most noticeable in the video scenario because you can see the stuttering. Why is data lost? A video frame is too large for the default Linux socket buffer size, so some RTPS fragments are lost.

This week I want to play more with these values and it will be merged at the end of the week.

dirk-thomas commented 6 years ago

@richiware

Could you briefly describe what tradeoffs are being made with the new timer values?

richiware commented 6 years ago

I've updated my comment ↑

mikaelarguedas commented 6 years ago

Thanks @richiware for the updated comment!

(sorry if it's obvious) Can you give examples of scenarios where a shorter nackResponseDelay (or heartbeatResponseDelay) would result in worse performance?

As a made-up example of the kind of "scenarios" I'm thinking of: is it possible that in network environments with significant latency, fragments will be considered lost while they may still be on the way, and with a shorter nackResponseDelay we would nack them too early, resulting in an increasing number of resends and collisions?

richiware commented 6 years ago

nackResponseDelay is also used to reduce the number of heartbeat messages. If several acknack messages from different subscribers are received in a short time, they will be answered with one heartbeat instead of one for each acknack.

matt-attack commented 6 years ago

It's concerning to me that these parameters are fixed times. It would probably help if they were related to network latency somehow so you don't have to tune them manually for the type of network you are using.

dirk-thomas commented 6 years ago

This week I want to play more with these values and it will be merged at the end of the week.

@richiware As far as I can see the patch hasn't been merged yet. Can you clarify the timeline when that is planned to happen?

richiware commented 6 years ago

@dirk-thomas, @miguelcompany and I were working on a new release. We are integrating all the new features, including this patch, into our internal develop branch. I want to release it to master tomorrow (Wednesday). Sorry for the inconvenience.

dirk-thomas commented 6 years ago

With the hotfix merged upstream I will go ahead and close this.

dejanpan commented 6 years ago

@dirk-thomas for traceability it would be nice to link to the upstream commit where the fix happened.

mikaelarguedas commented 6 years ago

Commit that changed the timer values: https://github.com/eProsima/Fast-RTPS/commit/201a393b500a35d56d3e46b5b6c2c6a96e85928e