ros-realtime / community

WG governance model & list of projects

Performance test results analysis and interpretation #9

Open EduPonz opened 3 years ago

EduPonz commented 3 years ago

I'm opening this issue to trigger the discussion on the best methods to analyze, interpret, and compare the performance results obtained with the methods decided in #6.

Right now, buildfarm_perf_test does very slim and probably inadequate processing of the results, basically showing mean and standard deviation without even looking at the probability distributions, which can lead to wrong assessments, for instance in regression detection. Since Apex.AI performance_test already produces a "mean" entry every second, I thought about benefiting from the central limit theorem and performing Gaussian statistics over those per-second means. Mind that having normal distributions of the measurements will ease the comparisons, since that enables statistics such as Student's t-test, which can assess the significance of the difference between two experiments (see the sketch after the list below). However, I encountered some problems with this approach:

  1. For low publication rates we would need a vast number of entries for the per-second means to tend towards a normal distribution.
  2. Some measurements do not seem to behave like simple random variables. I played with latency measurements in different publication modes. For synchronous publications I got fairly normal distributions of the per-second means when publishing for 10 min at 1000 Hz. However, asynchronous publications showed trimodal distributions with three very clear modes, each of them with a different probability.
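A minimal sketch of this idea (assuming a plain Python analysis script with NumPy/SciPy, and that the raw per-sample latencies and their timestamps are available as arrays; none of this exists in performance_test or buildfarm_perf_test today):

```python
import numpy as np
from scipy import stats


def per_second_means(latencies_us, timestamps_s):
    """Aggregate raw latency samples into per-second means (the CLT argument above)."""
    seconds = np.floor(timestamps_s).astype(int)
    return np.array([latencies_us[seconds == s].mean() for s in np.unique(seconds)])


def looks_normal(samples, alpha=0.05):
    """Shapiro-Wilk test: True if normality cannot be rejected at level alpha.

    The trimodal asynchronous-publication case described above should fail this check.
    """
    _, p_value = stats.shapiro(samples)
    return p_value > alpha
```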

The above made me think that we have to develop a system that can decide which statistical test to run between experiments so that it gives the most relevant information. Such a system could then be used for detecting regressions in CI builds, and also as a way to present performance results to end users, so that they can draw fair and relevant conclusions about the performance of the stack under different configurations.
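Building on the sketch above, a hypothetical test selector could look like this (again just an illustration reusing the `looks_normal` helper; the fallback to Mann-Whitney U is my own suggestion, not something already agreed upon):

```python
from scipy import stats


def compare_experiments(means_a, means_b, alpha=0.05):
    """Pick a significance test based on the shape of the two per-second-mean distributions."""
    if looks_normal(means_a, alpha) and looks_normal(means_b, alpha):
        # Welch's t-test: compares means without assuming equal variances.
        test_name = "welch_t_test"
        _, p_value = stats.ttest_ind(means_a, means_b, equal_var=False)
    else:
        # Non-parametric fallback when the distributions are not normal.
        test_name = "mann_whitney_u"
        _, p_value = stats.mannwhitneyu(means_a, means_b, alternative="two-sided")
    return test_name, p_value, bool(p_value < alpha)  # True -> significant difference
```

A CI job could then flag a regression only when the difference is significant and the new results are worse than the reference run.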

Furthermore, the ROS 2 middlewares allow for a great number of configurations, many of which have an impact on the performance of the stack. I think it'd be very important to define testing profiles and have results for each of them, so that end users can select the profile from which they will benefit the most.
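To make the profile idea concrete, here is a purely hypothetical example of what such profiles could look like (every name and value below is invented for illustration):

```python
# Each profile pins down the configuration parameters that affect performance,
# so results are always compared within the same profile.
TEST_PROFILES = {
    "control_loop_small_messages": {
        "rmw_implementation": "rmw_fastrtps_cpp",
        "payload_bytes": 64,
        "publish_rate_hz": 1000,
        "reliability": "reliable",
        "publish_mode": "synchronous",
    },
    "sensor_stream_large_messages": {
        "rmw_implementation": "rmw_cyclonedds_cpp",
        "payload_bytes": 4 * 1024 * 1024,
        "publish_rate_hz": 30,
        "reliability": "best_effort",
        "publish_mode": "asynchronous",
    },
}
```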

It would also be very helpful for users and developers to set performance requirements on the different profiles. In my opinion, we are sometimes comparing latency differences down to the microsecond, but I really don't think any robotic system cares about such a small difference, especially because the control system would never run at such high frequencies. From the users' perspective, I think it is not a question of which implementation gives the very best performance, but rather which one can meet my requirements. This approach would push development to meet all the requirements in every direction, improving the overall ROS 2 experience.
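For example, each profile could declare its requirements, and a run would simply pass or fail against them (the thresholds below are invented, only to show the shape of the check):

```python
import numpy as np

# Hypothetical per-profile requirements; real values would have to come from users/WG discussion.
REQUIREMENTS = {
    "control_loop_small_messages": {"max_mean_latency_us": 500, "max_p99_latency_us": 2000},
    "sensor_stream_large_messages": {"max_mean_latency_us": 20000, "max_p99_latency_us": 50000},
}


def meets_requirements(profile, latencies_us):
    """Return True if the measured latencies satisfy the profile's requirements."""
    req = REQUIREMENTS[profile]
    return (latencies_us.mean() <= req["max_mean_latency_us"]
            and np.percentile(latencies_us, 99) <= req["max_p99_latency_us"])
```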

tokr-bit commented 3 years ago

Related to that, we measured latencies with ROS 2 Foxy while varying different parameters (payload, number of nodes, DDS middleware, frequency, ...) and submitted a paper. You can find the preprint on arXiv: https://arxiv.org/pdf/2101.02074.pdf

Evaluation code can be found on GitHub.

carlossvg commented 3 years ago

Thanks @EduPonz for your comments. I will continue the discussion related to real-time statistics in this thread. These are the different options I'm aware of:

  1. Currently: mean, standard deviation, max, min => https://gitlab.com/ApexAI/performance_test/-/blob/master/performance_test/src/utilities/statistics_tracker.hpp
  2. Worst-case execution time (WCET) over long experiment durations (which is basically the max). Mentioned by @y-okumura-isp in https://discourse.ros.org/t/ros-2-real-time-working-group-online-meeting-18-may-26-2020-meeting-minutes/14302/12
  3. More complex statistics depending on the result distribution => Proposed by @EduPonz https://github.com/ros-realtime/community/issues/9
  4. Histogram? (see the sketch below) => https://github.com/hsgwa/ros2_timer_latency_measurement/blob/master/src/util.cpp
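For reference, option 4 could be as simple as something like this (the bin width and percentile levels are arbitrary choices on my side, not anything from the linked code):

```python
import numpy as np


def latency_summary(latencies_us, bin_width_us=10.0):
    """Histogram plus a few percentiles, instead of only mean/std/min/max."""
    edges = np.arange(latencies_us.min(), latencies_us.max() + bin_width_us, bin_width_us)
    counts, edges = np.histogram(latencies_us, bins=edges)
    percentiles = {p: float(np.percentile(latencies_us, p)) for p in (50, 90, 99, 99.9)}
    return counts, edges, percentiles
```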

Some topics I would like to discuss: