ros-realtime / community

WG governance model & list of projects

Consolidate Performance Test Tools for ROS 2 #6

Closed dejanpan closed 2 years ago

dejanpan commented 4 years ago

We currently have the following selection of tools available:

  1. https://discourse.ros.org/t/ros-2-real-time-working-group-online-meeting-18-may-26-2020-meeting-minutes/14302/14?u=dejan_pangercic
    1. [g1] performance_test (Apex) https://gitlab.com/ApexAI/performance_test
    2. [g2] ros2-performance (iRobot) https://github.com/irobot-ros/ros2-performance
    3. [g3] buildfarm_perf_tests (Open Robotics) https://github.com/ros2/buildfarm_perf_tests
  2. Every DDS vendor has a perf testing tool, e.g.
    1. https://github.com/rticommunity/rtiperftest

Acceptance Criteria

  1. [x] Consolidate the requirements from:
    1. https://discourse.ros.org/t/ros-2-real-time-working-group-online-meeting-18-may-26-2020-meeting-minutes/14302/14?u=dejan_pangercic (at the tail)
    2. https://drive.google.com/file/d/15nX80RK6aS8abZvQAOnMNUEgh7px9V5S/view => Standardized experiment setup
  2. [x] Compare above tools
  3. [x] Enhance one of the tools => performance_test was improved, and reference_system was created to complement other use cases
fadi-labib commented 4 years ago

Based on https://discourse.ros.org/t/ros-2-real-time-working-group-online-meeting-18-may-26-2020-meeting-minutes/14302/14 and https://gist.github.com/y-okumura-isp/8c03fa6a59ce57533159c7e3e7917999, Apex.AI analyzed the comparison criteria and the three tools covered in this analysis:

  1. CI: buildfarm_perf_tests + Apex.AI performance_test
  2. iRobot ros2-performance
  3. pendulum_control

[1] CI: buildfarm_perf_tests + Apex.AI performance_test

Test 1

In this test we are running the performance_test tool provided by Apex.AI. Right now we have our own fork because there are some pending pull requests in the official GitLab repository.

Proposal for next steps: Apex.AI will try to solve this problem as follows:

  1. Apex.AI will allocate more resources to maintain the performance_test tool; a newcomer has already joined who will focus mainly on this topic
  2. If this doesn't fix the problem, Apex.AI will open up the tool for more maintainers from the community

Test 2

Apex.AI would like to understand how this test differs from its own performance_test tool, in order to identify the gaps in the tool:

  1. We understand this is probably because the test can use two different RMW implementations, so it required some tweaks to run two different instances
  2. Also, we would like to understand whether the metrics measured in the test (average round trip, CPU usage (read from the filesystem), total lost packets, received/sent packets per second, physical memory, resident anonymous memory, virtual memory) are calculated differently than in Apex.AI's performance testing. Apex.AI uses https://gitlab.com/ApexAI/performance_test/-/blob/master/performance_test/src/experiment_execution/analysis_result.cpp, and performance_test supports all metrics available through https://linux.die.net/man/2/getrusage (see the getrusage sketch below). We would like more clarification on the additional metrics, how they are calculated, and whether the common metrics (e.g. CPU) are calculated differently
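
As an illustration of the getrusage-based approach, here is a minimal sketch (illustrative only, not code from performance_test or buildfarm_perf_tests) of sampling CPU time, peak resident memory, and page faults for the current process:

```cpp
// Minimal sketch: sampling process resource usage with getrusage(2).
// Illustrative example, not code from performance_test or buildfarm_perf_tests.
#include <sys/resource.h>
#include <cstdio>

int main()
{
  rusage usage{};
  if (getrusage(RUSAGE_SELF, &usage) != 0) {
    perror("getrusage");
    return 1;
  }
  // User and system CPU time consumed so far by this process.
  const double user_s = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;
  const double sys_s = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1e6;
  std::printf("user CPU: %.6f s, system CPU: %.6f s\n", user_s, sys_s);
  // Peak resident set size (KiB on Linux) and page-fault counts, two of the
  // metrics discussed above.
  std::printf("max RSS: %ld KiB, minor faults: %ld, major faults: %ld\n",
              usage.ru_maxrss, usage.ru_minflt, usage.ru_majflt);
  return 0;
}
```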

Hello @cottsay, @dejanpan mentioned that you can help with buildfarm_perf_tests, so please comment on the above if you can.

Test 3

Apex.AI shall consider supporting the use case of measuring node overhead.

Proposal for next steps:

  1. Apex.AI shall work with Open Robotics on consolidating buildfarm_perf_tests [g3] and performance_test [g1]. This appears feasible and would bring the benefit of a single standard evaluation platform, increase the utilization of both tools, and bring them to maturity across larger use cases

[2] iRobot ros2-performance

The tool is mature: it has around 290 commits and around 21 forks. It is inspired by the performance_test tool.

From iRobot's documentation: "ApexAI provides an alternative valid performance evaluation framework, which allows testing different type of messages. Our implementation is inspired by their work."

In addition to the already open-source performance_test tool, Apex.AI internally has another performance testing tool, called test_bench, for testing a running system that scales to the dimensions of a real application, i.e. about one hundred nodes with various loads of messages being passed between them. The tool is very similar to iRobot's ros2-performance.

The tool is still under evaluation and Apex.AI is planning to release the tool in 2021.

Apex.AI's test_bench doesn't support:

  1. Measuring service discovery time
  2. Using services
  3. Building the topology manually https://github.com/irobot-ros/ros2-performance/tree/master/performances/performance_test_factory#manually-create-ros2-nodes

Apex.AI's tool supports the following features over iRobot's tool:

  1. Persisting the results in a database using ODB
  2. Configuration uses YAML files, which seem easier to manage than the JSON configurations in iRobot's tool (the JSON files contain much more repetition than Apex.AI's YAML files)
  3. iRobot's tool doesn't appear to support a simulated work time; its publishers run at a fixed frequency only. Apex.AI's test_bench allows publishers to run at a fixed frequency, but also has the option to publish once after each subscriber in the node receives a message.
  4. Apex.AI's tool supports changing the QoS settings and the threading settings; it is not clear whether iRobot's tool does (see the QoS sketch below for the kind of settings involved)
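
To illustrate the kind of QoS configuration meant in point 4, here is a minimal rclcpp sketch (illustrative only, not code from test_bench or ros2-performance) that creates a publisher with explicit history, reliability, and durability settings:

```cpp
// Minimal sketch: creating a publisher with explicit QoS settings in rclcpp.
// Illustrative only; not code from test_bench or ros2-performance.
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("qos_demo");

  // Keep the last 10 samples, best-effort delivery, volatile durability:
  // the kind of settings one would vary in a performance experiment.
  rclcpp::QoS qos(rclcpp::KeepLast(10));
  qos.best_effort().durability_volatile();

  auto pub = node->create_publisher<std_msgs::msg::String>("chatter", qos);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```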

iRobot's definition for classifying message latencies is:

Messages are classified by their latency:
- too_late: the latency is greater than min(period, 50ms), where period is the publishing period of that particular topic
- late: not too_late, but the latency is greater than min(0.2 * period, 5ms)
- lost: the message never arrived

The idea is that a real system could still work with a few late messages, but not with too_late messages.

Apex.AI's definition is:

The latency must be within the publishing period (i.e. less than 1/frequency).
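
To make the two definitions concrete, here is a minimal sketch (illustrative only; the thresholds follow the definitions quoted above) of both classification rules:

```cpp
// Minimal sketch of the two latency criteria described above.
// Illustrative only; the thresholds follow the definitions quoted in this
// thread. A message that never arrives is counted separately as "lost".
#include <algorithm>
#include <chrono>

enum class Classification {on_time, late, too_late};

// iRobot ros2-performance rule:
//   too_late : latency > min(period, 50 ms)
//   late     : not too_late, but latency > min(0.2 * period, 5 ms)
Classification classify_irobot(
  std::chrono::nanoseconds latency, std::chrono::nanoseconds period)
{
  using namespace std::chrono_literals;
  const auto too_late_threshold =
    std::min<std::chrono::nanoseconds>(period, 50ms);
  const auto late_threshold =
    std::min<std::chrono::nanoseconds>(period / 5, 5ms);  // 0.2 * period
  if (latency > too_late_threshold) {return Classification::too_late;}
  if (latency > late_threshold) {return Classification::late;}
  return Classification::on_time;
}

// Apex.AI criterion: the latency must stay within one publishing period.
bool within_period_apex(
  std::chrono::nanoseconds latency, std::chrono::nanoseconds period)
{
  return latency < period;
}
```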

Proposal for next steps:

  1. RTWG shall come up with a common definition for latency
  2. After Apex.AI's evaluation of the test_bench, Apex.AI shall consider merging the test_bench with performance_test to have a single common performance evaluation tool
  3. Besides merging test_bench with performance_test, Apex.AI will evaluate test_bench internally, compare its usability with iRobot's performance tool, and then decide how to proceed (planned in 2021)
  4. Based on the analysis, Apex.AI shall discuss the different possibilities with iRobot; the available options are merging the two tools, keeping the two tools independent, adding the missing features to one of them, or other options to be discussed

[3] pendulum_control

  1. It seems the tool doesn't add much value over the other tools
  2. It seems the tool hasn't had major updates since 2016, so it is a bit aged
  3. It tracks malloc usage and prints stack backtraces, which is similar to the approach used by https://github.com/osrf/osrf_testing_tools_cpp
  4. The page-fault collection seems to use the same approach as performance_test (performance_test, pendulum_control/rttest)
  5. In https://gist.github.com/y-okumura-isp/8c03fa6a59ce57533159c7e3e7917999#metrics-comparison-table-resource, there is a jitter metric in pendulum_control; this can be calculated easily from the latencies collected by [1] & [2] (see the jitter sketch below)
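
Regarding point 5, a minimal sketch of deriving a jitter metric from already collected latency samples; using the standard deviation here is an assumption for illustration, not pendulum_control's definition:

```cpp
// Minimal sketch: deriving a jitter metric from a series of latency samples.
// Illustrative only; the exact jitter definition (standard deviation here)
// is an assumption, not taken from pendulum_control.
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

double mean(const std::vector<double> & v)
{
  return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Jitter as the standard deviation of the latency samples (same unit as the
// samples, e.g. microseconds).
double jitter_stddev(const std::vector<double> & latencies)
{
  const double m = mean(latencies);
  double sq_sum = 0.0;
  for (double x : latencies) {sq_sum += (x - m) * (x - m);}
  return std::sqrt(sq_sum / latencies.size());
}

int main()
{
  const std::vector<double> latencies_us = {102.0, 98.5, 110.2, 99.7, 105.3};
  std::printf("jitter (stddev): %.2f us\n", jitter_stddev(latencies_us));
  return 0;
}
```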

Conclusion: pendulum_control shall not be considered.

y-okumura-isp commented 4 years ago

@fadi-labib Thank you for your comment and consideration. I said I was going to post a follow-up article on Discourse, and this is it. I'm sorry for being late. Since I originally drafted this for ROS Discourse, please forgive me for the long comment.

Preface

We are surveying ROS 2 measurement tools, especially from a real-time perspective. We have compared some existing tools and want to share the results. As described below, we found some differences between the tools. We hope this gives hints for choosing measurement conditions and settings.

For the complete comparison table, please see https://gist.github.com/y-okumura-isp/8c03fa6a59ce57533159c7e3e7917999. "No1" etc. in this post refers to the row number in that table. This post is a follow-up to my 2020/09/01 Real-Time WG talk.

In our comparison table, the function comparison part is quite large, so we mainly describe that table here.

Target Projects

We survey the following projects:

  1. [1] CI: buildfarm_perf_tests + Apex.AI performance_test
  2. [2] iRobot ros2-performance
  3. [3] pendulum_control

Each tool has at least one publisher and one subscriber. The publisher wakes up periodically, sends a message on a topic, and sleeps again (as sketched below). The tools measure program performance such as topic trip time, and OS resources such as CPU usage.
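
As a rough illustration of this structure (not code from any of the surveyed tools), a periodic publisher can stamp each message and a subscription can compute the one-way trip time from that stamp:

```cpp
// Minimal sketch of the structure described above: a periodic publisher
// stamps each message, and the subscription derives the one-way trip time.
// Illustrative only; not code from any of the surveyed tools.
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/header.hpp>

using std_msgs::msg::Header;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("trip_time_demo");

  auto pub = node->create_publisher<Header>("ping", 10);

  // Publisher side: wake every 10 ms, stamp and send a message, sleep again.
  auto timer = node->create_wall_timer(
    std::chrono::milliseconds(10),
    [node, pub]() {
      Header msg;
      msg.stamp = node->now();
      pub->publish(msg);
    });

  // Subscription side: trip time = receive time - publish stamp.
  auto sub = node->create_subscription<Header>(
    "ping", 10,
    [node](Header::ConstSharedPtr msg) {
      const auto trip = node->now() - rclcpp::Time(msg->stamp);
      RCLCPP_INFO(node->get_logger(), "trip time: %.3f us",
                  trip.nanoseconds() / 1e3);
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```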

Program structure

We describe the functional similarities and differences of each tool in this section. Since ROS has the following layer structure, we organize the description according to it.

+----------------------------+
|  Publisher / Subscription  |
|  rclcpp(Executor/Nodes)    |
|  DDS                       |   ROS2 layer
+----------------------------+

+----------------------------+
|  Process and RT-setting    |
+----------------------------+

+----------------------------+
|  HW / OS                   |
+----------------------------+

We describe how to read the table, followed by the explanations of each layer.

How to read the table

We explain how to read the comparison table. This table has the following columns.

| Column name | About |
| --- | --- |
| Category | The layer of the structure, such as "HW / OS" and "Process", from bottom to top. |
| Subcategory | Divides a category into a few subcategories. |
| name | Concrete items. |
| [1] Test1, [1] Test2 | About [1]. As [1] has two types of test, we split it into two columns. |
| [2] | About [2]. |
| [3] | About [3]. |

And we use the following notations:

We describe a summary of each category below.

HW / OS

Process

RT setting

DDS

rclcpp

Communication detail

ROS2 communication optimization

There are several communication optimizations, and each requires a specific situation; for example, we cannot use intra_process_comms for inter-process communication. There are at least four patterns of relationship between a publisher and its subscribers, and we have to check which optimizations can be used in each situation.

- There are many types of relations between Pub & Subs. The figure below shows:
  - Sub1: same process and same Node as Pub.
          Several optimizations are possible: Node intra_process_comms, sharing pointers, DDS zero-copy, and so on.
  - Sub2: same process but a different Node from Pub.
          We may use the same optimizations as for Sub1.
  - Sub3: different process but the same host as Pub.
          Some DDS implementations may optimize this type of communication.
  - Sub4: a different host from Pub.
          Communication goes over the network.

 +-----+  +------+ +------+ +----------+  +----------+
 | Pub |  | Sub1 | | Sub2 | |   Sub3   |  |   Sub4   |
 +-----+  +------+ +------+ +----------+  +----------+
 +---------------+ +------+ +----------+  +----------+
 |     Node      | | Node | |   Node   |  |   Node   |  <- rclcpp has intra_process_comms_
 +---------------+ +------+ +----------+  +----------+
 +------------------------+ +----------+  +----------+
 |        Executor        | | Executor |  | Executor |
 +------------------------+ +----------+  +----------+
 +------------------------+ +----------+  +----------+
 |          DDS           | |   DDS    |  |   DDS    |  <- some DDS support efficient communication such as intra process or shm
 +------------------------+ +----------+  +----------+
 +------------------------+ +----------+  +----------+
 |        Process         | | Process  |  | Process  |
 +------------------------+ +----------+  +----------+
 +-------------------------------------+  +----------+
 |              Host1                  |  |  Host2   |
 +-------------------------------------+  +----------+
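
For the Sub1/Sub2 cases, rclcpp's intra-process option is enabled per node. Here is a minimal configuration sketch (illustrative only; publishing and measurement are omitted):

```cpp
// Minimal configuration sketch: enabling rclcpp intra-process communication
// for the Sub1/Sub2 cases above (publisher and subscription sharing a
// process). Illustrative only.
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Both nodes opt in to intra-process communication, so messages on
  // "chatter" can bypass the DDS layer when publisher and subscription
  // live in the same process.
  auto options = rclcpp::NodeOptions().use_intra_process_comms(true);
  auto pub_node = std::make_shared<rclcpp::Node>("pub_node", options);
  auto sub_node = std::make_shared<rclcpp::Node>("sub_node", options);

  auto pub = pub_node->create_publisher<std_msgs::msg::String>("chatter", 10);
  auto sub = sub_node->create_subscription<std_msgs::msg::String>(
    "chatter", 10,
    [](std_msgs::msg::String::ConstSharedPtr msg) {
      // Delivered through the intra-process pipeline when available.
      (void)msg;
    });

  // Run both nodes in one executor, i.e. one process (the Sub2 case).
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(pub_node);
  executor.add_node(sub_node);
  executor.spin();

  rclcpp::shutdown();
  return 0;
}
```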

Publisher / callback

Subscription / callback

Other functions

Measurement

JanStaschulat commented 4 years ago

Testbench to generate use cases:

Generation of use cases: frontend for the rclc Executor (Executor with C API), from Bosch

Frontend for the Static Executor in ROS 2 (Executor with C++ API), from Nobleo

carlossvg commented 3 years ago

I will summarize the discussion up to this point:

Define metrics:

  1. timer precision (see the timer sketch after this list)
  2. communication quality
    1. topic trip-time (1-way, 2-way) stats
    2. total sent / total recv / losses
  3. Program latency
    1. PDP/EDP discovery
    2. timer jitter
    3. callback jitter
  4. Resource usage (CPU, memory, page faults, network)
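
For metrics 1 and 3.2, here is a minimal sketch (plain C++, illustrative only, not code from any of the listed tools) of measuring how far each periodic wake-up drifts from its intended release time:

```cpp
// Minimal sketch: measuring timer precision / jitter of a periodic loop.
// The wake-up error is the difference between the actual and the intended
// wake-up time. Illustrative only; not code from any of the listed tools.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
  using clock = std::chrono::steady_clock;
  const auto period = std::chrono::milliseconds(10);
  const int iterations = 1000;

  std::vector<double> wakeup_error_us;
  wakeup_error_us.reserve(iterations);

  auto next_wakeup = clock::now() + period;
  for (int i = 0; i < iterations; ++i) {
    std::this_thread::sleep_until(next_wakeup);
    const auto now = clock::now();
    // Positive values mean we woke up late; this is one jitter sample.
    wakeup_error_us.push_back(
      std::chrono::duration<double, std::micro>(now - next_wakeup).count());
    next_wakeup += period;
  }

  double max_error = 0.0;
  double sum = 0.0;
  for (double e : wakeup_error_us) {
    sum += e;
    if (e > max_error) {max_error = e;}
  }
  std::printf("mean wake-up error: %.2f us, max: %.2f us\n",
              sum / wakeup_error_us.size(), max_error);
  return 0;
}
```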

Open questions:

  1. How to measure timer precision?
  2. Communication quality => This is covered by performance test tools
    • Do we need to add new features to the tools for missing metrics?
  3. Program latency. Is it possible to measure this with performance test? Does it make sense to create a new sub-project to measure this?
  4. Resource usage. There are several applications using different approaches (see the /proc sketch after this list); it would be interesting to aggregate all these utilities in one single package.
    1. performance test => getrusage
    2. ros2 tracing => https://gitlab.com/ros-tracing/tracetools_analysis/-/blob/master/tracetools_analysis/launch/memory_usage.launch.py
    3. tooling code inside the package
      1. https://github.com/ros2/buildfarm_perf_tests/blob/master/src/linux_cpu_system_measurement.cpp
      2. https://gitlab.com/ApexAI/performance_test/-/blob/master/performance_test/src/utilities/cpu_usage_tracker.hpp
      3. https://gitlab.com/ApexAI/performance_test/-/blob/master/performance_test/src/utilities/qnx_res_usage.hpp
    4. https://github.com/ros-tooling/system_metrics_collector
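
As an example of the "read from the filesystem" style of measurement listed under 3, here is a minimal sketch (illustrative only, not code from any of the packages above) that reads virtual and resident memory from /proc/self/statm:

```cpp
// Minimal sketch: reading process memory usage from the filesystem
// (/proc/self/statm), the "read from the filesystem" style of measurement
// referenced above. Illustrative only; not code from any listed package.
#include <unistd.h>
#include <cstdio>
#include <fstream>

int main()
{
  // /proc/self/statm reports sizes in pages: total program size, resident
  // set size, shared pages, text, lib, data+stack, dirty pages.
  std::ifstream statm("/proc/self/statm");
  long size_pages = 0;
  long resident_pages = 0;
  if (!(statm >> size_pages >> resident_pages)) {
    std::fprintf(stderr, "failed to read /proc/self/statm\n");
    return 1;
  }

  const long page_size = sysconf(_SC_PAGESIZE);
  std::printf("virtual memory: %ld KiB, resident memory: %ld KiB\n",
              size_pages * page_size / 1024,
              resident_pages * page_size / 1024);
  return 0;
}
```
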
carlossvg commented 2 years ago

For the moment we will use the following benchmarking tools: