High CPU usage when using DDS intra-process communication

mauropasse commented 3 years ago

CPU profiling of a ROS2 application shows that using the DDS intra-process communication uses almost twice CPU time than using the rclcpp IPC (handled by the intra-process manager).

This happens due the executor makes a copy of the message when a subscription is ready to take it: https://github.com/ros2/rclcpp/blob/7d8b2690ff188672bfe3703d8b3c074b87303931/rclcpp/src/rclcpp/executor.cpp#L637

The copy is made regardless of the use of IPC by the DDS middleware, where the message might be passed from the publisher to the subscriber without actually copying it.

The following plot shows the total CPU time spent to publish a 100Kb message and receive it on the subscription.

Tested with CycloneDDS and FastDDS, both with rclcpp IPC ON and OFF, using the StaticSingleThreadedExecutor on a single core platform (RPi 1) at 500Mhz:

Publisher → Msg: 100kb ← Subscriber The graph below shows a FlameGraph (CPU profiling) of a similar system, using the DDS intra-process (rclcpp IPC disabled)

Publisher → Msg: 8mb at 10Hz ← Subscriber We can see memmove/memset operations called twice, once by the publisher when creating the message, and once by the executor when taking the subscription message on execute_subscription(), where a new message is created.

We're planning to fix this, maybe reusing some of the logic used to take loaned messages, but first we'd like to know if this is a known issue and maybe there's already a plan for a fix, or a proposal on a possible fix implementation?

fujitatomoya commented 3 years ago

@mauropasse thanks for detailed information, really interesting.

what about using LoanedMessage instead? the following seems to support zero copy via LoanedMessage, which means that application can write / read the same exact physical memory address. i believe that this can support intra/inter-process communication w/o data copy. (if it is intra-process, it can use the same virtual address.)

CC: @MiguelCompany @eboasson would you care to share thoughts?

irobot-ros commented 3 years ago

Yes, if it's possible we should try to reuse the logic for LoanedMessages.

The rclcpp code shouldn't really care about whether the message comes from the same process or from a different process via a shared memory layer, as long as the underlying DDS supports optimized intra-process communication.

mauropasse commented 3 years ago

Hi @fujitatomoya I tested both CycloneDDS and FastDDS branches, but LoanedMessage can't still be used because:

https://github.com/eclipse-cyclonedds/cyclonedds/blob/iceoryx/src/core/ddsc/src/dds_data_allocator.c#L130 returns false FastDDS:

https://github.com/ros2/rmw_fastrtps/blob/649b1c09507bd4811d5b3e11b1816d231bb6040b/rmw_fastrtps_cpp/src/publisher.cpp#L299

is false

fujitatomoya commented 3 years ago

I tested both CycloneDDS and FastDDS branches, but LoanedMessage can't still be used because

thanks for sharing, as far as i know about rmw_fastrtps, data structure has to be bound data type. sorry i am not sure about cyclonedds. (actually we've tried zero copy via rmw_fastrtps PRs, which works okay.)

we'd like to know if this is a known issue and maybe there's already a plan for a fix, or a proposal on a possible fix implementation?

i was hoping that could be one of them 😄

eboasson commented 3 years ago

The bounded size data type restriction is also in Cyclone, and it besides needs a few more changes before it will be willing to do this in the absence of shared memory (that's a bookkeeping detail).

Frankly, I'm shocked that there are 2 memsets in the profile and that they together consume 50% of the time. Cyclone at least has a good argument for calling memmove when not doing loaned messages, but the time in memset seems really wasteful as it is basically spent inside the allocation of a new message that then (presumably) will be filled with data. I guess that's standard C++ behaviour, but you might want to look into that.

MiguelCompany commented 3 years ago

@mauropasse If your type is plain, could you check with this commit?

I should mention that ros2/rosidl_typesupport_fastrtps#67 is also required

mauropasse commented 3 years ago

Great, with @MiguelCompany's commit, using ros2/rosidl_typesupport_fastrtps#67, and commenting out this line we can use loaned messages on intra-process publishers and subscribers. I'll do some profiling to check how the CPU looks like with these changes.

mauropasse commented 3 years ago

but the time in memset seems really wasteful as it is basically spent inside the allocation of a new message that then (presumably) will be filled with data. I guess that's standard C++ behaviour, but you might want to look into that.

@eboasson yes, I've observed the same and also plan to look at this, for what I've quickly seen is due the standard C++ behaviour.

mauropasse commented 3 years ago

Benchmark results looking good with LoanedMessage!

Test: Pub/Sub system passing a 100kb message at 40Hz
  1 Default use case IPC ON (is rclcpp's IPC, so no copies). User allocates Msg
  2 Default use case IPC OFF (the executor does a copy of the message) User allocates Msg
  3 Loaned message IPC ON - intra_process_publish(LoanedMessage);
  4 Loaned message IPC OFF - inter_process_publish(LoanedMessage);

fujitatomoya commented 3 years ago

@mauropasse

thanks for sharing the result!

fujitatomoya commented 3 years ago

@mauropasse is this https://github.com/ros2/rclcpp/issues/1642#issuecomment-829470359 come from cyclone or fast-dds?

alsora commented 3 years ago

@fujitatomoya those results are with Fast-DDS, using the instructions provided by @MiguelCompany.

mauropasse commented 3 years ago

I did further testing on LoanedMessages, trying to understand the reason for the performance improvements compared to RCLCPP intra-process, showed on https://github.com/ros2/rclcpp/issues/1642#issuecomment-829470359. The memory improvement by LoanedMessage was due the memory is asked on a map (its shown as part of VSZ), but it doesn't show on RSS until every byte of the message has been initialized (which I didn't do in previous test).

Also the CPU improvement was due the memory was not initalized on the LoanedMessages case. Now I made sure every byte is initalized in all tests, to have all memory shown in RSS. As conclusion by now, LoanedMessages perform similar to RCLCPP intra-process, but with bigger latency.

Now I'm trying to force ROS2 to use the DDS IPC (instead of the rclcpp one). In the case of rmw_fastrtps_cpp, a publisher sending a message does eprosima::fastdds::dds::DataWriter->write(&data); which serializes the message (and seems to copy the data). Looks like the entities which are able to do intraprocess deliveries are StatelessWriter and StatefulWriter (which derive from RTPSWriter on the RTPS Layer (as oposed to the DDS Layer of the DataWriter). But rmw_fastrtps_cpp implementation of a publisher is through a DataWriter which doesn't seem to be able to do intra-process communication. Am I right @MiguelCompany? Does this mean the publisher should be implemented using a RTPSWriter instead of DataWriter or is there any other way?

MiguelCompany commented 3 years ago

@mauropasse The DataWriter is a wrapper of both an RTPSWriter and a WriterHistory. It is able to perform intraprocess deliveries, but that doesn't mean it will avoid the serialization (i.e. copy).

On its travel from the user sample on the publishing side to the user sample on the subscribing side, the following copies may be performed:

Serialization from the user sample into the writer history during the DataWriter::write call (this is avoided when the input data comes from a loan)
From the writer's history to the reader's history
- On intraprocess this is automatically avoided.
- For readers on the same host and fully bounded data types, no copy is made, as the writer's history is on a shared-memory mmap'd file.
- For other cases, the transport is used, which requires:
  - Encapsulation (1 copy)
  - Data transfer (1 copy with UDP, no copies with SHM)
  - Desencapsulation (1 copy)
Deserialization from the reader's history to the user sample. That depends on how DataReader::take is called. When loans are used and the type is plain, a pointer to the serialized payload (which is binary compatible with the user type) is returned.

I hope this long explanation is clear enough.

fujitatomoya commented 3 years ago

@mauropasse

but it doesn't show on RSS until every byte of the message has been initialized (which I didn't do in previous test).

this means that maps all physical pages in the process space, i think this is what application does.

As conclusion by now, LoanedMessages perform similar to RCLCPP intra-process, but with bigger latency.

do you happen to have any idea why latency is bigger than rclcpp intra-process? i am not sure, but maybe page fault to map physical pages into process space in subscription side when using rmw intra-process? (means same physical page but different virtual address, it costs system time.) at least i think rclcpp intra-process uses the same exact virtual address to access the data if possible. i can be wrong on this...

thanks in advance.

mauropasse commented 3 years ago

@fujitatomoya I did some quick profiling, looks like when using LoanedMessage the latency is bigger due all the pipeline and work done in FastDDS to send the message (AsyncWriterThread takes 18% CPU time compared to 0.43% when using rclcpp's IPC). If you want to take a look, here are the FlameGraphs (open them in a web browser) flamegraphs.zip

fujitatomoya commented 3 years ago

@mauropasse

thanks 👍 but i am not sure how we can tell latency difference from this graph since it only shows ratio how much it consumes CPU....

rclcpp intra-process

rclcpp::executors::StaticSingleThreadedExecutor::spin is 47.01%, which includes publish based on timer and subscription to call user callback.
Fast-DDS intra-process

eprosima::fastrtps::rtps::AsyncWriterThread::run is 18.0% (as you mentioned), and rclcpp::executors::StaticSingleThreadedExecutor::spin is 28.02%. as in total, it is 46%.

i think time consuming ratio is almost same? just checking if i am missing something.

ivanpauno commented 3 years ago

We can see memmove/memset operations called twice, once by the publisher when creating the message, and once by the executor when taking the subscription message on execute_subscription(), where a new message is created.

The problem is that this line is zero-initializing the message, right? https://github.com/ros2/rclcpp/blob/7d8b2690ff188672bfe3703d8b3c074b87303931/rclcpp/src/rclcpp/executor.cpp#L637

(make sure you use permalinks, if not the link will be poiniting to another part of the code when updated)

The difference between zero and default initialization in cpp is extremely subtle, but it's also possible to achieve the seconde one.

e.g.

auto data = new T;  // default initialized, there won't be any memset for POD, it starts "uninitialized"
auto data = new T();   // this one is zero-initialized :)

we're using

https://github.com/ros2/rclcpp/blob/1037822a63330495dcf0de5e8f20544375a5f116/rclcpp/include/rclcpp/message_memory_strategy.hpp#L87

which is equivalent to zero-initialization (uses ::new (pv) T(v) according to cppreference). When using allocators it's a bit harder to get default initialization (c++20 introduced allocate_shared_for_overwrite :confused: ), but something like

  // allocate
  auto message_ptr = std::allocator_traits<MessageAlloc>::allocate(*message_allocator_.get(), sizeof(MessageT));
  try {
    // placement new, using default initialization
    message_ptr = new(message_ptr) MessageAlloc;
    return std::shared_ptr(
      message_ptr,
      [message_allocator_] (MessageT * p) {
        std::allocator_traits<MessageAlloc>::deallocate(*message_allocator_.get(), message_ptr, sizeof(MessageT));
      },
      *message_allocator_.get());
  } catch (...) {
    // deallocate here
    auto std::allocator_traits<MessageAlloc>::deallocate(*message_allocator_.get(), message_ptr, sizeof(MessageT));
    throw;
  }

should do it. (don't expect that snippet to work, I haven't actually tried it :smiley: )

Another thing to check is the default constructors of the message we're generating. Those might be forcing zero-initialization, but AFAIR that was not the case.

MiguelCompany commented 3 years ago

@mauropasse

AsyncWriterThread takes 18% CPU time

This can be avoided forcing synchronous publishing. Using export RMW_FASTRTPS_PUBLICATION_MODE=SYNCHRONOUS would do the trick.

mauropasse commented 3 years ago

Posting here latest tests on Loaned Messages, now including also CycloneDDS.

Notes about tests:

Modified FastDDS code to zero initialize loaned messages, to have a fair comparison against rclcpp IPC ON/OFF. Synchronous publish mode was used (due better performances than asynchronous).
Results using CycloneDDS loaned messages don't include the overhead of the iceoryx middleware daemon RouDi (iox-roudi), which runs in a different process.

tl; dr; Conclusions Loaned messages are good when publishing big messages (>50Kb). For small messages sizes, the performances are pretty similar or worst than to using IPC OFF. Even for big messages, latencies using loaned messages are twice or more than using rclcpp IPC ON.

Latency: CycloneDDS

Both graphs are the same, just different scale to better appreciate the difference with small messages.

image2021-8-9_15-41-48

Latency: FastDDS

Both graphs are the same, just different scale to better appreciate the difference with small messages.

image2021-8-9_15-41-33

Memory RSS: FastDDS & CycloneDDS

image2021-8-9_15-52-1

Memory VSZ: FastDDS & CycloneDDS

image2021-8-9_15-52-31

Defining future works

rclcpp IPC ON is great but currently services, clients and actions can't take advantage of this way of communication.
Expanding the rclcpp intra-process manager to support them would not be straight forward, nor is the correct approach. The RMW should handle these situations of single or different processes on same machine to make communication performant.
Even if the DDS currently have their own intra-process communication means, currently ROS2 can't take advantage of them. Copies of the messages are being done.

fujitatomoya commented 3 years ago

@mauropasse

Appreciate the update, this is informative 👍

What it does not make sense to me is the latency between LoanedMessage and RCLCCPP IPC OFF for small message size cases. in those cases, RCLCCPP IPC OFF is better performance than LoanedMessage, especially on CycloneDDS. (FastDDS LoanedMessage is almost the same with RCLCCPP IPC OFF)

I guess this is related to event notification mechanism, not sharing data. Any opinion?

budrus commented 3 years ago

Thanks for the update @mauropasse. Which exact version of ROS 2 did you use and which HW? There are also differences between the initial Galactic release and the first patch release

wjwwood commented 3 years ago

Interesting results, thanks for posting them.

I wonder about the hybrid approach we had in the past where the pointers and actual data were handled in rclcpp, but a message was passed via rmw (therefore via dds vendor) to notify the subscription. That might achieve good results with large messages where currently the rclcpp IPC (of today) out performs the loaned message approach.

Obviously we should try to improve the loaned message support first, but just a thought.

alsora commented 3 years ago

@wjwwood I took the chance to open a separate ticket to have a more "design-level" discussion on the various delivery mechanisms such that we can keep this thread focused on the performance analysis

See https://github.com/ros2/rclcpp/issues/1750

ros-discourse commented 3 years ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/nav2-composition/22175/7

mauropasse commented 3 years ago

I did some CPU profiling on different architectures (x86, MIPS-500MHz, ARM quad-core -1.2GHz) around LoanedMessages publishing/subscribing a single byte message at a high frequency, to identify areas where we can save some CPU time. The use of such a small message is to point out work done by the Loaned Message infrastructure and reduce noise about message creation and handling.

Note: Below, when I say IPC I mean intra-process communication in RCLCPP layer, not the DDS intra-process, which is not exercised here.

FastDDS 2.3.x: First from x86 I got flamegraphs (CPU profiling) of the different message passing methods, (IPC ON/OFF, LoanedMsg) to identify major differences between them: FastDDS-Flamegraphs.zip

From the flamegraphs we can see that either the use of LoanedMsg or IPC OFF leads to almost the same path of APIs, and they seem to take similar CPU time.

The differences: on IPC OFF we have the serialize operation (there's no deserialize if data sharing is ON), while in the LoanedMsg we have the (expensive) construction/destruction of LoanedMessage instances. The take operation seems also to take more time on LoanedMessage (_rmw_take_loaned_message_withinfo vs _rmw_take_withinfo). Same with the return message, more expensive using LoanedMsg (_rmw_return_loaned_message_fromsubscription vs _returnmessage).

So in short, for small messages using LoanedMsg is more expensive than IPC OFF. For bigger messages, the increase of serialize/deserialize durations is what makes the difference in favour of LoanedMsg.

Based on flamegraphs of LoanedMsgs, I identified expensive operations and tried to reduce or remove them directly like:

RCLCPP (tag 13.0.0): LoanedMessage constructor and destructor are expensive:
    * can_loan_messages() is actually called at least 3 times in a pub/sub cycle
    * logger usage on destructor
Fast-DDS:  These operations that can be removed and still manage to pub/sub LoanMessages (which consume 10.8% CPU): 
    * check_acked_status (3.5%)
    * end_sample_access_nts  (2.9%)
    * intraprocess_heartbeat (2.5%)
    * change_removed_by_history (0.8%)
    * writer_is_matched (0.6%)
    * update_last_notified (0.5%)

A flamegraph called loaned-optimised.svg included in FastDDS-Flamegraphs.zip shows the resulting CPU usage after removal of apis.

Here some comparison graphs of running all mentioned alternatives on 2 different embedded systems. The graph show the CPU time used by the publisher thread and the executor thread (which executes the subscription, besides all the static executor work done on the waitset): cpu-time-pub-sub-1Byte The times shown are the average runtime of 2000 pub/sub cycles, obtained looking at thread activity (Command: _trace-cmd record -e schedswitch). The latency of a message would be approx the sum of the 2 thread times.

CycloneDDS Iceoryx: (v1.0.1/release_1.0 same results on both tags) CycloneDDS: iceoryx and master branch Got some flamegraphs (x86) comparing IPC ON/OFF / LoanedMsg, but haven't run tests in embedded systems since Iceoryx doesn't support 32bits archs: CycloneDDS-flamegraphs.zip

There seem to be quite a big overhead from IOX, manager of the shared memory (see violet area in flamegraph), which totally explains previous results on https://github.com/ros2/rclcpp/issues/1642#issuecomment-900387789 about big latency using Loaned Messages on CycloneDDS, compared to IPC OFF: loaned-cyclonedds-flamegraph

Lot of information here, tried to squeeze it as much as I could. Let me know if something is not clear. These results should serve to keep defining future works on performance improvements.

MatthiasKillat commented 2 years ago

@mauropasse @budrus Looking at the Flamegraphs and your conclusions I agree. I only consider the CycloneDDS/ iceoryx communication path here.

It is apparent that LoanedMessage construction is expensive but theoretically it does not have to be (it is mainly an allocation).
There is non-negligible overhead when iceoryx is used with LoanedMessage (zero-copy shared memory transfer). This is mainly in the iox::popo::Listener which is getting notified for each sample that is send.

Those are separate issues that can be resolved independently.

Loaned Message overhead

Ideally LoanedMessage construction should not need to zero any memory nor construct any payload data, this must be done by the user later via an emplace API. Construction should only acquire memory for the payload and initialize some meta-data of constant size. The emplace API called by the user would take constructor arguments and populate the payload memory via placement new.

iox Listener notification overhead

We need some kind of mechanism that informs us that there is data. However, there might be redundancy since the Executor will also look for data and perform other actions. So we are actually listening with two threads for the same data. Both of them use at some point semaphore/futexes as a signal mechanism, which are not exactly cheap (but there are no real alternatives I think). It might be possible to improve the performance of the iox::Listener, but the notification redundancy with the Executor remains.

All of this explains why the data pretty well I think.

If IPC is on we only need to pass a pointer and notify the receiver (no copies needed). Notification cost will be the major factor here.
If IPC is off and we do not use shared memory we avoid the iox::Listener but pay serialization costs. At some point these costs exceed the relatively constant iox::listener costs (but not for small messages).
If we use LoanedMessage we pay the iox::Listener costs but reduce/avoid serialization costs. We also incur the cost of the LoanedMessage constructor/destructor.

Reducing the iox::Listener costs and avoiding unnecessary initialization in LoanedMessage will likely lead to improvements and make the shared memory viable for small messages as well. The latter might require API changes in LoanedMessage to support the emplacement construction. It will probably not be possible to make this faster than true intra process communication via pointer passing and notification.

ros2 / rclcpp

High CPU usage when using DDS intra-process communication #1642

Loaned Message overhead

iox Listener notification overhead