Reliable publisher/subscriber-connection drops messages in high frequency scenarios with build type Release

DensoADAS commented 4 years ago

Bug report

Required Info:

Operating System:
- Ubuntu 18.04
Installation type:
- deb packages or source
Version or commit hash:
- Eloquent or ros2 master branch (22. October 2019)
DDS implementation:
- FastRTPS, OpenSplice, RTI connext
Client library (if applicable):
- rclcpp

Steps to reproduce issue

Create a simple publisher/subscriber pair, both having reliable QoS settings. Publish a fixed number of messages on the publisher side with a high frequency (>1kHZ -> <1ms intervals (e.g. 1us)).

Example: Talker:

class Talker : public rclcpp::Node
{
public:
    DEMO_NODES_CPP_PUBLIC
    explicit Talker(const rclcpp::NodeOptions& options)
        : Node("talker", options)
    {
        setvbuf(stdout, NULL, _IONBF, BUFSIZ);
        auto publish_message =
            [this]() -> void
            {
                auto msg = std::make_unique<std_msgs::msg::Int32>();
                msg->data = count_++;
                pub_->publish(std::move(msg));

                if (count_ > 100000)
                {
                    timer_->cancel();
                    RCLCPP_INFO(this->get_logger(), "Sent 100000 messages");
                }
            };

        rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable();
        pub_ = this->create_publisher<std_msgs::msg::Int32>("chatter", qos);

        RCLCPP_INFO(this->get_logger(), "Sending messages...");
        timer_ = this->create_wall_timer(1us, publish_message);
    }

private:
    int32_t count_ = 1;
    rclcpp::Publisher<std_msgs::msg::Int32>::SharedPtr pub_;
    rclcpp::TimerBase::SharedPtr timer_;
};

Listener:

class Listener : public rclcpp::Node
{
public:
    DEMO_NODES_CPP_PUBLIC
    explicit Listener(const rclcpp::NodeOptions& options)
        : Node("listener", options),
          m_messageID(0),
          m_numReceivedMsgs(0)
    {
        setvbuf(stdout, NULL, _IONBF, BUFSIZ);
        auto callback =
            [this](const std_msgs::msg::Int32::SharedPtr msg) -> void
            {
                ++m_numReceivedMsgs;
                int32_t msgDiff = msg->data - m_messageID;
                m_messageID = msg->data;
            };

        rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable();
        sub_ = create_subscription<std_msgs::msg::Int32>("chatter", qos, callback);
        RCLCPP_INFO(this->get_logger(), "Ready to receive messages");
    }

    ~Listener()
    {
        RCLCPP_INFO(this->get_logger(), "Received %d messages", m_numReceivedMsgs);
        RCLCPP_INFO(this->get_logger(), "Lost %d messages", 100000 - m_numReceivedMsgs);
    }

private:
    rclcpp::Subscription<std_msgs::msg::Int32>::SharedPtr sub_;
    int32_t m_messageID;
    int32_t m_numReceivedMsgs;
};

Expected behavior

The subscriber receives all messages in order or an exception is thrown (or some other error propagation method).

Actual behavior

Some messages are published but not received on the subscriber side without any notification to the user or thrown errors. Basically behaves similar to best_effort strategy.

Additional information

A full implementation example which reproduces the error can be found here: https://github.com/DensoADAS/high_freq_pub_example.git This is basically the Talker/Listener demo from the demo_nodes_cpp package with a different message type for easier counting and a higher pub rate.

This problem occurs with various middleware implementations (as well as with the eloquent binaries and the ros2 master) and usually gets worse the higher the publisher frequency is. Also different QoS settings were tested (different depth/keepAll) but the problem still occurs (less frequently with keepAll but some messages are still dropped). The problem can be observed already with normal publishing rates, e.g. intervals between published messages of about 1ms.

With the simple example above the following dropped message counts were observed:

Sent messages: 10000 -> repeating 5x:

Publisher interval: 1us:           
FastRTPS:       7541    7926    7192    8623    8255
OpenSplice:     4946    426     5181    3934    5253
RTI Connext:    8603    2755    1417    5355    5467 

Publisher interval: 1ms:
FastRTPS:       53      0       33      1474    61

Sent messages: 100000 -> repeating 5x:

Publisher interval: 1us   
FastRTPS:       96052   94911   96321   97322   95244
OpenSplice:     9369    7710    15950   51295   12835
RTI Connext:    45810   22624   41902   30460   21139

fujitatomoya commented 4 years ago

@DensoADAS

can you describe your platform detail? what hardware are you trying to use?

DensoADAS commented 4 years ago

Multiple machines were tested, all got the issue:

OS: Ubuntu 18.04 CPU: Intel Core i5-6500 @ 3.20GHz RAM: 24GB
OS: Ubuntu 18.04 CPU: Intel Xeon E5-2680 v4 @ 2.40GHz RAM: 16GB
OS: Ubuntu 18.04 (virtual machine on Windows 10 host) CPU: Intel Core i7-6820HQ @ 2.70GHz RAM: 8GB

The software on all machines is up-to-date. The test was executed on all machines using the eloquent binaries and in the VM also with the latest ros2 master.

Tell me if you need more information.

claireyywang commented 4 years ago

Thanks for reporting this issue! I was able to reproduce this behaviour by sending 10 msgs in frequencies ranging from 1us to 1s. The higher frequency messages were sent in, the more loss I observed. It seems like Opensplice has relatively the most reliable delivery among all middleware implementations, with Fastrtps being the worst. This also confirms the dropped msgs counts. I'm currently not sure what caused this problem but will look into it.

dirk-thomas commented 4 years ago

I tried to reproduce the problem and the first thing I noticed is that the talker is starting to publish messages right away without waiting for the subscriber to be discovered and matched.

You can easily reproduce that by printing msg->data for the first received message in the listener - it is greater than 1 and the offset matches the missing message count exactly.

Adding a sleep in the talker after creating the publisher but before creating the timer (e.g. std::this_thread::sleep_for(std::chrono::seconds(1));) lets the discovery happen and match the two endpoints before the first message is published. With that change the listener receives all the messages for me.

Diff to sleep before publishing and print the first received msg->data:

``` diff --git a/src/listener.cpp b/src/listener.cpp index 52b2bca..7f29d73 100644 --- a/src/listener.cpp +++ b/src/listener.cpp @@ -26,6 +26,10 @@ public: auto callback = [this](const std_msgs::msg::Int32::SharedPtr msg) -> void { + if (first) { + first = false; + fprintf(stderr, "%d\n", msg->data); + } ++m_numReceivedMsgs; int32_t msgDiff = msg->data - m_messageID; m_messageID = msg->data; @@ -51,6 +55,7 @@ public: } private: + bool first = true; rclcpp::Subscription::SharedPtr sub_; int32_t m_messageID; int32_t m_numReceivedMsgs; diff --git a/src/talker.cpp b/src/talker.cpp index f57cd3c..62c067e 100644 --- a/src/talker.cpp +++ b/src/talker.cpp @@ -41,6 +41,7 @@ public: rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable(); pub_ = this->create_publisher("chatter", qos); + std::this_thread::sleep_for(std::chrono::seconds(1)); RCLCPP_INFO(this->get_logger(), "Sending messages..."); timer_ = this->create_wall_timer(1us, publish_message); } ```

Please comment if you can reproduce this and if there is any problem remaining.

Tobi2001 commented 4 years ago

@dirk-thomas The sleep fixes the missing messages at the beginning but the general problem still exists. Messages are dropped (quite randomly as it seems) throughout the publishing process; you can check this by adding the commented out code in line 34 of the listener:

RCLCPP_ERROR(this->get_logger(), "Lost %d msgs, curr: %d", msgDiff - 1, msg->data);

With the sleep and 100000 sent messages I still get around 80000 dropped ones with FastRTPS and the eloquent binaries.

fujitatomoya commented 4 years ago

@claireyywang @dirk-thomas @Tobi2001

confirmed this problem with Fast-RTPS. the more frequency, the more losses happen.

claireyywang commented 4 years ago

I looked into fastrtps and it is mentioned that in high rate or large data scenarios, there are a couple of parameters need to be tuned to avoid network package drop. Based on the documentation and presentation slides, it is possible to tune those parameters using an xml file.

dirk-thomas commented 4 years ago

The sleep fixes the missing messages at the beginning but the general problem still exists.

I can't reproduce the problem. With the sleep in place (and the error message commented in) the listener still reports the following for me with no error messages:

[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages

claireyywang commented 4 years ago

@DensoADAS @Tobi2001 @fujitatomoya

@dirk-thomas and I did more testing and found out that by default (without any args), fastrtps build type is release, which caused the ~70% msg drop. But if we build debug instead, which is what @dirk-thomas used, the performance improves significantly: msg drop around 0-10%.

Here are the steps we took:

  cd ~/ros2_ws/build/ 
  rm -rf fastrtps  # delete fastrtps build folder from workspace
  cd ~/ros2_ws # go to the root of workspace
  colcon build --packages-select fastrtps --cmake-args -DCMAKE_BUILD_TYPE=Debug  # rebuild in debug type

Please let me know if you are able to get better performance after trying it.

dirk-thomas commented 4 years ago

@richiprosima The behavior we are seeing here is pretty weird. When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better - it drops none (or sometimes up to 10%).

DensoADAS/high_freq_pub_example#1 contains the above example with the additional sleep to ensure the endpoints have matched.

I already tried to get rid of all conditional logic in FastRTPS (depending on __DEBUG and NDEBUG) but that didn't make any difference. Any insight would be more than welcome.

fujitatomoya commented 4 years ago

@DensoADAS

i confirmed that cyclonedds does not have any message drop. i have tested more than 10 times with cyclonedds, can you confirm for double check?

[Subscriber]
root@6cc2c62d0dd0:~/ros2_colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub listener 
[INFO] [listener_high_freq]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener_high_freq]: Received 100000 messages
[INFO] [listener_high_freq]: Lost 0 messages

[Publisher]
root@6cc2c62d0dd0:~/ros2_colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker
[INFO] [talker_high_freq]: Sending messages...
[INFO] [talker_high_freq]: Sent 100000 messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)

@claireyywang @dirk-thomas it surely depends on dds implementation, need some help from each dds vendor. do you want me to create issue for each rmw implementation such as rmw_fastrtps?

dirk-thomas commented 4 years ago

@fujitatomoya please try to reproduce the behavior with the different build types of FastRTPS mentioned above before adding another dimension to the problem.

fujitatomoya commented 4 years ago

@dirk-thomas

When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better - it drops none (or sometimes up to 10%).

confirmed almost same behavior.

dirk-thomas commented 4 years ago

confirmed almost same behavior.

Thanks for checking.

Since the current ticket indicates a problem with FastRTPS I moved the ticket to the rmw_fastrtps repository.

it surely depends on dds implementation, need some help from each dds vendor. do you want me to create issue for each rmw implementation such as rmw_fastrtps?

If there are similar problems with other RMW implementations please open a separate ticket in their RMW repo - or if applicable to all RMW impl. in one of the general ROS 2 repos.

The behavior we are seeing here is pretty weird. When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better - it drops none (or sometimes up to 10%).

@richiprosima Can you please comment here that you have seen this ticket and that you will look into the problem. The ticket should have enough information to reproduce the problem. This seems to be a pretty severe performance problem with the default build type (Release) of FastRTPS.

DensoADAS commented 4 years ago

I have tried cyclonedds and I can't get any drops as well on a real machine (release mode). In a VM though there are some, definitely way less than in the other dds implementations though (in about 30% of tries I get drops -> 0-2% messages).

Also debug mode works better in comparison for all dds implementations but with FastRTPS about 50% of the sent messages still can't be received (the runtime is increasing severely though, which might shadow the actual problem)

richiware commented 4 years ago

I was able to do a quick digging. Our first thought was it is a problem with QoS configuration. In particular with the History depth and the Asynchronous thread. By default in ROS2, FastRTPS and Connext uses the asynchronous thread. This implies the user's sample is stored in the History and another thread is in charge of sent it, not the user's thread directly. Due to the example is writing so fast maybe the history is getting full and removing sample before being sent by the async thread. I made the test in my machine using localhost:

Default ROS2 configuration:

[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 89098 messages
[INFO] [listener]: Lost 10902 messages

Changing QoS to use sync thread:

[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages

FYI. We are working to support sending large data using the synchronous thread. Then we can use it by default in ROS2.

fujitatomoya commented 4 years ago

@richiware

this QoS for sync thread you mentioned previous comment, that is for fastrtps native setting, right? (not ROS2 QoS)

could you tell us how/what to set the QoS for fastrtps to enable sync thread? i will confirm it on my side as well.

richiware commented 4 years ago

Write a DEFAULT_FASTRTPS_PROFILES.xml file in the directory where you are launching the executables. The XML content:

<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <publisher profile_name="publisher_default"
        is_default_profile="true">
            <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
        </publisher>
    </profiles>
</dds>

When launching the talker, use this command:

RMW_FASTRTPS_USE_QOS_FROM_XML=1 ./build/high_freq_pub/talker

dirk-thomas commented 4 years ago

@richiware While the custom config might workaround the problem I would like to find out why a release build has so much worse performance than a debug build.

JaimeMartin commented 4 years ago

@dirk-thomas

My guess: in the release build you are filling the buffers faster than in the debug build.

dirk-thomas commented 4 years ago

My guess: in the release build you are filling the buffers faster than in the debug build.

@JaimeMartin I don't understand how this statement makes any sense. Both builds use the exact same application with a fixed frequency. How can the debug build keep up with the rate and not loose any packages but the release build drops 60-70%?

JaimeMartin commented 4 years ago

@dirk-thomas because perhaps in the debug release you are not really sending at that frequency as it takes more time to call write. I have read they are using periods of sleep of 1 uS.

We will have more details soon.

dirk-thomas commented 4 years ago

perhaps in the debug release you are not really sending at that frequency as it takes more time to call write.

@JaimeMartin I added measuring the time it takes to do the publishing in DensoADAS/high_freq_pub_example#3. For me the put is the following which doesn't indicate that the release build is publishing data faster at all:

Debug build: [INFO] [talker]: Sent 100000 messages in 13,772,113,709 ns
Release build: [INFO] [talker]: Sent 100000 messages in 15,871,665,372 ns

JaimeMartin commented 4 years ago

@dirk-thomas thanks for the experiment. We are working on this now, we will update you as soon we have more information.

fujitatomoya commented 4 years ago

@dirk-thomas @JaimeMartin @richiware

confirmed no messages dropped with the following QoS profile,

<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <publisher profile_name="publisher_default"
        is_default_profile="true">
            <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
        </publisher>
    </profiles>

root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub talker 
[INFO] [talker_high_freq]: Sending messages...
[INFO] [talker_high_freq]: Sent 100000 messages

root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub listener 
[INFO] [listener_high_freq]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener_high_freq]: Received 100000 messages
[INFO] [listener_high_freq]: Lost 0 messages

Barry-Xu-2018 commented 4 years ago

Dear all,

About this problem, I have different opinion. I don't think 'drop' behavior is a problem.

'Reliable' is Qos for DDS layer. In specification (Data Distribution Service, v1.4, page 98), 'Reliable' means "Specifies the Service will attempt to deliver all samples in its history. Missed samples may be retried." Based on my understanding, it make sure sample in history can be delivered successfully. That is, once sample is delivered, delivery must be successful (make sure sample is received). Not make sure all samples in history must be delivered (In this case, samples should be dropped from history).

While the speed of importing samples is faster than the speed of DDS send, 'Drop' behavior must exists except HISTORY is set as KEEP_ALL.

For test sample about talker and listener, use history depth with 1000, but send data samples count is 10000, this must cause the problem.

From fastrtps publisher side, publisher post data into the publisher history, and use other thread to send data.
If fastrtps send data by transport is slower than ros2 publish data, the data might be lost(removed) since history depth limitation.

```c++
    bool PublisherHistory::add_pub_change(
            CacheChange_t* change,
            WriteParams &wparams,
            std::unique_lock<RecursiveTimedMutex>& lock,
            std::chrono::time_point<std::chrono::steady_clock> max_blocking_time)
    {
        if(m_isHistoryFull)
        {
            bool ret = false;

            if(m_historyQos.kind == KEEP_ALL_HISTORY_QOS)
            {
                ret = this->mp_writer->try_remove_change(max_blocking_time, lock);
            }
            else if(m_historyQos.kind == KEEP_LAST_HISTORY_QOS)
            {
                ret = this->remove_min_change();
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^      <== delete the published data that might be not send
            }
```

From fastrtps subscriber side, subscriber get data into the subscriber history, and ros2 use other thread to call subscriber_callback.
If the ros2 subscriber callback (ros2 got data from fastrtps) is slower than rtps receiving data, the data might be lost.

    bool SubscriberHistory::received_change_keep_last_no_key(
            CacheChange_t* a_change,
            size_t /* unknown_missing_changes_up_to */ )
    {
        bool add = false;
        if (m_changes.size() < static_cast<size_t>(m_historyQos.depth) )
        {
            add = true;
        }
        else
        {
            // Try to substitute the oldest sample.

            // As the history should be ordered following the presentation QoS, we can always remove the first one.
            add = remove_change_sub(m_changes.at(0));
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       <== delete the subscriber data that might not be got by ros2
        }

        if (add)
        {
            return add_received_change(a_change);
        }

        return false;
    }

NOTE: Is this behavior a problem?

use rclcpp::KeepAll() / rclcpp::KeepLast(100000) to instead of rclcpp::KeepLast(1000), it works good.

root@bdk:~/workspace# ros2 run high_freq_pub listener [INFO] [listener]: Ready to receive messages ^C[INFO] [rclcpp]: signal_handler(signal_value=2) [INFO] [listener]: Received 100000 messages [INFO] [listener]: Lost 0 messages

root@bdk:~/workspace# ros2 run high_freq_pub talker [INFO] [talker]: Sending messages... [INFO] [talker]: Sent 100000 messages ^C[INFO] [rclcpp]: signal_handler(signal_value=2)

MiguelCompany commented 4 years ago

Thank you very much @Barry-Xu-2018 for your thorough explanation

I would like to add what happens when not using the asynchronous writing thread. In that case, the thread adding the sample is blocked until the RTPS datagram is sent to the network, so each sample is added after the previous one has been physically put on the socket output buffer, and then the publisher doesn't drop any message

dirk-thomas commented 4 years ago

While all of this is correct it doesn't explain why it works in debug mode but fails in release mode. And that is the pending question to answer.

fujitatomoya commented 4 years ago

@Barry-Xu-2018 @MiguelCompany

use rclcpp::KeepAll() / rclcpp::KeepLast(100000) to instead of rclcpp::KeepLast(1000), it works good.

are you sure about this? this is actually what I've done before, I receive the following error from rmw_fastrtps right after start publishing data.

root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker 
[INFO] [talker_high_freq]: Sending messages...
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  failed to publish message: cannot publish data, at /root/ros2_colcon_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:52, at /root/ros2_colcon_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:290

i did not have time to debug the Fast-RTPS implementation, but it seems that new change does not get applied via create_new_change().

@DensoADAS

could you please confirm if you have the same runtime error with following? https://github.com/DensoADAS/high_freq_pub_example/pull/4

DensoADAS commented 4 years ago

@fujitatomoya Can confirm the exception which probably happens because the resource limit is exceeded. When catching the exception and resending the message (see this branch for an example) all messages are received by the subscriber.

The runtime of the FastRTPS dds is quite high though in comparison to other vendors:

Time until all messages (100000) are published
    FastRTPS: ~20min
    OpenSplice: ~14s
    CycloneDDS: ~13s

Also the workaround using the XML file for FastRTPS does not really change anything for me. The only difference is that I get outputs like the following:

2019-11-25 09:47:58.770 [RTPS_HISTORY Error] Change payload size of '96' bytes is larger than the history payload size of '35' bytes and cannot be resized. -> Function add_change

Barry-Xu-2018 commented 4 years ago

@fujitatomoya

are you sure about this?

I used debug version of ros2 (workspace contains all packages) to test.

If I use ros2 with release to test high_freq_pub, the above error happened (high probability). It seems that there is a bug in rmw_fastrtps/fastrtps.

root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker [INFO] [talker_high_freq]: Sending messages...
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  failed to publish message: cannot publish data, at /root/ros2_colcon_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:52, at /root/ros2_colcon_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:290

MiguelCompany commented 4 years ago

@fujitatomoya @Barry-Xu-2018 @DensoADAS

When using KeepAll(), additional parameters get into action: resource_limits and max_blocking_time.

Resource limits is, in a way, similar to the history depth. It specifies the maximum number of samples allowed. You can set it to infinity using maximum vales of 0
max_blocking_time is part of the reliability QoS. It is specified as a duration with sec and nanosec. When the publisher tries to add a new sample and the history is full (due to resource limits), it will block for this amount of time (default 100ms) and return false if it couldn't add the change to the history.

So we have two options for our XML file:

<!-- XML settings for infinite resource limits -->
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <publisher profile_name="publisher_default" is_default_profile="true">
            <topic>
                <resourceLimitsQos>
                    <max_samples>0</max_samples>
                    <max_instances>0</max_instances>
                    <max_samples_per_instance>0</max_samples_per_instance>
                    <allocated_samples>0</allocated_samples>
                </resourceLimitsQos>
            </topic>
        </publisher>
    </profiles>
</dds>

<!-- XML settings for infinite blocking time -->
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <publisher profile_name="publisher_default" is_default_profile="true">
            <qos>
                <reliability>
                    <max_blocking_time>
                        <sec>DURATION_INFINITE_SEC</sec>
                    </max_blocking_time>
                </reliability>
            </qos>
        </publisher>
    </profiles>
</dds>

fujitatomoya commented 4 years ago

@MiguelCompany @Barry-Xu-2018 @DensoADAS

the precious setting you mentioned, both of them lead the following error,

root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub listener
terminate called after throwing an instance of 'eprosima::fastcdr::exception::NotEnoughMemoryException'
  what():  Not enough memory in the buffer stream

i think that this is cpu and memory consumption trading-off, so it is about how efficient can be processed. so far https://github.com/ros2/rmw_fastrtps/issues/338#issuecomment-557140854 can work without any message drop.

and as @dirk-thomas mentined on https://github.com/ros2/rmw_fastrtps/issues/338#issuecomment-557170716, i would like to know the (expected) reason for that. (it just does not add up.)

richiware commented 4 years ago

Sorry for the time-lapse since my last comment. I wanted to make a deep investigation about this problem, and I also was out-off-office for several days.

Although the reason could have been how reliable publisher works when its history is configured as KEEP_LAST, finally this is not the case. Using netstat, I discovered that the kernel was discarding UDP packets. The socket's buffer must be getting full because the subscriber is not draining the buffer fast enough.

Then the problem was in the subscriber's side. I used callgrind to check why the subscriber is not getting UDP packets fast enough. I've found several logical code blocks which I was able to improve. eProsima/Fast-RTPS#899 is a meta-ticket that summarizes these improvements. Also to make easier for you, I've joined all improvements in feature/test-improvements/1.9.x branch, so you can test them with high_freq_pub example,

@dirk-thomas, about why Debug was receiving all the samples and Release was not, this is my conclusion: with Release the subscriber's socket buffer is getting full faster than in Debug so the kernel starts to discard UDP packets.

	1.9.x	1.9.x	feature/test-improvements	feature/test-improvements
	Release	Debug	Release	Debug
nanoseconds	28,118,651,346	28,553,887,775	5,977,088,688	28,560,103,031
success	FAIL	OK	OK	OK

And last question I think I have to answer. Why does the talker take so much time to send all the samples in Release 1.9.x? Because the subscriber is losing samples due to the kernel discarding UDP packets, and then reliability mechanism starts to work and the publisher's asynchronous thread has much to do (sending requested samples for the subscriber), blocking the user's thread more time.

DensoADAS commented 4 years ago

@richiware I have tested the improvement branch with our repo and can confirm that FastRTPS behaves much better than before. I have tested the example multiple times (down below are 5 test runs for each configuration) and got the following results:

Using the master branch of high_freq_example and everything built with debug mode:

5,859,654,444 ns -> 0 lost messages
5,742,361,558 ns -> 0 lost messages
8,303,138,780 ns -> 3040 lost messages
8,034,343,035 ns -> 5370 lost messages
5,901,925,910 ns -> 0 lost messages

Using the feature/keep_all_qos_resending branch of high_freq_example and everything built with debug mode:

6,201,670,377 ns -> 0 lost messages
33,496,412,209 ns -> 0 lost messages
6,381,162,971 ns -> 0 lost messages
30,606,065,461 ns -> 0 lost messages
12,073,535,318 ns -> 0 lost messages

Using the master branch of high_freq_example and everything built with release mode:

2,533,380,340 ns -> 19866 lost messages
2,290,880,031 ns -> 0 lost messages
2,286,291,540 ns -> 0 lost messages
2,484,636,737 ns -> 12342 lost messages
2,314,744,746 ns -> 0 lost messages

Using the feature/keep_all_qos_resending branch of high_freq_example and everything built with release mode:

2,191,338,534 ns -> 0 lost messages
2,179,242,513 ns -> 0 lost messages
2,159,576,817 ns -> 0 lost messages
2,163,511,295 ns -> 0 lost messages
2,152,910,959 ns -> 0 lost messages

With the feature/keep_all_qos_resending branch and release mode the publish time is around 2,000,000,000 ns for about 90% of my test runs and then I also get no exception when publishing (so no re-publishing occurs). Sometimes though the publish time takes about twice that value and the time until the subscriber gets all messages is comparably higher as well then (~ 2-3 times higher than usual, haven't exactly measured this yet). Everything above was tested without the XML files mentioned above and on machine No.1 which I have mentioned in one of my previous comments.

MiguelCompany commented 4 years ago

All PRs mentioned on eProsima/Fast-RTPS#899 have been merged.

@DensoADAS I think this can be closed.

DensoADAS commented 4 years ago

@MiguelCompany I've tested the example again with the latest ros2 sources (Fast-RTPS version: bec830c21e82f850a909fde6739e90d07f0b0bd6), everything built with Release mode:

When using the master branch of the example (keepLast):

[INFO] [1580374873.252230767] [talker_high_freq]: Sent 100000 messages in 21844108083 ns
[INFO] [1580374874.288028606] [rclcpp]: signal_handler(signal_value=2)

real    0m24.217s
user    0m22.093s
sys     0m1.203s

I receive all messages most of the times with durations similar to the one measured above. Sometimes (rarely though) I don't receive all messages (about 1% are missing then).

When using the feature/keep_all_qos_resending branch (keepAll settings):

[INFO] [1580374661.961899286] [talker_high_freq]: Sent 100000 messages in 198026381838 ns
[INFO] [1580374699.792305152] [rclcpp]: signal_handler(signal_value=2)

real    3m57.195s
user    0m3.150s
sys     0m1.295s

I receive all messages but a lot of messages have to be resent since an exception is thrown when publishing them. The resulting duration is quite high all the time.

Maybe somebody else could check/verify this behaviour, since the durations seem quite high.

fujitatomoya commented 4 years ago

@MiguelCompany @DensoADAS

there are still message drops and failure to publish, confirmed.

The resulting duration is quite high all the time.

maybe you need to define what is the quite high, instead it is hard to understand.

the followings are release build, up-to-date ros2 source build environment.

using the master branch of the example (keepLast):

rmw_fastrtps

root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub talker
[INFO] [1582260699.587175751] [talker_high_freq]: Sending messages...
[INFO] [1582260734.571079814] [talker_high_freq]: Sent 100000 messages in 34983911263 ns
^C[INFO] [1582260737.896654254] [rclcpp]: signal_handler(signal_value=2)

root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub listener 
[INFO] [1582260695.229655656] [listener_high_freq]: Ready to receive messages
^C[INFO] [1582260738.803838167] [rclcpp]: signal_handler(signal_value=2)
[INFO] [1582260738.804346691] [listener_high_freq]: Received 99834 messages
[INFO] [1582260738.804354721] [listener_high_freq]: Lost 166 messages

rmw_cyclonedds (FYI)

root@13a691159527:~/ros2_ws/colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub talker
[INFO] [1582260796.731842463] [talker_high_freq]: Sending messages...
[INFO] [1582260797.296652333] [talker_high_freq]: Sent 100000 messages in 564818430 ns
^C[INFO] [1582260813.219777432] [rclcpp]: signal_handler(signal_value=2)

root@13a691159527:~/ros2_ws/colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub listener 
[INFO] [1582260793.507338493] [listener_high_freq]: Ready to receive messages
^C[INFO] [1582260804.081090170] [rclcpp]: signal_handler(signal_value=2)
[INFO] [1582260804.081526864] [listener_high_freq]: Received 100000 messages
[INFO] [1582260804.081540514] [listener_high_freq]: Lost 0 messages

using the feature/keep_all_qos_resending branch (keepAll)::

rmw_fastrtps

root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub talker
[INFO] [1582261236.582966810] [talker_high_freq]: Sending messages...
[INFO] [1582261237.008120375] [talker_high_freq]: Failed to publish msg #5001
[INFO] [1582261237.108303590] [talker_high_freq]: Failed to publish msg #5001
...<every 5000, there is failure to publish message>
[INFO] [1582261295.686259550] [talker_high_freq]: Sent 100000 messages in 59103302060 ns
^C[INFO] [1582261298.092971868] [rclcpp]: signal_handler(signal_value=2)

root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub listener 
[INFO] [1582261231.757531033] [listener_high_freq]: Ready to receive messages
[INFO] [1582261295.686274440] [listener_high_freq]: Received 100000 messages
^C[INFO] [1582261298.844459160] [rclcpp]: signal_handler(signal_value=2)
[INFO] [1582261298.845034745] [listener_high_freq]: Received 100000 messages
[INFO] [1582261298.845056715] [listener_high_freq]: Lost 0 messages

rmw_cyclonedds (FYI)

root@13a691159527:~/ros2_ws/colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub talker
[INFO] [1582261368.855976138] [talker_high_freq]: Sending messages...
[INFO] [1582261369.469863591] [talker_high_freq]: Sent 100000 messages in 613895923 ns
^C[INFO] [1582261372.033819949] [rclcpp]: signal_handler(signal_value=2)

root@13a691159527:~/ros2_ws/colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub listener 
[INFO] [1582261363.066281806] [listener_high_freq]: Ready to receive messages
[INFO] [1582261369.469860111] [listener_high_freq]: Received 100000 messages
^C[INFO] [1582261372.668585602] [rclcpp]: signal_handler(signal_value=2)
[INFO] [1582261372.669049716] [listener_high_freq]: Received 100000 messages
[INFO] [1582261372.669058966] [listener_high_freq]: Lost 0 messages

DensoADAS commented 4 years ago

@fujitatomoya

maybe you need to define what is the quite high, instead it is hard to understand.

That was related to the time measurements in my last comment, as it took about 4 minutes with KeepAll compared to about 24 seconds with KeepLast (just tested it again with the latests ros2 sources and the durations are still the same).

<every 5000, there is failure to publish message>

For me it is 5000 only at the beginning (until about 66000 messages were sent), then the message difference until the publishing fails again drops to lower values (200-300).

ZhenshengLee commented 5 months ago

<resourceLimitsQos>
                    <max_samples>0</max_samples>
                    <max_instances>0</max_instances>
                    <max_samples_per_instance>0</max_samples_per_instance>
                    <allocated_samples>0</allocated_samples>
                </resourceLimitsQos>

In the version of Foxy, in which the fastdds version is 2.1.1, setting resourceLimitsQos to all zero, is invalid, and will cause the error about depth must be <= max_samples

[RTPS_QOS_CHECK Error] INCORRECT TOPIC Qos (ros_discovery_info): depth must be <= max_samples -> Function checkQos

the issue get resolved in the version of humble, in which the fastdds version is 2.6.7.

ZhenshengLee commented 5 months ago

When using KeepAll(), additional parameters get into action: resource_limits and max_blocking_time.

Resource limits is, in a way, similar to the history depth. It specifies the maximum number of samples allowed. You can set it to infinity using maximum vales of 0

max_blocking_time is part of the reliability QoS. It is specified as a duration with sec and nanosec. When the publisher tries to add a new sample and the history is full (due to resource limits), it will block for this amount of time (default 100ms) and return false if it couldn't add the change to the history.

according to this doc https://fast-dds.docs.eprosima.com/en/v2.6.7/fastdds/use_cases/realtime/blocking.html , only if the cmake_option STRICT_REALTIME get open, the max_blocking_time will take effect, and write will timeout and return false.

in ROS2/Fastdds, the STRICT_REALTIME is set to OFF, so you will need to compile the fastrtps from source.

ros2 / rmw_fastrtps