Open DensoADAS opened 4 years ago
@DensoADAS
can you describe your platform detail? what hardware are you trying to use?
Multiple machines were tested, all got the issue:
The software on all machines is up-to-date. The test was executed on all machines using the eloquent binaries and in the VM also with the latest ros2 master.
Tell me if you need more information.
Thanks for reporting this issue! I was able to reproduce this behaviour by sending 10 msgs in frequencies ranging from 1us to 1s. The higher frequency messages were sent in, the more loss I observed. It seems like Opensplice has relatively the most reliable delivery among all middleware implementations, with Fastrtps being the worst. This also confirms the dropped msgs counts. I'm currently not sure what caused this problem but will look into it.
I tried to reproduce the problem and the first thing I noticed is that the talker is starting to publish messages right away without waiting for the subscriber to be discovered and matched.
You can easily reproduce that by printing msg->data
for the first received message in the listener - it is greater than 1 and the offset matches the missing message count exactly.
Adding a sleep in the talker after creating the publisher but before creating the timer (e.g. std::this_thread::sleep_for(std::chrono::seconds(1));
) lets the discovery happen and match the two endpoints before the first message is published. With that change the listener receives all the messages for me.
Please comment if you can reproduce this and if there is any problem remaining.
@dirk-thomas The sleep fixes the missing messages at the beginning but the general problem still exists. Messages are dropped (quite randomly as it seems) throughout the publishing process; you can check this by adding the commented out code in line 34 of the listener:
RCLCPP_ERROR(this->get_logger(), "Lost %d msgs, curr: %d", msgDiff - 1, msg->data);
With the sleep and 100000 sent messages I still get around 80000 dropped ones with FastRTPS and the eloquent binaries.
@claireyywang @dirk-thomas @Tobi2001
confirmed this problem with Fast-RTPS. the more frequency, the more losses happen.
I looked into fastrtps and it is mentioned that in high rate or large data scenarios, there are a couple of parameters need to be tuned to avoid network package drop. Based on the documentation and presentation slides, it is possible to tune those parameters using an xml
file.
The sleep fixes the missing messages at the beginning but the general problem still exists.
I can't reproduce the problem. With the sleep in place (and the error message commented in) the listener still reports the following for me with no error messages:
[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages
@DensoADAS @Tobi2001 @fujitatomoya
@dirk-thomas and I did more testing and found out that by default (without any args), fastrtps
build type is release
, which caused the ~70% msg drop. But if we build debug
instead, which is what @dirk-thomas used, the performance improves significantly: msg drop around 0-10%.
Here are the steps we took:
cd ~/ros2_ws/build/
rm -rf fastrtps # delete fastrtps build folder from workspace
cd ~/ros2_ws # go to the root of workspace
colcon build --packages-select fastrtps --cmake-args -DCMAKE_BUILD_TYPE=Debug # rebuild in debug type
Please let me know if you are able to get better performance after trying it.
@richiprosima The behavior we are seeing here is pretty weird. When building FastRTPS with the build type Release
it drops a very significant number of messages (~60%-70%). When building Debug
it surprisingly behaves much better - it drops none (or sometimes up to 10%).
DensoADAS/high_freq_pub_example#1 contains the above example with the additional sleep to ensure the endpoints have matched.
I already tried to get rid of all conditional logic in FastRTPS (depending on __DEBUG
and NDEBUG
) but that didn't make any difference. Any insight would be more than welcome.
@DensoADAS
i confirmed that cyclonedds does not have any message drop. i have tested more than 10 times with cyclonedds, can you confirm for double check?
[Subscriber]
root@6cc2c62d0dd0:~/ros2_colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub listener
[INFO] [listener_high_freq]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener_high_freq]: Received 100000 messages
[INFO] [listener_high_freq]: Lost 0 messages
[Publisher]
root@6cc2c62d0dd0:~/ros2_colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker
[INFO] [talker_high_freq]: Sending messages...
[INFO] [talker_high_freq]: Sent 100000 messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
@claireyywang @dirk-thomas it surely depends on dds implementation, need some help from each dds vendor. do you want me to create issue for each rmw implementation such as rmw_fastrtps?
@fujitatomoya please try to reproduce the behavior with the different build types of FastRTPS mentioned above before adding another dimension to the problem.
@dirk-thomas
When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better - it drops none (or sometimes up to 10%).
confirmed almost same behavior.
confirmed almost same behavior.
Thanks for checking.
Since the current ticket indicates a problem with FastRTPS I moved the ticket to the rmw_fastrtps
repository.
it surely depends on dds implementation, need some help from each dds vendor. do you want me to create issue for each rmw implementation such as rmw_fastrtps?
If there are similar problems with other RMW implementations please open a separate ticket in their RMW repo - or if applicable to all RMW impl. in one of the general ROS 2 repos.
The behavior we are seeing here is pretty weird. When building FastRTPS with the build type
Release
it drops a very significant number of messages (~60%-70%). When buildingDebug
it surprisingly behaves much better - it drops none (or sometimes up to 10%).
@richiprosima Can you please comment here that you have seen this ticket and that you will look into the problem. The ticket should have enough information to reproduce the problem. This seems to be a pretty severe performance problem with the default build type (Release) of FastRTPS.
I have tried cyclonedds and I can't get any drops as well on a real machine (release mode). In a VM though there are some, definitely way less than in the other dds implementations though (in about 30% of tries I get drops -> 0-2% messages).
Also debug mode works better in comparison for all dds implementations but with FastRTPS about 50% of the sent messages still can't be received (the runtime is increasing severely though, which might shadow the actual problem)
I was able to do a quick digging. Our first thought was it is a problem with QoS configuration. In particular with the History depth and the Asynchronous thread. By default in ROS2, FastRTPS and Connext uses the asynchronous thread. This implies the user's sample is stored in the History and another thread is in charge of sent it, not the user's thread directly. Due to the example is writing so fast maybe the history is getting full and removing sample before being sent by the async thread. I made the test in my machine using localhost:
Default ROS2 configuration:
[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 89098 messages
[INFO] [listener]: Lost 10902 messages
Changing QoS to use sync thread:
[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages
FYI. We are working to support sending large data using the synchronous thread. Then we can use it by default in ROS2.
@richiware
this QoS for sync thread you mentioned previous comment, that is for fastrtps native setting, right? (not ROS2 QoS)
could you tell us how/what to set the QoS for fastrtps to enable sync thread? i will confirm it on my side as well.
Write a DEFAULT_FASTRTPS_PROFILES.xml file in the directory where you are launching the executables. The XML content:
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<publisher profile_name="publisher_default"
is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
</profiles>
</dds>
When launching the talker, use this command:
RMW_FASTRTPS_USE_QOS_FROM_XML=1 ./build/high_freq_pub/talker
@richiware While the custom config might workaround the problem I would like to find out why a release build has so much worse performance than a debug build.
@dirk-thomas
My guess: in the release build you are filling the buffers faster than in the debug build.
My guess: in the release build you are filling the buffers faster than in the debug build.
@JaimeMartin I don't understand how this statement makes any sense. Both builds use the exact same application with a fixed frequency. How can the debug build keep up with the rate and not loose any packages but the release build drops 60-70%?
@dirk-thomas because perhaps in the debug release you are not really sending at that frequency as it takes more time to call write. I have read they are using periods of sleep of 1 uS.
We will have more details soon.
perhaps in the debug release you are not really sending at that frequency as it takes more time to call write.
@JaimeMartin I added measuring the time it takes to do the publishing in DensoADAS/high_freq_pub_example#3. For me the put is the following which doesn't indicate that the release build is publishing data faster at all:
[INFO] [talker]: Sent 100000 messages in 13,772,113,709 ns
[INFO] [talker]: Sent 100000 messages in 15,871,665,372 ns
@dirk-thomas thanks for the experiment. We are working on this now, we will update you as soon we have more information.
@dirk-thomas @JaimeMartin @richiware
confirmed no messages dropped with the following QoS profile,
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<publisher profile_name="publisher_default"
is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
</profiles>
root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub talker
[INFO] [talker_high_freq]: Sending messages...
[INFO] [talker_high_freq]: Sent 100000 messages
root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub listener
[INFO] [listener_high_freq]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener_high_freq]: Received 100000 messages
[INFO] [listener_high_freq]: Lost 0 messages
Dear all,
About this problem, I have different opinion. I don't think 'drop' behavior is a problem.
'Reliable' is Qos for DDS layer. In specification (Data Distribution Service, v1.4, page 98), 'Reliable' means "Specifies the Service will attempt to deliver all samples in its history. Missed samples may be retried." Based on my understanding, it make sure sample in history can be delivered successfully. That is, once sample is delivered, delivery must be successful (make sure sample is received). Not make sure all samples in history must be delivered (In this case, samples should be dropped from history).
While the speed of importing samples is faster than the speed of DDS send, 'Drop' behavior must exists except HISTORY is set as KEEP_ALL.
For test sample about talker and listener, use history depth with 1000, but send data samples count is 10000, this must cause the problem.
From fastrtps publisher side, publisher post data into the publisher history, and use other thread to send data.
If fastrtps send data by transport is slower than ros2 publish data, the data might be lost(removed) since history depth limitation.
```c++
bool PublisherHistory::add_pub_change(
CacheChange_t* change,
WriteParams &wparams,
std::unique_lock<RecursiveTimedMutex>& lock,
std::chrono::time_point<std::chrono::steady_clock> max_blocking_time)
{
if(m_isHistoryFull)
{
bool ret = false;
if(m_historyQos.kind == KEEP_ALL_HISTORY_QOS)
{
ret = this->mp_writer->try_remove_change(max_blocking_time, lock);
}
else if(m_historyQos.kind == KEEP_LAST_HISTORY_QOS)
{
ret = this->remove_min_change();
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ <== delete the published data that might be not send
}
```
From fastrtps subscriber side, subscriber get data into the subscriber history, and ros2 use other thread to call subscriber_callback.
If the ros2 subscriber callback (ros2 got data from fastrtps) is slower than rtps receiving data, the data might be lost.
bool SubscriberHistory::received_change_keep_last_no_key(
CacheChange_t* a_change,
size_t /* unknown_missing_changes_up_to */ )
{
bool add = false;
if (m_changes.size() < static_cast<size_t>(m_historyQos.depth) )
{
add = true;
}
else
{
// Try to substitute the oldest sample.
// As the history should be ordered following the presentation QoS, we can always remove the first one.
add = remove_change_sub(m_changes.at(0));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ <== delete the subscriber data that might not be got by ros2
}
if (add)
{
return add_received_change(a_change);
}
return false;
}
NOTE: Is this behavior a problem?
root@bdk:~/workspace# ros2 run high_freq_pub listener [INFO] [listener]: Ready to receive messages ^C[INFO] [rclcpp]: signal_handler(signal_value=2) [INFO] [listener]: Received 100000 messages [INFO] [listener]: Lost 0 messages
root@bdk:~/workspace# ros2 run high_freq_pub talker [INFO] [talker]: Sending messages... [INFO] [talker]: Sent 100000 messages ^C[INFO] [rclcpp]: signal_handler(signal_value=2)
Thank you very much @Barry-Xu-2018 for your thorough explanation
I would like to add what happens when not using the asynchronous writing thread. In that case, the thread adding the sample is blocked until the RTPS datagram is sent to the network, so each sample is added after the previous one has been physically put on the socket output buffer, and then the publisher doesn't drop any message
While all of this is correct it doesn't explain why it works in debug mode but fails in release mode. And that is the pending question to answer.
@Barry-Xu-2018 @MiguelCompany
use rclcpp::KeepAll() / rclcpp::KeepLast(100000) to instead of rclcpp::KeepLast(1000), it works good.
are you sure about this? this is actually what I've done before, I receive the following error from rmw_fastrtps right after start publishing data.
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker
[INFO] [talker_high_freq]: Sending messages...
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
what(): failed to publish message: cannot publish data, at /root/ros2_colcon_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:52, at /root/ros2_colcon_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:290
i did not have time to debug the Fast-RTPS implementation, but it seems that new change does not get applied via create_new_change().
@DensoADAS
could you please confirm if you have the same runtime error with following? https://github.com/DensoADAS/high_freq_pub_example/pull/4
@fujitatomoya Can confirm the exception which probably happens because the resource limit is exceeded. When catching the exception and resending the message (see this branch for an example) all messages are received by the subscriber.
The runtime of the FastRTPS dds is quite high though in comparison to other vendors:
Time until all messages (100000) are published
FastRTPS: ~20min
OpenSplice: ~14s
CycloneDDS: ~13s
Also the workaround using the XML file for FastRTPS does not really change anything for me. The only difference is that I get outputs like the following:
2019-11-25 09:47:58.770 [RTPS_HISTORY Error] Change payload size of '96' bytes is larger than the history payload size of '35' bytes and cannot be resized. -> Function add_change
@fujitatomoya
are you sure about this?
I used debug version of ros2 (workspace contains all packages) to test.
If I use ros2 with release to test high_freq_pub, the above error happened (high probability). It seems that there is a bug in rmw_fastrtps/fastrtps.
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker [INFO] [talker_high_freq]: Sending messages...
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
what(): failed to publish message: cannot publish data, at /root/ros2_colcon_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:52, at /root/ros2_colcon_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:290
@fujitatomoya @Barry-Xu-2018 @DensoADAS
When using KeepAll(), additional parameters get into action: resource_limits
and max_blocking_time
.
max_blocking_time
is part of the reliability
QoS. It is specified as a duration with sec and nanosec. When the publisher tries to add a new sample and the history is full (due to resource limits), it will block for this amount of time (default 100ms) and return false if it couldn't add the change to the history.So we have two options for our XML file:
<!-- XML settings for infinite resource limits -->
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<publisher profile_name="publisher_default" is_default_profile="true">
<topic>
<resourceLimitsQos>
<max_samples>0</max_samples>
<max_instances>0</max_instances>
<max_samples_per_instance>0</max_samples_per_instance>
<allocated_samples>0</allocated_samples>
</resourceLimitsQos>
</topic>
</publisher>
</profiles>
</dds>
<!-- XML settings for infinite blocking time -->
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<publisher profile_name="publisher_default" is_default_profile="true">
<qos>
<reliability>
<max_blocking_time>
<sec>DURATION_INFINITE_SEC</sec>
</max_blocking_time>
</reliability>
</qos>
</publisher>
</profiles>
</dds>
@MiguelCompany @Barry-Xu-2018 @DensoADAS
the precious setting you mentioned, both of them lead the following error,
root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub listener
terminate called after throwing an instance of 'eprosima::fastcdr::exception::NotEnoughMemoryException'
what(): Not enough memory in the buffer stream
i think that this is cpu and memory consumption trading-off, so it is about how efficient can be processed. so far https://github.com/ros2/rmw_fastrtps/issues/338#issuecomment-557140854 can work without any message drop.
and as @dirk-thomas mentined on https://github.com/ros2/rmw_fastrtps/issues/338#issuecomment-557170716, i would like to know the (expected) reason for that. (it just does not add up.)
Sorry for the time-lapse since my last comment. I wanted to make a deep investigation about this problem, and I also was out-off-office for several days.
Although the reason could have been how reliable publisher works when its history is configured as KEEP_LAST, finally this is not the case. Using netstat
, I discovered that the kernel was discarding UDP packets. The socket's buffer must be getting full because the subscriber is not draining the buffer fast enough.
Then the problem was in the subscriber's side. I used callgrind
to check why the subscriber is not getting UDP packets fast enough. I've found several logical code blocks which I was able to improve. eProsima/Fast-RTPS#899 is a meta-ticket that summarizes these improvements. Also to make easier for you, I've joined all improvements in feature/test-improvements/1.9.x branch, so you can test them with high_freq_pub
example,
@dirk-thomas, about why Debug was receiving all the samples and Release was not, this is my conclusion: with Release the subscriber's socket buffer is getting full faster than in Debug so the kernel starts to discard UDP packets.
1.9.x | 1.9.x | feature/test-improvements | feature/test-improvements | |
---|---|---|---|---|
Release | Debug | Release | Debug | |
nanoseconds | 28,118,651,346 | 28,553,887,775 | 5,977,088,688 | 28,560,103,031 |
success | FAIL | OK | OK | OK |
And last question I think I have to answer. Why does the talker take so much time to send all the samples in Release 1.9.x? Because the subscriber is losing samples due to the kernel discarding UDP packets, and then reliability mechanism starts to work and the publisher's asynchronous thread has much to do (sending requested samples for the subscriber), blocking the user's thread more time.
@richiware I have tested the improvement branch with our repo and can confirm that FastRTPS behaves much better than before. I have tested the example multiple times (down below are 5 test runs for each configuration) and got the following results:
Using the master branch of high_freq_example and everything built with debug mode:
5,859,654,444 ns -> 0 lost messages
5,742,361,558 ns -> 0 lost messages
8,303,138,780 ns -> 3040 lost messages
8,034,343,035 ns -> 5370 lost messages
5,901,925,910 ns -> 0 lost messages
Using the feature/keep_all_qos_resending branch of high_freq_example and everything built with debug mode:
6,201,670,377 ns -> 0 lost messages
33,496,412,209 ns -> 0 lost messages
6,381,162,971 ns -> 0 lost messages
30,606,065,461 ns -> 0 lost messages
12,073,535,318 ns -> 0 lost messages
Using the master branch of high_freq_example and everything built with release mode:
2,533,380,340 ns -> 19866 lost messages
2,290,880,031 ns -> 0 lost messages
2,286,291,540 ns -> 0 lost messages
2,484,636,737 ns -> 12342 lost messages
2,314,744,746 ns -> 0 lost messages
Using the feature/keep_all_qos_resending branch of high_freq_example and everything built with release mode:
2,191,338,534 ns -> 0 lost messages
2,179,242,513 ns -> 0 lost messages
2,159,576,817 ns -> 0 lost messages
2,163,511,295 ns -> 0 lost messages
2,152,910,959 ns -> 0 lost messages
With the feature/keep_all_qos_resending branch and release mode the publish time is around 2,000,000,000 ns for about 90% of my test runs and then I also get no exception when publishing (so no re-publishing occurs). Sometimes though the publish time takes about twice that value and the time until the subscriber gets all messages is comparably higher as well then (~ 2-3 times higher than usual, haven't exactly measured this yet). Everything above was tested without the XML files mentioned above and on machine No.1 which I have mentioned in one of my previous comments.
All PRs mentioned on eProsima/Fast-RTPS#899 have been merged.
@DensoADAS I think this can be closed.
@MiguelCompany I've tested the example again with the latest ros2 sources (Fast-RTPS version: bec830c21e82f850a909fde6739e90d07f0b0bd6), everything built with Release mode:
When using the master branch of the example (keepLast):
[INFO] [1580374873.252230767] [talker_high_freq]: Sent 100000 messages in 21844108083 ns
[INFO] [1580374874.288028606] [rclcpp]: signal_handler(signal_value=2)
real 0m24.217s
user 0m22.093s
sys 0m1.203s
I receive all messages most of the times with durations similar to the one measured above. Sometimes (rarely though) I don't receive all messages (about 1% are missing then).
When using the feature/keep_all_qos_resending branch (keepAll settings):
[INFO] [1580374661.961899286] [talker_high_freq]: Sent 100000 messages in 198026381838 ns
[INFO] [1580374699.792305152] [rclcpp]: signal_handler(signal_value=2)
real 3m57.195s
user 0m3.150s
sys 0m1.295s
I receive all messages but a lot of messages have to be resent since an exception is thrown when publishing them. The resulting duration is quite high all the time.
Maybe somebody else could check/verify this behaviour, since the durations seem quite high.
@MiguelCompany @DensoADAS
there are still message drops and failure to publish, confirmed.
The resulting duration is quite high all the time.
maybe you need to define what is the quite high, instead it is hard to understand.
the followings are release build, up-to-date ros2 source build environment.
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub talker
[INFO] [1582260699.587175751] [talker_high_freq]: Sending messages...
[INFO] [1582260734.571079814] [talker_high_freq]: Sent 100000 messages in 34983911263 ns
^C[INFO] [1582260737.896654254] [rclcpp]: signal_handler(signal_value=2)
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub listener
[INFO] [1582260695.229655656] [listener_high_freq]: Ready to receive messages
^C[INFO] [1582260738.803838167] [rclcpp]: signal_handler(signal_value=2)
[INFO] [1582260738.804346691] [listener_high_freq]: Received 99834 messages
[INFO] [1582260738.804354721] [listener_high_freq]: Lost 166 messages
root@13a691159527:~/ros2_ws/colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub talker
[INFO] [1582260796.731842463] [talker_high_freq]: Sending messages...
[INFO] [1582260797.296652333] [talker_high_freq]: Sent 100000 messages in 564818430 ns
^C[INFO] [1582260813.219777432] [rclcpp]: signal_handler(signal_value=2)
root@13a691159527:~/ros2_ws/colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub listener
[INFO] [1582260793.507338493] [listener_high_freq]: Ready to receive messages
^C[INFO] [1582260804.081090170] [rclcpp]: signal_handler(signal_value=2)
[INFO] [1582260804.081526864] [listener_high_freq]: Received 100000 messages
[INFO] [1582260804.081540514] [listener_high_freq]: Lost 0 messages
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub talker
[INFO] [1582261236.582966810] [talker_high_freq]: Sending messages...
[INFO] [1582261237.008120375] [talker_high_freq]: Failed to publish msg #5001
[INFO] [1582261237.108303590] [talker_high_freq]: Failed to publish msg #5001
...<every 5000, there is failure to publish message>
[INFO] [1582261295.686259550] [talker_high_freq]: Sent 100000 messages in 59103302060 ns
^C[INFO] [1582261298.092971868] [rclcpp]: signal_handler(signal_value=2)
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub listener
[INFO] [1582261231.757531033] [listener_high_freq]: Ready to receive messages
[INFO] [1582261295.686274440] [listener_high_freq]: Received 100000 messages
^C[INFO] [1582261298.844459160] [rclcpp]: signal_handler(signal_value=2)
[INFO] [1582261298.845034745] [listener_high_freq]: Received 100000 messages
[INFO] [1582261298.845056715] [listener_high_freq]: Lost 0 messages
root@13a691159527:~/ros2_ws/colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub talker
[INFO] [1582261368.855976138] [talker_high_freq]: Sending messages...
[INFO] [1582261369.469863591] [talker_high_freq]: Sent 100000 messages in 613895923 ns
^C[INFO] [1582261372.033819949] [rclcpp]: signal_handler(signal_value=2)
root@13a691159527:~/ros2_ws/colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@13a691159527:~/ros2_ws/colcon_ws# ros2 run high_freq_pub listener
[INFO] [1582261363.066281806] [listener_high_freq]: Ready to receive messages
[INFO] [1582261369.469860111] [listener_high_freq]: Received 100000 messages
^C[INFO] [1582261372.668585602] [rclcpp]: signal_handler(signal_value=2)
[INFO] [1582261372.669049716] [listener_high_freq]: Received 100000 messages
[INFO] [1582261372.669058966] [listener_high_freq]: Lost 0 messages
@fujitatomoya
maybe you need to define what is the quite high, instead it is hard to understand.
That was related to the time measurements in my last comment, as it took about 4 minutes with KeepAll compared to about 24 seconds with KeepLast (just tested it again with the latests ros2 sources and the durations are still the same).
<every 5000, there is failure to publish message>
For me it is 5000 only at the beginning (until about 66000 messages were sent), then the message difference until the publishing fails again drops to lower values (200-300).
<resourceLimitsQos> <max_samples>0</max_samples> <max_instances>0</max_instances> <max_samples_per_instance>0</max_samples_per_instance> <allocated_samples>0</allocated_samples> </resourceLimitsQos>
In the version of Foxy, in which the fastdds version is 2.1.1, setting resourceLimitsQos to all zero, is invalid, and will cause the error about depth must be <= max_samples
[RTPS_QOS_CHECK Error] INCORRECT TOPIC Qos (ros_discovery_info): depth must be <= max_samples -> Function checkQos
the issue get resolved in the version of humble, in which the fastdds version is 2.6.7.
When using KeepAll(), additional parameters get into action:
resource_limits
andmax_blocking_time
.
- Resource limits is, in a way, similar to the history depth. It specifies the maximum number of samples allowed. You can set it to infinity using maximum vales of 0
max_blocking_time
is part of thereliability
QoS. It is specified as a duration with sec and nanosec. When the publisher tries to add a new sample and the history is full (due to resource limits), it will block for this amount of time (default 100ms) and return false if it couldn't add the change to the history.
according to this doc https://fast-dds.docs.eprosima.com/en/v2.6.7/fastdds/use_cases/realtime/blocking.html , only if the cmake_option STRICT_REALTIME
get open, the max_blocking_time will take effect, and write will timeout and return false.
in ROS2/Fastdds, the STRICT_REALTIME
is set to OFF
, so you will need to compile the fastrtps from source.
Bug report
Required Info:
Steps to reproduce issue
Create a simple publisher/subscriber pair, both having reliable QoS settings. Publish a fixed number of messages on the publisher side with a high frequency (>1kHZ -> <1ms intervals (e.g. 1us)).
Example: Talker:
Listener:
Expected behavior
The subscriber receives all messages in order or an exception is thrown (or some other error propagation method).
Actual behavior
Some messages are published but not received on the subscriber side without any notification to the user or thrown errors. Basically behaves similar to best_effort strategy.
Additional information
A full implementation example which reproduces the error can be found here: https://github.com/DensoADAS/high_freq_pub_example.git This is basically the Talker/Listener demo from the demo_nodes_cpp package with a different message type for easier counting and a higher pub rate.
This problem occurs with various middleware implementations (as well as with the eloquent binaries and the ros2 master) and usually gets worse the higher the publisher frequency is. Also different QoS settings were tested (different depth/keepAll) but the problem still occurs (less frequently with keepAll but some messages are still dropped). The problem can be observed already with normal publishing rates, e.g. intervals between published messages of about 1ms.
With the simple example above the following dropped message counts were observed: