Messages are dropped when using FastRTPS on raspberry pi 3

firesurfer commented 7 years ago

We noticed that FastRTPS seems to drop messages when the receiver runs on a Raspberry Pi. This happens with an x86 computer as sender and also with another Raspberry Pi as sender.

The used message: msgs/StorageData

bool notavailable
string uuid
string sendernode
string key
string type
int64 componentid
uint64 unixtime
uint8[] data

The commands used for sending and receiving:

ros2 topic echo /storage_data_topic
ros2 topic pub /storage_data_topic msgs/StorageData '{"uuid": "test", "sendernode": "mynode", "data": [1,2,3,4,5,6,7,8,9,10]}'

Start the listener first, then publish. What can be observed: when publishing from an x86 computer, at least 4 messages have to be sent before one is received. When publishing from another Pi, no messages are received at all. Restarting the listener helps; messages are received afterwards. In general, there are sometimes long delays between messages and/or messages get dropped.

The phenomenon is even more extreme in our own application with the parameters QoS profile. The subscription running on the Pi does not receive any messages of this type (other messages work more or less fine, though sometimes delayed by 10 s). This also happens when listening with ros2 topic echo but sending with our own application.

Used version of ROS2: current master branches
x86 computer: Debian Testing
Raspberry Pi 3: Raspbian Testing
Network topology: multiple switches configured with Spanning Tree Protocol (STP), multiple Raspberry Pis in the network

Edit: Using Wireshark I could determine that there is often an ICMP Destination unreachable (Port unreachable) message addressed to either the x86 computer or the Raspberry Pi.

Edit 2: I could determine that this issue depends on the data in the message. Example: set "sendernode" to any value, then send it. It takes at least three sending cycles until a message is received. Stopping the sending process and restarting it results in the message being received immediately. Stopping it and changing the data slightly results in one or two sending cycles until a message is received. Changing the data a lot, such as setting another field, results in at least three sending cycles.

mikaelarguedas commented 7 years ago

@firesurfer Can you confirm that this problem happens only if the Raspberry Pi is involved and not between x86 machines (same network configuration)?

@richiware Did you ever encounter similar problems when testing on RaspberryPi ?

firesurfer commented 7 years ago

@mikaelarguedas I can confirm that this doesn't happen on x86 if both sender and receiver are running on the same machine. For the case of two x86 machines I will do another test on Friday.

richiware commented 7 years ago

I will prepare a raspberry pi 3 environment and test it.

richiware commented 7 years ago

I was testing with two scenarios:

Raspberry Pi 3 echoing /storage_data_topic and x86 publishing /storage_data_topic
Raspberry Pi 3 echoing /storage_data_topic and Raspberry Pi 3 publishing /storage_data_topic

I'm using current master branches of ROS2 repositories.

In both cases the subscriber starts to print data after a delay of 4 seconds. Then it receives data without any problem. I investigated the delay. Using Wireshark I saw that the first RTPS packet (Participant discovery message) is sent 4 seconds after the application started. Right now I don't know why: whether the application takes a long time to boot or there is some other reason.

firesurfer commented 7 years ago

On x86 in our network setup there is a delay of one message before the first message is received, depending on whether the message data has changed. If the message data has changed, for example by additionally setting a field that was not set before, there is a delay of one message. If the data has not changed, or only a field that already had data is changed, there is in most cases no delay.

Edit: Another interesting thing I found while debugging: if I put a small delay after each publish call in my own program, the messages seem to be transmitted fine.

// Workaround: pause briefly after each publish call.
this->store_data_publisher->publish(msg);
std::this_thread::sleep_for(std::chrono::milliseconds(10));
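
For anyone who wants to try the same workaround outside of our code base, here is a minimal, ROS-independent sketch of it. The names `publish_with_gap` and `publish_one` are made up for illustration; `publish_one` stands in for whatever actually publishes (for example a lambda wrapping the `publish(msg)` call above). The 10 ms value is simply the one from the snippet above, not a tuned number, and this only spaces out the publish calls. It does not address whatever causes the drops.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls `publish_one(i)` for i = 0..count-1 and sleeps
// for `gap` after every call. `publish_one` is a placeholder for the real
// publish call and is not part of any ROS2 API.
inline int publish_with_gap(const std::function<void(int)>& publish_one,
                            int count,
                            std::chrono::milliseconds gap) {
  assert(count >= 0);
  int sent = 0;
  for (int i = 0; i < count; ++i) {
    publish_one(i);                    // the actual publish goes here
    ++sent;
    std::this_thread::sleep_for(gap);  // same pause as in the workaround
  }
  return sent;
}
```

With a 10 ms gap, a single loop like this cannot issue more than about 100 publish calls per second, which may be useful to keep in mind when comparing publishing rates.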

abilbaotm commented 6 years ago

Hi. We are having similar issues with a ROS2 network where the nodes stop working after some time. We noticed that some of the RTPS messages are encapsulated inside an ICMP message (with a Destination unreachable (Port unreachable) error), as commented here: https://github.com/ros2/rmw_fastrtps/issues/157#issue-265790753. INFO_DST, INFO_TS and DATA: (screenshot from 2018-01-10 16-46-11)

Also HEARTBEAT: (screenshot from 2018-01-10 16-50-36)

firesurfer commented 6 years ago

@richiprosima Any news on this ? This issue practically renders our raspberry pi mesh network unusable with ROS2.

richiware commented 6 years ago

I'm updating my raspberry ros2 environment and will try again to reproduce the issue.

richiware commented 6 years ago

Sorry for the delay. I was travelling. I managed to prepare a Raspberry Pi environment. It is simple: two Raspberry Pis communicating through a switch using Ethernet. The Raspberry Pis have been communicating with each other for 6 hours without problems.

(screenshot from 2018-01-29 14-48-05)

@abilbaotm What differences are there between your scenario and mine? Are you using ethernet or wireless? What is the info returned by ifconfig -a? All info is appreciated.

abilbaotm commented 6 years ago

Hi. Our setup is as follows.

Each machine in the network has a node and multiple publishers.
The value of the messages changes each time they are published.
We are using Ethernet.

The output of ifconfig -a is as follows:


enxb827eb945e61: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
    inet 10.0.0.28  netmask 255.255.255.0  broadcast 10.0.0.255
    inet6 fe80::ba27:ebff:fe94:5e61  prefixlen 64  scopeid 0x20<link>
    ether b8:27:eb:94:5e:61  txqueuelen 1000  (Ethernet)
    RX packets 60045  bytes 13982058 (13.3 MiB)
    RX errors 0  dropped 0  overruns 0  frame 0
    TX packets 81003  bytes 17415879 (16.6 MiB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
    inet 127.0.0.1  netmask 255.0.0.0
    inet6 ::1  prefixlen 128  scopeid 0x10
    loop  txqueuelen 1000  (Local Loopback)
    RX packets 1590  bytes 140098 (136.8 KiB)
    RX errors 0  dropped 0  overruns 0  frame 0
    TX packets 1590  bytes 140098 (136.8 KiB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
    ether b8:27:eb:c1:0b:34  txqueuelen 1000  (Ethernet)
    RX packets 0  bytes 0 (0.0 B)
    RX errors 0  dropped 0  overruns 0  frame 0
    TX packets 0  bytes 0 (0.0 B)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


At what rate are you publishing? I think the issue occurs when data is published too fast, e.g. at more than `50-100 Hz`.

Thanks @richiware for your time!

firesurfer commented 6 years ago

Hi, we noticed this issue when starting to publish. It takes around 4 publishing cycles until a message is received. Afterwards all messages are received. If you cancel the publishing node, wait 10-20 seconds and then start it again, the delay is there again. If you cancel the publishing node and restart it immediately, there is no delay.

When using Wireshark we also see a lot of ICMP Destination unreachable (Port unreachable) messages when using ROS2. Nevertheless, we can communicate via ROS2 with the Raspberry Pis that are listed in the Destination field of the corresponding Wireshark entries. But there is the delay mentioned above, or the messages are dropped (I can't say whether the messages are just delayed or whether the first 2 or 3 messages are dropped).

I just did some new tests. Apparently it depends on whether there is any other ROS2 communication on the network. For testing purposes I connected only two Raspberry Pis and compared two cases: our own software running together with ros2 topic, and ros2 topic alone. In the second case all messages are received. What I noticed is that ros2 topic pub apparently has a long delay at startup. We noticed ourselves that adding a delay at startup resolves some message transport problems in our software. @mikaelarguedas could you perhaps explain why there is such a long delay at startup?

And our network configuration of one pi.

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.180.55.31  netmask 255.255.255.0  broadcast 10.180.55.255
        ether b8:27:eb:8a:92:3b  txqueuelen 1000  (Ethernet)
        RX packets 127525  bytes 13552241 (12.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 61305  bytes 9454530 (9.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1  (Local Loopback)
        RX packets 10490  bytes 1520412 (1.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10490  bytes 1520412 (1.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

firesurfer commented 6 years ago

Hi, the problem with ros2 echo / pub was apparently that you have to wait long enough for echo to be set up properly. During my testing last Friday it seems that I sometimes waited just a bit longer.
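
Instead of guessing how long "long enough" is, one option is to poll for a readiness condition with a timeout before sending the first message. Below is a small, ROS-independent sketch of that idea. `wait_until` and its parameters are made-up names for illustration. What counts as "ready" is up to the caller; newer rclcpp versions expose `get_subscription_count()` on publishers, which could serve as the predicate, but whether that is available in the ROS2 version used here is an assumption I have not checked.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls `ready()` every `poll` interval until it returns
// true or `timeout` has elapsed. Returns true if the condition was met,
// false if the deadline passed first.
inline bool wait_until(const std::function<bool()>& ready,
                       std::chrono::milliseconds timeout,
                       std::chrono::milliseconds poll = std::chrono::milliseconds(50)) {
  assert(poll.count() > 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!ready()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // gave up: condition not met in time
    }
    std::this_thread::sleep_for(poll);
  }
  return true;
}
```

A publisher could call something like `wait_until(has_subscriber, std::chrono::seconds(10))` before its first publish, where `has_subscriber` is whatever check the application has available, and fall back to a fixed delay if no such check exists.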

Nevertheless I could track down the ICMP - Destination unreachable error. But I'll open another issue tracker for that.

I think that the question why there is such a long delay at startup is still valid.

firesurfer commented 6 years ago

Please see https://github.com/ros2/ros2/issues/480 regarding this issue. A colleague of mine created a container environment in which the error comes up.

In our real setup the issue improved (but was not completely solved) after the deadlock fix was committed in Fast-RTPS last week: https://github.com/eProsima/Fast-RTPS/commit/17e717c0740dab99c353b07c4e76237bcc7a32ba

ros2 / rmw_fastrtps

Messages are dropped when using FastRTPS on raspberry pi 3 #157