ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0

CycloneDDS Unnecessarily Sends Packets Through Network #489

Open amalnanavati opened 7 months ago

amalnanavati commented 7 months ago

Bug report

Overview: While monitoring the LAN in/out throughput on my router, I launch a publisher on my local machine, followed by one subscriber (ros2 topic echo ...), followed by a second subscriber (ros2 topic echo ...). As soon as I launch the second subscriber, my router registers packets going through the LAN corresponding to the throughput of the topic, even though all publishers and subscribers are running on the local machine.

Required Info:

Steps to reproduce issue

Reproducing this issue requires a way to monitor router LAN throughput; I use NetCloud OS for the Cradlepoint IBR900-600M router. Additionally, this issue is most pronounced when the topic has high throughput; thus, for the publisher I use the DummyRealSense node from this package, and its raw aligned color to depth image.

While monitoring the LAN in/out throughput of your router, do:

  1. ros2 daemon stop
  2. In every terminal: export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
  3. In one terminal, run the publisher: ros2 launch feeding_web_app_ros2_test feeding_web_app_dummy_nodes_launch.xml run_web_bridge:=false run_food_detection:=false run_face_detection:=false run_food_on_fork_detection:=false run_motion:=false
  4. After a few seconds, run the first subscriber: ros2 topic echo /camera/aligned_depth_to_color/image_raw | grep sec
  5. After a few seconds, run the second subscriber: ros2 topic echo /camera/aligned_depth_to_color/image_raw | grep sec
  6. After a few seconds, terminate the second subscriber.
  7. After a few seconds, terminate the first subscriber.
  8. After a few seconds, terminate the publisher.
  9. After a few seconds, turn off the ROS2 daemon: ros2 daemon stop
  10. In every terminal: export RMW_IMPLEMENTATION=rmw_fastrtps_cpp
  11. Repeat steps 3-8

Expected behavior

LAN in/out throughput should be near 0 Mbps the entire time.

Actual behavior

While using CycloneDDS, after we launch the second subscriber, the LAN throughput jumps up to the throughput of the topic (~70 Mbps). Terminating the second subscriber brings it back down. This issue does not happen with FastRTPS. See the below screenshot for LAN throughput with annotated timestamps corresponding to the above steps.

[Image: Screenshot 2024-04-19 at 9 45 33 AM]

Additional information

eboasson commented 7 months ago

That sounds very much like Cyclone DDS switching to multicast to avoid having to send the data multiple times (once for each subscriber), and all multicasts getting forwarded to the physical network.

There are two possible reasons for forwarding the multicasts to the network:

The first is easily fixed by enabling IGMP snooping, the second is easily fixed by configuring the used multicast addresses differently (Discovery/DefaultMulticastAddress in the Cyclone DDS XML is the easiest).

Another approach is of course to not use multicast at all for data, that's most easily done by setting General/AllowMulticast to spdp (the default if the machine is using a WiFi adapter). Incidentally, setting Discovery/DefaultMulticastAddress to 0.0.0.0 has the same effect but still allows you to configure multicast for specific topics.
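As a minimal sketch of the first option, the General/AllowMulticast setting could be written as a configuration file like the one below. It assumes the same `Domain Id="any"` wrapper as the config file shown elsewhere in this thread; it has not been tested against a specific Cyclone DDS release.

```xml
<?xml version="1.0" encoding="utf-8"?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <General>
      <!-- Use multicast for participant discovery (SPDP) only; data goes out as unicast -->
      <AllowMulticast>spdp</AllowMulticast>
    </General>
  </Domain>
</CycloneDDS>
```

Pointing CYCLONEDDS_URI at such a file before starting the applications should be enough for it to take effect.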

Yet another route is to configure it to use both the loopback interface and the ethernet interface. IINM all else being equal it prioritises loopback over ethernet, and so if the loopback interface supports multicast it will do that. The issue with this route is that Linux has a quirk where an IPv4 loopback interface will happily perform multicasts, but the "multicast" flag isn't set on the interface. You can set that by hand, or you can use an override in the Cyclone configuration. (Cyclone master is a bit more helpful, but not 100% compatible with ROS 2 at the moment. This I am trying to fix asap.)

And with regards to the differences with FastRTPS: different DDS implementations make different choices w.r.t. the use of multicast, selecting interfaces to use, and choosing addresses to send data to when multiple options exist. Cyclone is a bit unusual in that it uses only one interface by default and that it really likes to use multicast.

The one-interface-by-default may not be ideal in the "modern" context with containers, virtual network interfaces and oftentimes multi-homed machines (wifi+wired), because it means it sometimes doesn't work out-of-the-box. The other side of the coin is that it at least avoids spraying junk all over the networks: most systems really want to use a very specific subset of the network interfaces, but that kind of spraying typically stays hidden for a long time.

For multicast the biggest issues are that switches drop it more quickly than they drop unicast, that it works very badly with WiFi and that IP multicast typically ends up anywhere there's a DDS process running (if one sticks to the default addresses), even when those processes are unrelated. So making it a bit less eager to switch to multicast is probably wise.

amalnanavati commented 7 months ago

This response is super helpful, thank you so much for the details!

> For multicast the biggest issues are that switches drop it more quickly than they drop unicast

This is precisely the issue we had that motivated me to look into WiFi throughput, leading to the above issue. Our router was dropping clients, and it’s helpful to understand that multicast could be the reason.

For what it’s worth, I did enable multicast on loopback manually, and still got the above results. It seemed to me that Cyclone was preferring Ethernet as its “one interface by default.”

It sounds like getting CycloneDDS to work in this scenario would require more careful configuration of the XML file. This issue is not high-priority for me as FastRTPS seems (for now) to be working in my scenario. However, if/when I switch back to CycloneDDS, I’ll follow your pointers on tuning the XML file. Thanks for the detailed pointers and explanation!

eboasson commented 7 months ago

Yes, it defaults to a single interface and prefers Ethernet ... it takes some more attention on the configuration.

If you'd be interested in checking that it works as you expect when you simply avoid multicast, it should be sufficient to do export CYCLONEDDS_URI="<Disc><DefaultMulticastAddress>0.0.0.0</></>" before starting the applications. You don't actually need to have an XML file somewhere, you can do a shorthand directly in the environment variable.

I'm sure you understand I would prefer it if you use Cyclone DDS 😂 But if you're happy with FastRTPS then go do fun things instead of tweaking middleware!

amalnanavati commented 6 months ago

I started looking into this more, and here is what I found:

I'd ideally like the following behavior: use multicast on both loopback and ethernet, but prefer loopback and only use ethernet if necessary (e.g., if the communication is going across devices). This is driven by our application details, where we have many subscribers per publisher, and most publisher-subscriber pairs are on the same machine, but a few cross machines (requiring ethernet).

Do you have suggestions on achieving the above behavior with a CycloneDDS config file? I'm also looking through the docs, but any suggestions you have would be much appreciated. Currently, my config file is pretty barebones:

<?xml version="1.0" encoding="utf-8"?>
<CycloneDDS
  xmlns="https://cdds.io/config"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd"
>
  <Domain Id="any">
    <Discovery>
      <DefaultMulticastAddress>0.0.0.0</DefaultMulticastAddress>
    </Discovery>
  </Domain>
</CycloneDDS>

(As a side-note, although FastRTPS addressed the issue of our network dropping clients, it introduced a new issue where sometimes there would be high latency in received small-sized messages. In our application we have lots of publishers/subscribers operating at 15-30Hz, so lots of messages are being communicated. I assume the FastRTPS issue might be due to it not using multicast, and you had mentioned that CycloneDDS strongly prefers multicast, so I figured I'd return to the CycloneDDS world and try to configure it correctly 😄 )

sloretz commented 6 months ago

@eboasson It looks like there's been a good discussion here. Is this one resolved or does it need more work? 🧇

amalnanavati commented 5 months ago

@eboasson Any pointers on what parameters to set to get the behavior I specified in my above message? Thanks in advance!

eboasson commented 5 months ago

I am so hopelessly behind on everything RMW, and ever so slightly less behind on everything Cyclone ... Thankfully you just kept pinging me ☺️🙏

> I tried your suggestion of doing export CYCLONEDDS_URI="<Disc><DefaultMulticastAddress>0.0.0.0</></>". This worked in terms of preventing packets from unnecessarily being sent through the network. But did I understand you correctly that doing so essentially disables multicast?

Yes, but there's "disabling" and "really disabling". This means there is no default multicast address anymore, which has two effects:

The terminology is ancient and inherited from OpenSplice DDS's original native protocol stack (one that antedates the DDSI protocol), but if you can live with that then something like:

    <Discovery>
      <DefaultMulticastAddress>0.0.0.0</DefaultMulticastAddress>
    </Discovery>
    <Partitioning>
      <NetworkPartitions>
        <NetworkPartition Name="multi" Address="239.255.0.2"/>
      </NetworkPartitions>
      <PartitionMappings>
        <PartitionMapping DCPSPartitionTopic="*.*low_rate*" NetworkPartition="multi"/>
      </PartitionMappings>
    </Partitioning>

to map all traffic with topics containing "low_rate" in the name to 239.255.0.2 and leave everything else at unicast.

You can't currently write exclusions, so there's currently no way to say "everything but this topic". I think it would be really easy to improve it to do exclusions and to allow specifying that it has to map to unicast only. It's just not there today.

> I'd ideally like the following behavior: use multicast on both loopback and ethernet, but prefer loopback and only use ethernet if necessary (e.g., if the communication is going across devices). This is driven by our application details, where we have many subscribers per publisher, and most publisher-subscriber pairs are on the same machine, but a few cross machines (requiring ethernet).

I think that requires no more than:

    <General>
      <Interfaces>
        <NetworkInterface name="lo0" priority="default" multicast="true"/>
        <NetworkInterface name="en0" priority="default"/>
      </Interfaces>
    </General>

(macOS network interface names, but you get the idea.) The multicast="true" overrides whatever the kernel says and should work on Linux and is not needed if you have set the MULTICAST flag on the interface. I have to admit there's a slight risk I am confusing behaviour in master with behaviour in 0.10.x and in any case on master it would figure out that multicast over loopback works fine without any configuration options.

(If you want to use Cyclone DDS master with ROS, then https://github.com/eboasson/rmw_cyclonedds/tree/cdds-master will probably do the trick. I intend to do a PR after I dot the i's and cross the t's ... It has some nice side benefits, like integrating type discovery in ROS with type discovery in DDS.)

Cyclone considers loopback to be lower cost than Ethernet so this should automatically result in all traffic using loopback only when possible. If everything works the way it should, doing:

<NetworkInterface name="en0" priority="default" allow_multicast="false"/>

should prevent the use of multicast over ethernet while allowing it over loopback. (I'm not sure it will work as intended, especially not in 0.10.x.)

Hope this is still useful after so much time!
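For reference, the interface fragment above could be wrapped into a complete file along the following lines. This is only a sketch: the Linux interface names `lo` and `eth0` are assumptions (substitute whatever `ip link` reports), the `Domain Id="any"` wrapper is borrowed from the config file posted earlier in the thread, and, as noted above, behaviour may differ between 0.10.x and master.

```xml
<?xml version="1.0" encoding="utf-8"?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain Id="any">
    <General>
      <Interfaces>
        <!-- Loopback: multicast="true" overrides the kernel's missing MULTICAST flag on Linux -->
        <NetworkInterface name="lo" priority="default" multicast="true"/>
        <!-- Ethernet: "eth0" is an assumed name; used for traffic that crosses machines -->
        <NetworkInterface name="eth0" priority="default"/>
      </Interfaces>
    </General>
  </Domain>
</CycloneDDS>
```

Whether local traffic then actually stays on loopback is best confirmed by watching the router's LAN throughput while repeating the reproduction steps from the original report.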