best effort subscription not working between two computers

ghost commented 6 years ago

What works:

Run ros2 run demo_nodes_py talker on one computer.
Run ros2 run demo_nodes_py listener on another computer.

This seems to work because the talker publishes as reliable and the listener subscribes as reliable.

What doesn't work:

Run ros2 run demo_nodes_py talker on one computer.
Run ros2 topic echo /chatter std_msgs/String on another computer.

The failure seems to be related to the fact that topic echo subscribes as best effort. If both are run on a single computer, everything works fine, but something about the best effort subscription isn't working between computers.

This is a major roadblock that will keep us from updating to Bouncy.

ClarkTucker commented 6 years ago

Works for me... is there something more you could share that describes what happens?

[Screenshots: ros2_1, ros2_2]

ghost commented 6 years ago

We're running through an unmanaged switch. Both computers are plugged into it, and there's no gateway.

ClarkTucker commented 6 years ago

That configuration seems like it should work.

ghost commented 6 years ago

What configuration are you running?

ClarkTucker commented 6 years ago

The same.

ghost commented 6 years ago

I've emailed you a link to our build.

ClarkTucker commented 6 years ago

Your build also works for me.

Would it be possible to get a network packet capture taken on one of the two hosts? Start the capture, run the two test programs, wait for a bit (30 seconds?), then stop capture...

ctucker@ubuntu_2:~/asi_ros2$ ros2 topic echo /chatter std_msgs/String

data: 'Hello World: 2'

data: 'Hello World: 3'

data: 'Hello World: 4'

data: 'Hello World: 5'

data: 'Hello World: 6'

data: 'Hello World: 7'

ghost commented 6 years ago

on_listener_computer.pcapng.tar.gz

It doesn't look like I was seeing any of the autodiscovery from the other computer, but maybe I was just looking at it wrong.

ClarkTucker commented 6 years ago

Yep. Can you take a capture on the other computer?

ghost commented 6 years ago

test_two.tar.gz

It looks like the echo computer saw autodiscovery traffic this time (weird). I've attached captures taken at the same time on both the talker and echo computers.

ClarkTucker commented 6 years ago

In that last set of captures, it looks like discovery completed successfully, and I can see that there was a match on the /chatter topic. However, no DATA messages show up at all. Is it possible that you are running a firewall on either machine?

ghost commented 6 years ago

The built-in ufw is the only one that I know of, and it's disabled on both computers. Messages do get through if we subscribe as reliable. It's just the best effort subscription (echo) that doesn't work.

ClarkTucker commented 6 years ago

Hmmm. I get very different captures when I run the two programs:

ros2 run demo_nodes_py talker
ros2 topic echo /chatter

They create only a single DDS DataWriter / DataReader on the "/chatter" topic, and none of the others that I see in your capture[s] (for example, "/talker/get_parametersReply", "/talker/get_parameter_typesReply", etc.).

Are you running a different test?

ghost commented 6 years ago

Ah. The other computer was running the cpp talker by accident. That includes parameter services. The python nodes don't. We could make another capture without it if that helps.

ClarkTucker commented 6 years ago

OK, that explains it, I just wanted to make sure I was looking at the right thing.

ClarkTucker commented 6 years ago

I still can't reproduce this locally... Let's try using the 'log' version of the coredx library:

Find the location of the libdds_cf.so file
Rename that file to be libdds_cf_nolog.so: mv libdds_cf.so libdds_cf_nolog.so
Create a link to the logging library: ln -s libdds_cf_log.so libdds_cf.so
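
The three steps above can be sketched as a small shell function. This is only an illustration: the function name use_coredx_log_lib is made up here, and it assumes that libdds_cf.so and libdds_cf_log.so sit in the same directory, which the thread does not confirm for every install.

```shell
# Hypothetical helper: swap the CoreDX library for its logging variant.
# Argument: the directory that contains libdds_cf.so (location varies per install).
use_coredx_log_lib() {
    lib_dir=$1
    # Refuse to run unless both the regular and the logging library are present.
    if [ ! -f "$lib_dir/libdds_cf.so" ] || [ ! -f "$lib_dir/libdds_cf_log.so" ]; then
        echo "expected libdds_cf.so and libdds_cf_log.so in $lib_dir" >&2
        return 1
    fi
    # If libdds_cf.so is already a symlink, assume the swap was done earlier.
    if [ -L "$lib_dir/libdds_cf.so" ]; then
        echo "libdds_cf.so is already a symlink; nothing to do" >&2
        return 0
    fi
    # Keep the original under a new name, then point the usual name
    # at the logging build with a relative symlink.
    mv "$lib_dir/libdds_cf.so" "$lib_dir/libdds_cf_nolog.so" &&
        (cd "$lib_dir" && ln -s libdds_cf_log.so libdds_cf.so)
}
```

To undo the swap, remove the symlink and rename libdds_cf_nolog.so back to libdds_cf.so.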

Then, set the DDS_DEBUG environment variable to 7, and run the test:

export DDS_DEBUG=7
ros2 run demo_nodes_py talker 2>&1 | grep -E 'chatter|UDP' > talker_debug.log

And, for completeness, you could do the same on the 'echo' side.

I would expect the log to look a little like this:

...
1539870361.028823409: UDP         : DATA   : read msg from 127.0.0.1:43700 (fd 6) (748 bytes)
1539870361.028854505: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.028872756: UDP         : DATA   : read msg from 127.0.0.1:43700 (fd 6) (112 bytes)
1539870361.028900436: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.028918015: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.028937638:             : DISCVRY: EXISTING WRITER...alive on topic rt/chatter
1539870361.028947979: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.028969146: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.079378326:             : DATA   : Reader(     DCPSPublication) [01060A00.00460000.2FBB0001.000003C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079400643: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079433241:             : DATA   : Reader(    DCPSSubscription) [01060A00.00460000.2FBB0001.000004C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079436873: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079446069:             : DATA   : Reader(  ParticipantMessage) [01060A00.00460000.2FBB0001.000200C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079449156: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079521036: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.079548636: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.079566803: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870362.033470897:             : DATA   : Writer(          rt/chatter): new change 1
...
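
When comparing a real log against the excerpt above, it can help to see which peers the UDP reads actually came from. The following is a rough sketch that assumes the log lines have the same layout as the excerpt; the function name count_udp_reads_by_peer is made up for illustration.

```shell
# Hypothetical helper: count "read msg from <ip>:<port>" lines per source IP.
# Reads a DDS_DEBUG log on stdin and prints "<ip> <count>", sorted by address.
count_udp_reads_by_peer() {
    # Keep only the UDP read lines, reduce each to its source IP, then tally.
    grep -E 'UDP +: DATA +: read msg from ' |
        sed -E 's/.*read msg from ([0-9.]+):[0-9]+ .*/\1/' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example, count_udp_reads_by_peer < talker_debug.log lists each source address with the number of reads. A remote peer that never appears in the list suggests its packets are not arriving.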

ChuiVanfleet commented 6 years ago

Clark, I've been working with Bryant on this issue. Here are the logs:

chatter_with_debug.tar.gz

There are 4 files:

debug output of the listener
debug output of the talker while the listener was running
debug output of echo
debug output of the talker while ros2 topic echo was running

We really appreciate your help on this. Let me know if there is anything else we can do to help resolve this.

thanks.

ClarkTucker commented 6 years ago

OK. That's very helpful. I can verify that the talker is sending samples in both scenarios. They are sent over multicast (and apparently not received). When matched with the listener (reliable), we also send a heartbeat (multicast + unicast). This allows the listener to NACK the missing sample which is then [re]sent via unicast.

When matched with echo (best_effort), the sample is sent over multicast only. As in the listener scenario, it is not received.

So, the question is, why are the multicast 'chatter' samples not being received at the listener/echo machine? [The earlier captures show that at least some of the 'discovery' data is successfully transferred...]

Could you rerun the echo scenario with an additional debug setting:

export COREDX_UDP_DEBUG=66

And a slightly different grep:

grep -E 'chatter|UDP|IP'

This should show us specifically which interface[s] coredx is trying to write to.

ChuiVanfleet commented 6 years ago

Here you go. Thank you for the quick response!

echo_udp_ip_debug.log

Also, for what it's worth, talker is running on the 172.31.255.112 computer, and the listener is running on the 172.31.255.103 computer.

ClarkTucker commented 6 years ago

Cool, thanks. Could you send the 'talker' side as well?

ChuiVanfleet commented 6 years ago

My bad. We ran both talker and echo again.

echo_with_debug_66.tar.gz

ClarkTucker commented 6 years ago

I think I've got it. Because the two computers share a 'common' IP address [172.17.0.1], we are incorrectly(?) inferring that the two applications (talker + echo) are hosted on the same computer. This impacts how we write multicast packets, resulting in the observed behavior.

If the 'common' 172.17.0.1 address is not required, then my first recommendation would be to change it so that it is unique to each computer.
If that is not possible, then you could configure CoreDX to not use that address. This can be achieved by setting the IP address explicitly with export COREDX_IP_ADDR=172.31.255.xyz. Alternatively, it can be achieved by tailoring the UDP transport configuration [this would require mods to rmw_coredx, which currently just uses a default UDP transport configuration].
Finally, you could configure CoreDX to ignore the fact that it thinks the two applications are hosted on the same machine. The setting CoreDX_UdpTransportConfig . try_to_keep_mcast_local = FALSE (0) should do the trick. [This would also require some modification of the rmw_coredx layer to support udp transport configuration.]
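
For the COREDX_IP_ADDR option, the address could be picked from the host's interface list instead of being typed by hand. This is a minimal sketch that assumes a Linux host with the iproute2 ip command; the helper name first_addr_with_prefix is made up for illustration, and the 172.31.255. prefix is the test subnet used in this thread.

```shell
# Hypothetical helper: read `ip -4 -o addr show` output on stdin and print
# the first IPv4 address that begins with the given dotted prefix.
first_addr_with_prefix() {
    awk -v prefix="$1" '
        $3 == "inet" {
            # Field 4 looks like "172.31.255.103/24"; drop the prefix length.
            split($4, parts, "/")
            if (index(parts[1], prefix) == 1) { print parts[1]; exit }
        }'
}

# Usage on a real host (run in the shell that later starts the ROS 2 node):
#   COREDX_IP_ADDR=$(ip -4 -o addr show | first_addr_with_prefix "172.31.255.")
#   export COREDX_IP_ADDR
```

If no interface matches the prefix, the helper prints nothing, so it is worth checking that COREDX_IP_ADDR is non-empty before starting the node.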

ChuiVanfleet commented 6 years ago

So I'm confused about this 'common' IP address. In all the logs that we've sent you, all other NICs were disabled, leaving only the connection on the 172.31.255.1/24 subnet. Where is this 172.17.0.1 address coming from? Is that the UDP multicast address?

Thanks for helping me understand.

ChuiVanfleet commented 6 years ago

So setting the COREDX_IP_ADDR variable appears to work for us.

ClarkTucker commented 6 years ago

CoreDX queries the OS for all the 'up' network interfaces. For example, on the .103 machine, we get this:


1539879209.990466447: IP          : TRANSPT: INTERFACES: 
1539879209.990468904: IP          : TRANSPT:    IfIndex: 13 family IPv4  addr: 172.17.0.1:0 mcast: 1 loop: 0
1539879209.990470701: IP          : TRANSPT:    IfIndex: 18 family IPv4  addr: 172.31.255.103:0 mcast: 1 loop: 0
1539879209.990472709: IP          : TRANSPT:    IfIndex: 18 family IPv6  addr: fe80:0:0:0:fa7d:947:76ea:5884,0 (scp:18) mcast: 1 loop: 0
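
To compare hosts, the IPv4 addresses can be pulled out of an interface listing like the one above. This sketch assumes the log lines keep the same "family IPv4  addr: <ip>:<port>" layout; the function name list_ipv4_addrs is made up for illustration.

```shell
# Hypothetical helper: read a COREDX_UDP_DEBUG log on stdin and print each
# IPv4 interface address once, one per line.
list_ipv4_addrs() {
    # -n with the p flag prints only the lines where the substitution matched.
    sed -n -E 's/.*family IPv4 +addr: ([0-9.]+):[0-9]+ .*/\1/p' | sort -u
}
```

Running it on the log from each computer and comparing the two lists shows any address that both hosts report.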

ClarkTucker commented 6 years ago

And, by default, we will make use of all 'up' interfaces.

I'm glad to hear that setting COREDX_IP_ADDR worked.

ChuiVanfleet commented 6 years ago

So both of our computers do have Docker installed, which is using that 172.17.0.1 IP address. Let me try disabling that network interface and running the test again. Do you have Docker installed on your two test machines as well?

ClarkTucker commented 6 years ago

Nope. Just a single interface.

ChuiVanfleet commented 6 years ago

We just removed the Docker IP interface and all appears to be working correctly. Even if Docker is installed on one of the computers, CoreDX works fine.

If I understand correctly, and correct me if I'm wrong, CoreDX checks the IP address of the publisher and subscriber to determine if they are on the same computer or not. However, in cases where Docker is installed, CoreDX will always assume that the publisher and subscriber are on the same machine. Could it be changed to use something more unique, like a MAC address, instead?

Thank you for your help!

ClarkTucker commented 6 years ago

In general, I think your analysis is correct. However, I would say it slightly differently to indicate that it is not really tied to Docker, and that the behavior is not mandatory:

Each CoreDX participant checks the IP address of each discovered peer participant to determine if they are on the same computer or not. In cases where identical IP addresses are detected, CoreDX will, by default, assume that the two participants are on the same machine. This default behavior can be disabled with the CoreDX_UdpTransportConfig . try_to_keep_mcast_local flag.
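
The check described above can be illustrated with a small sketch. This is not CoreDX's actual code, only a model of the described default: if any address appears in both participants' address lists, they are treated as being on the same machine. The function names and the one-address-per-line file format are assumptions made for illustration.

```shell
# Hypothetical model of the "same host" inference described above.
# Each argument is a file with one IP address per line.

# Print every address in the second list that also appears in the first.
shared_addrs() {
    # -F: fixed strings, -x: match whole lines only, -f: patterns from a file.
    grep -F -x -f "$1" "$2"
}

# Succeed (exit 0) if at least one address is present in both lists.
looks_like_same_host() {
    shared_addrs "$1" "$2" > /dev/null
}
```

With the addresses from this thread, both hosts list 172.17.0.1, so looks_like_same_host succeeds even though 172.31.255.112 and 172.31.255.103 differ. Once the shared address is removed from either list, it fails.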

Concerning using MAC address for this test: The only information we are guaranteed to have about a peer is IP address. We don't have any information about the MAC address of discovered peers, otherwise that might be a better test.

ChuiVanfleet commented 6 years ago

Okay. I understand. Thanks again for your help and quick replies!

ClarkTucker commented 6 years ago

OK, Thanks for your patience and help as we worked through this! I really appreciate it!

tocinc / rmw_coredx

best effort subscription not working between two computers #30