Works for me... is there something more you could share that describes what happens?
We're running through an unmanaged switch. Both computers are plugged into it, and there's no gateway.
That configuration seems like it should work.
What configuration are you running?
The same.
I've emailed you a link to our build.
Your build also works for me.
Would it be possible to get a network packet capture taken on one of the two hosts? Start the capture, run the two test programs, wait for a bit (30 seconds?), then stop capture...
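Something like this with tcpdump would do it (the interface name and filename here are just placeholders; Wireshark or dumpcap works equally well and writes .pcapng directly):
sudo tcpdump -i eth0 -w ros2_discovery_test.pcap
# ...start the talker and echo on the two machines, wait ~30 seconds, then stop with Ctrl-C...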
ctucker@ubuntu_2:~/asi_ros2$ ros2 topic echo /chatter std_msgs/String
data: 'Hello World: 2'
data: 'Hello World: 3'
data: 'Hello World: 4'
data: 'Hello World: 5'
data: 'Hello World: 6'
data: 'Hello World: 7'
on_listener_computer.pcapng.tar.gz
It doesn't look like I was seeing any of the autodiscovery from the other computer, but maybe I was just looking at it wrong.
Yep. Can you take a capture on the other computer?
test_two.tar.gz
It looks like the echo computer saw autodiscovery stuff this time (weird). I've attached captures taken at the same time on both the talker and echo computers.
In that last set of captures, it looks like discovery completed successfully, and I can see that there was a match on the /chatter topic. However, no DATA messages show up at all. Is it possible that you are running a firewall on either machine?
The built-in ufw is the only one I know of, and it's disabled on both computers. And messages do get through if we subscribe reliable. It's just the best-effort subscription (echo) that doesn't work.
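For reference, this is easy to double-check on each machine:
sudo ufw status   # expected to report "Status: inactive" when the firewall is off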
Hmmm. I get very different captures when I run the two programs:
ros2 run demo_nodes_py talker
ros2 topic echo /chatter
They create only a single DDS DataWriter / DataReader on the "/chatter" topic, and none of the others that I see in your capture[s] (for example, "/talker/get_parametersReply", "/talker/get_parameter_typesReply", etc.).
Are you running a different test?
Ah. The other computer was running the cpp talker by accident. That includes parameter services. The python nodes don't. We could make another capture without it if that helps.
OK, that explains it, I just wanted to make sure I was looking at the right thing.
I still can't reproduce this locally... Let's try using the 'log' version of the coredx library:
mv libdds_cf.so libdds_cf_nolog.so
ln -s libdds_cf_log.so libdds_cf.so
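A quick way to confirm the swap took effect (run from the directory holding the CoreDX libraries):
ls -l libdds_cf*.so   # libdds_cf.so should now be a symlink pointing at libdds_cf_log.so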
Then, set the DDS_DEBUG environment variable to 7, and run the test:
export DDS_DEBUG=7
ros2 run demo_nodes_py talker 2>&1 | grep -E 'chatter|UDP' > talker_debug.log
And, for completeness, you could do the same on the 'echo' side.
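For reference, the echo side would look something like this (the output filename is just a suggestion):
export DDS_DEBUG=7
ros2 topic echo /chatter std_msgs/String 2>&1 | grep -E 'chatter|UDP' > echo_debug.log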
I would expect the log to look a little like this:
...
1539870361.028823409: UDP : DATA : read msg from 127.0.0.1:43700 (fd 6) (748 bytes)
1539870361.028854505: UDP : DATA : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.028872756: UDP : DATA : read msg from 127.0.0.1:43700 (fd 6) (112 bytes)
1539870361.028900436: UDP : DATA : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.028918015: UDP : DATA : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.028937638: : DISCVRY: EXISTING WRITER...alive on topic rt/chatter
1539870361.028947979: UDP : DATA : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.028969146: UDP : DATA : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.079378326: : DATA : Reader( DCPSPublication) [01060A00.00460000.2FBB0001.000003C7] sending ACKNACK to Locator( UDPv4 U Address: 10.0.0.70 port:7410)
1539870361.079400643: UDP : DATA : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079433241: : DATA : Reader( DCPSSubscription) [01060A00.00460000.2FBB0001.000004C7] sending ACKNACK to Locator( UDPv4 U Address: 10.0.0.70 port:7410)
1539870361.079436873: UDP : DATA : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079446069: : DATA : Reader( ParticipantMessage) [01060A00.00460000.2FBB0001.000200C7] sending ACKNACK to Locator( UDPv4 U Address: 10.0.0.70 port:7410)
1539870361.079449156: UDP : DATA : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079521036: UDP : DATA : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.079548636: UDP : DATA : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.079566803: UDP : DATA : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870362.033470897: : DATA : Writer( rt/chatter): new change 1
...
Clark, I've been working with Bryant on this issue. Here are the logs: chatter_with_debug.tar.gz There are 4 files:
We really appreciate your help on this. Let me know if there is anything else we can do to help resolve this.
Thanks.
OK. That's very helpful. I can verify that the talker is sending samples in both scenarios. They are sent over multicast (and apparently not received). When matched with the listener (reliable), we also send a heartbeat (multicast + unicast). This allows the listener to NACK the missing sample which is then [re]sent via unicast.
When matched with echo (best_effort), the sample is sent over multicast only. This, as in the listener scenario, is not received.
So, the question is, why are the multicast 'chatter' samples not being received at the listener/echo machine? [The earlier captures show that at least some of the 'discovery' data is successfully transferred...]
Could you rerun the echo scenario with an additional debug setting:
export COREDX_UDP_DEBUG=66
And a slightly different grep:
grep -E 'chatter|UDP|IP'
This should show us specifically which interface[s] coredx is trying to write to.
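On the echo side that would end up looking something like this (same pattern as before, just with the extra variable and the wider grep):
export DDS_DEBUG=7
export COREDX_UDP_DEBUG=66
ros2 topic echo /chatter std_msgs/String 2>&1 | grep -E 'chatter|UDP|IP' > echo_udp_debug.log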
Here you go. Thank you for the quick response!
Also, for what it's worth, the talker is running on the 172.31.255.112 computer, and the listener is running on the 172.31.255.103 computer.
Cool, thanks. Could you send the 'talker' side as well?
My bad. We ran both talker and echo again.
I think I've got it. Because the two computers share a 'common' IP address [172.17.0.1], we are incorrectly(?) inferring that the two applications (talker + echo) are hosted on the same computer. This impacts how we write multicast packets, resulting in the observed behavior.
If the 'common' 172.17.0.1 address is not required, then my first recommendation would be to change it so that the two machines no longer share the same address.
If that is not possible, then you could configure CoreDX to not use that address. This can be achieved by setting the IP address explicitly with export COREDX_IP_ADDR=172.31.255.xyz (see the sketch below). Alternatively, the UDP transport configuration could be tailored [this would require modifications to rmw_coredx -- it currently just uses a default UDP transport configuration].
Finally, you could configure CoreDX to ignore the fact that it thinks the two applications are hosted on the same machine. The setting CoreDX_UdpTransportConfig.try_to_keep_mcast_local = FALSE (0) should do the trick. [This would also require some modification of the rmw_coredx layer to support UDP transport configuration.]
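For the COREDX_IP_ADDR option, a minimal sketch of what that would look like on each machine, using the addresses from your setup:
# on the talker machine
export COREDX_IP_ADDR=172.31.255.112
ros2 run demo_nodes_py talker
# on the echo machine
export COREDX_IP_ADDR=172.31.255.103
ros2 topic echo /chatter std_msgs/String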
So I'm confused about this 'common' IP address. In all the logs that we've sent you, all other NICs were disabled, leaving only the connection on the 172.31.255.0/24 subnet. Where is this 172.17.0.1 address coming from? Is that the UDP multicast address?
Thanks for helping me understand.
So setting the COREDX_IP_ADDR variable appears to work for us.
CoreDX queries the OS for all the 'up' network interfaces. For example, on the .103 machine, we get this:
1539879209.990466447: IP : TRANSPT: INTERFACES:
1539879209.990468904: IP : TRANSPT: IfIndex: 13 family IPv4 addr: 172.17.0.1:0 mcast: 1 loop: 0
1539879209.990470701: IP : TRANSPT: IfIndex: 18 family IPv4 addr: 172.31.255.103:0 mcast: 1 loop: 0
1539879209.990472709: IP : TRANSPT: IfIndex: 18 family IPv6 addr: fe80:0:0:0:fa7d:947:76ea:5884,0 (scp:18) mcast: 1 loop: 0
And, by default, we will make use of all 'up' interfaces.
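If you want to cross-check what the OS is reporting, the standard ip tool shows the same list (nothing CoreDX-specific):
ip addr show up   # the Docker bridge typically appears here as docker0 with 172.17.0.1/16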
I'm glad to hear that the setting COREDX_IP_ADDR worked.
So we both do have Docker installed, which is using that 172.17.0.1 IP address. Let me disable that network interface and try again. Do you have Docker installed on your two test machines as well?
Nope. Just a single interface.
We just removed the Docker IP interface and everything appears to be working correctly. Even if Docker is installed on only one of the computers, CoreDX works fine.
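In case it helps anyone else, this is roughly what taking the bridge down looks like (assuming the default docker0 bridge; Docker recreates it when the daemon restarts):
sudo systemctl stop docker      # stop the daemon so it doesn't recreate the bridge
sudo ip link set docker0 down   # take the 172.17.0.1 interface down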
If I understand correctly (and correct me if I'm wrong), CoreDX checks the IP addresses of the publisher and subscriber to determine whether they are on the same computer. However, in cases where Docker is installed on both machines, CoreDX will always assume that the publisher and subscriber are on the same machine, because both report the 172.17.0.1 bridge address. Could it be changed to use something more unique, like a MAC address, instead?
Thank you for your help!
In general, I think your analysis is correct. However, I would say it slightly differently to indicate that it is not really tied to Docker, and that the behavior is not mandatory:
Each CoreDX participant checks the IP address of each discovered peer participant to determine if they are on the same computer or not. In cases where identical IP addresses are detected, CoreDX will, by default, assume that the two participants are on the same machine. This default behavior can be disabled with the CoreDX_UdpTransportConfig.try_to_keep_mcast_local flag.
Concerning using a MAC address for this test: the only information we are guaranteed to have about a peer is its IP address. We don't have any information about the MAC address of discovered peers; otherwise, that might be a better test.
Okay. I understand. Thanks again for your help and quick replies!
OK, Thanks for your patience and help as we worked through this! I really appreciate it!
What works:
ros2 run demo_nodes_py talker
on one computer
ros2 run demo_nodes_py listener
on another computer
This seems to work because talker publishes as reliable and listener subscribes as reliable.
What doesn't work:
ros2 run demo_nodes_py talker
on one computer
ros2 topic echo /chatter std_msgs/String
on another computer
It seems like this not working is related to the fact that topic echo subscribes as best effort. If this is run on a single computer, everything works fine, but something about best effort subscription isn't working between computers. This is a major roadblock that will keep us from updating to bouncy.