vmayoral opened this issue 9 years ago
@vmayoral Would it be possible to send a Wireshark trace of this communication problem?
@jvoe, @GerardoPardo a capture of the traffic between RTI's Connext and Tinq is available here (RTI Connext subscriber, Tinq publisher).
Some remarks:
!!sdisca
Domain 0 (pid=223): {1}
GUID prefix: 8f937eb8:00df0d4f:03e90000
RTPS Protocol version: v2.1
Vendor Id: 1.14 - Technicolor, Inc. - Qeo
Technicolor DDS version: 4.0-0, Forward: 0
SecureTransport: none
Authorisation: Authenticated
Entity name: Technicolor Chatroom
Flags: Enabled
Meta Unicast:
UDP:172.23.1.215:7856(3) {MD,UC} H:3
Meta Multicast:
UDP:239.255.0.1:7400(4) {MD,MC} H:4
Default Unicast:
UDP:172.23.1.215:7857(1) {UD,UC} H:1
Default Multicast:
UDP:239.255.0.1:7401(2) {UD,MC} H:2
Manual Liveliness: 0
Lease duration: 50.000000000s
Endpoints: 10 entries (5 readers, 5 writers).
000001-3, {22}, InlineQoS: No, Writer, imu/simple_msgs::dds_::Vector3_
000002-4, {24}, InlineQoS: No, Reader, imu/simple_msgs::dds_::Vector3_
Topics:
BuiltinParticipantMessageReader/ParticipantMessageData
BuiltinParticipantMessageWriter/ParticipantMessageData
SEDPbuiltinPublicationsReader/PublicationBuiltinTopicData
SEDPbuiltinPublicationsWriter/PublicationBuiltinTopicData
SEDPbuiltinSubscriptionsReader/SubscriptionBuiltinTopicData
SEDPbuiltinSubscriptionsWriter/SubscriptionBuiltinTopicData
SPDPbuiltinParticipantReader/ParticipantBuiltinTopicData
SPDPbuiltinParticipantWriter/ParticipantBuiltinTopicData
imu/simple_msgs::dds_::Vector3_
Security: level=Unclassified, access=any, RTPS=clear
Resend period: 10.000000000s
Destination Locators:
UDP:239.255.0.1:7400(4) {MD,MC} H:4
TCP:239.255.0.1:7400 {MD,MC}
Discovered participants:
Peer #0: {25} - Local activity: 18.05s
GUID prefix: ac1701d7:00000d46:00000001
RTPS Protocol version: v2.1
Vendor Id: 1.1 - Real-Time Innovations, Inc. - Connext DDS
Meta Unicast:
UDPv6:2:7:207:::7410 {MD,UC}
UDP:172.23.1.215:7410 {MD,UC}
Meta Multicast:
UDP:239.255.0.1:7400(4) {MD,MC} H:4
Default Unicast:
UDPv6:2:7:207:::7411 {UD,UC}
UDP:172.23.1.215:7411 {UD,UC}
Manual Liveliness: 0
Lease duration: 100.000000000s
Endpoints: 4 entries (2 readers, 2 writers).
Topics: <none>
Source:
UDP:172.23.1.215:59433 {MD,UC}
Timer = 81.51s
@vmayoral It looks like you are trying to communicate between two different DDS processes on the same machine, the first being RTI Connext and the second Tinq DDS.
Just a few thoughts ...
Hope this helps?
Yes, if you want to run RTI Connext DDS on the same machine as another implementation, you need to disable the RTI Connext DDS shared memory transport. Apologies for this; we should be smarter and detect this situation...
Disabling the shared memory transport can be done using the XML QoS configuration (recommended, so you do not touch application code) or programmatically. This is controlled by the transport_builtin.mask in the DomainParticipantQos.
You can find examples of each of the two approaches here: http://community.rti.com/comment/851#comment-851 http://community.rti.com/kb/why-doesnt-my-rti-connext-application-communicate-rti-connext-application-installed-windows
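In C, the programmatic approach looks roughly like the sketch below. This is only an illustration: the names follow the RTI Connext C API as I remember it (ndds_c.h, DDS_TRANSPORTBUILTIN_UDPv4, etc.), so please check them against your installed version and the linked examples.

```c
/* Sketch: disable RTI Connext's built-in shared memory transport before
 * creating the participant, leaving only UDPv4 enabled.
 * API names follow the RTI Connext C API; verify against your version. */
#include "ndds/ndds_c.h"

DDS_DomainParticipant *create_udp_only_participant(int domain_id)
{
    DDS_DomainParticipantFactory *factory =
        DDS_DomainParticipantFactory_get_instance();
    struct DDS_DomainParticipantQos qos = DDS_DomainParticipantQos_INITIALIZER;

    if (DDS_DomainParticipantFactory_get_default_participant_qos(factory, &qos)
            != DDS_RETCODE_OK)
        return NULL;

    /* Keep only the UDPv4 built-in transport (drops SHMEM). */
    qos.transport_builtin.mask = DDS_TRANSPORTBUILTIN_UDPv4;

    /* QoS finalization omitted for brevity. */
    return DDS_DomainParticipantFactory_create_participant(
        factory, domain_id, &qos, NULL, DDS_STATUS_MASK_NONE);
}
```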
@jvoe thanks for taking a look at the capture. You are right, my bad. Please find a new capture with all the interfaces enabled here.
@GerardoPardo thanks for your input; however, the shared memory transport is already disabled (otherwise we would not be able to interoperate between PrismTech's OpenSplice and Connext on the same machine, which we are doing). The issue must be somewhere else.
Thanks both for your support.
@vmayoral It looks like RTI Connext is sending to the loopback address instead of to one of the announced Tinq DDS locators. This leads to the ICMP Destination Unreachable messages of course, since we only have sockets on the announced locators (see 'scx' output).
@GerardoPardo Any reason why RTI Connext is doing this? Using the loopback address as a source is normal, but I would expect that the SPDP announced locators would be used as the destinations.
@jvoe Yes I noticed the same thing. A few thoughts come to mind:
(1) We considered that it would not make sense to announce 127.0.0.1, since it is not a routable IP address and can only be used if you are on the same host, which you can deduce from looking at the IP addresses. So we never expect to see it in the announced locators...
(2) Sending to localhost avoids the NIC hardware, so it is presumably more efficient than sending to the external IP address.
(3) In case there are multiple NICs and multiple locators announced, sending to just the one localhost address is less work than sending to all the IPs.
So our UDP transport assumes that if it is enabled, then "localhost" is being listened to... I can see why this is confusing, especially given it is not specified anywhere...
Do you see a problem with always listening on localhost?
BTW I can confirm that the "default to shared memory" issue is already fixed and our next product release will not exhibit this OOB interoperability annoyance.
@GerardoPardo There are a few reasons why we don't listen to the loopback locator by default:
@jvoe thank you for the detailed explanation. I see how what you are doing makes sense in your situation.
It was my understanding that the handling of loopback vs. external IP addresses was OS-specific: while most desktop/server OSs may be smart enough to automatically avoid going to the NIC, other embedded OSs would rely on having the correct configuration of the routing table, and the actual path followed could be different. I have not looked at this in recent years, so this information could be dated or even wrong...
The approach of being smart about which interface is "responsive" seems very neat, but if I understood correctly it would only work for reliable traffic, so the best-effort traffic would still be sent multiple times? The "localhost" trick would work for best effort as well. That said, I like very much that your approach is able to optimize the multi-NIC/IP traffic even when sending from a different computer, which our "localhost" trick cannot handle...
I need to think about this more, but it would seem that, depending on how the internal middleware is architected, this caching may not be so trivial to implement. We process the ACKNACKs completely at a layer above the transport, so by the time we receive one and correlate it to the HeartBeat we no longer remember how the ACKNACK was received. In fact it would be legal for us to send a HeartBeat on one transport, like UDP, and receive the ACKs on a completely different transport. The RTPS layer is happy as long as an ACK is received and does not care how...
So I think we need to answer two things:
(a) What is the best approach that we can follow quickly to avoid this type of out-of-the-box interoperability issue?
(b) Going forward, what should the best way to handle this multi-NIC/IP situation be, and how do we get this into RTPS 2.3? It would be nice to have something that would: b1) work for all kinds of traffic (reliable and best effort); b2) also work when sending to a different machine, while still being smart, because if there are different physical networks the application may want to send on the different paths for redundancy (this is something many of our customers rely on).
As far as (a), being biased here :), it would seem that if by default you listened on localhost for incoming packets it would address the interoperability issue... For us it would be hard to use the actual IP addresses rather than localhost in the short term, as we would need to implement something similar to the caching/learning you describe. We would rather do that in conjunction with (b).
Regarding (b), I think it would be very good to have you as a member of the RTPS 2.3 revision task force. Any chance you guys may join the OMG? If not, we can work with you on the side, but it would be nice to have you influence our direction more directly... If we can close some of these issues by March 2015, when RTPS 2.3 comes out, that would be great.
Just for an additional data point: CoreDX DDS does listen on localhost, by default; specifically to support on-machine interop with RTI. We do not write to localhost unless specifically configured to do so.
@GerardoPardo Some implementations have a specific route entry for local IP addresses, specifying the interface; others don't. The first will force a loopback to occur in all cases; in the second it is not so clear whether this is detected before the packet is sent to the NIC.
So it is indeed a bit murky whether we can always assume, in all cases, that specifying a local non-loopback address is effective for looping back. On the other hand, I haven't seen an IP stack implementation yet that didn't handle this properly, efficient or not.
I did a bit of testing on both Linux and Windows to see if there is a difference in latency between a local address and the loopback address, but I don't see any difference. In fact, the latency variations are larger than the difference between the two addressing methods. The advantage of the loopback address is that it makes it possible to communicate before an actual IP address is assigned (via DHCP or manually), and that can indeed be important for bootstrap purposes.
I suppose that if we add a loopback receive destination socket, this would enable communication between the two DDS implementations :-) ...
The best approach might be to use it as a configurable implicit fallback locator in Tinq DDS, so that anything received on it would be handled as a normal valid receive, and the source loopback address would then be handled as a valid source locator for reply purposes.
Use of the loopback address as a destination when there are still other locators for that destination should always be an implementation option, I think, but it should be encouraged in order to handle cases where no IP addresses are assigned yet.
Not every implementation will check if an IP address is really a local one. In fact, this is next to impossible for TCP locators where NAT is used and both local and remote might have the same (local) IP address.
If the data is received via the loopback IP or IPv6 receive socket, then you can really be sure that it is a local host. Alternatively, if received on a UDP or UDPv6 receive locator socket with the source IP identical to the destination IP address, you can also be sure. All other cases should be seen as non-local host receives though.
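For what it's worth, a minimal sketch of such a loopback receive socket and the local-host check could look like the following. This is plain BSD sockets, not the actual Tinq transport code, and it assumes the regular unicast socket is bound to the interface address rather than INADDR_ANY (otherwise SO_REUSEADDR handling would be needed):

```c
/* Sketch only: an extra receive socket bound to 127.0.0.1 on the same port
 * as an announced unicast locator, so peers that send to loopback are heard. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int open_loopback_rx_socket(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;          /* add this fd to the receive select()/poll() set */
}

/* A receive is certainly local-host if it arrived via the loopback socket,
 * or if the source IP equals the destination IP of a normal receive socket. */
int is_local_host_receive(int via_loopback, const struct in_addr *src,
                          const struct in_addr *dst)
{
    return via_loopback || src->s_addr == dst->s_addr;
}
```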
As to the notion of the source/reply IP locator in Tinq DDS, this is handled by requiring every transport subsystem to add the source locator as an extra argument when calling the rtps_receive() function. One of the first actions there is to store this locator in the RTPS receive context. Specific submessage receive functions will then use this data to update the reply locator when it is still empty.
The reply locator is kept in the proxy context. Note that an InfoReply has precedence, so that the InfoReply data will be used in preference to the source locator in the reply. If the reply locator is not set in the proxy, we send on all participant locators (meta or user). Whenever we detect that there is a communication problem, i.e. no reply after N HeartBeat transmissions, the reply locator is cleared.
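In rough C, the flow described above is something like the sketch below. Every name here is a hypothetical illustration to show the mechanism, not the actual Tinq internals:

```c
/* Hedged sketch of the reply-locator handling described above. */
#include <stdint.h>
#include <string.h>

#define MAX_MISSED_HB 4                 /* stand-in for "N HeartBeats" */

typedef struct {                        /* simplified RTPS locator */
    uint32_t kind;                      /* 0 == not set */
    uint32_t port;
    uint8_t  address[16];
} locator_t;

typedef struct {                        /* per-proxy (remote endpoint) state */
    locator_t reply_locator;
    unsigned  missed_heartbeats;
} proxy_t;

typedef struct {                        /* per-message receive context */
    locator_t src_locator;              /* stored at the start of rtps_receive() */
    int       have_info_reply;
    locator_t info_reply_locator;
} rx_ctx_t;

/* Called from submessage handlers: learn a reply locator if none is set.
 * An InfoReply submessage takes precedence over the transport source locator. */
static void update_reply_locator(proxy_t *p, const rx_ctx_t *ctx)
{
    if (p->reply_locator.kind == 0)
        p->reply_locator = ctx->have_info_reply ? ctx->info_reply_locator
                                                : ctx->src_locator;
}

/* Called when a HeartBeat went unanswered: after N misses, clear the learned
 * locator so sending falls back to all announced participant locators. */
static void heartbeat_unanswered(proxy_t *p)
{
    if (++p->missed_heartbeats >= MAX_MISSED_HB) {
        memset(&p->reply_locator, 0, sizeof(p->reply_locator));
        p->missed_heartbeats = 0;
    }
}
```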
This mechanism might still be useful for Best Effort connections, albeit somewhat less safe, by optionally registering either the SPDP or the Builtin Participant Message topics source locator IP addresses, converting the meta port numbers to user port numbers. Of course, strictly speaking, this shouldn't be done, since in theory there is no clean relationship between meta and user locators.
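With the default RTPS port-mapping parameters, that conversion boils down to adding the difference between the user and meta unicast offsets (d3 - d1 = 1), which matches the ports in the trace above. A peer is free to use non-default parameters, which is exactly why the relationship is not clean in general:

```c
/* Sketch of the meta-to-user unicast port conversion, valid only when the
 * peer uses the default RTPS port-mapping parameters
 * (PB = 7400, DG = 250, PG = 2, d1 = 10, d3 = 11). */
#include <stdint.h>
#include <stdio.h>

#define RTPS_D1 10      /* default metatraffic unicast port offset */
#define RTPS_D3 11      /* default user traffic unicast port offset */

static uint16_t meta_to_user_port(uint16_t meta_unicast_port)
{
    return (uint16_t)(meta_unicast_port + (RTPS_D3 - RTPS_D1));
}

int main(void)
{
    /* Ports from the trace above: Tinq meta 7856 -> user 7857,
     * RTI Connext meta 7410 -> user 7411. */
    printf("%u %u\n", meta_to_user_port(7856), meta_to_user_port(7410));
    return 0;
}
```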
Regarding OMG membership, this is no longer an option, I'm afraid. Company politics have decided to go the Allseen/AllJoyn road for future consumer IoT strategies and are no longer interested in DDS based solutions.
There are various reasons why this happened, the main one wanting to be in a bigger group of companies, especially because that group is backed by companies like Qualcomm and Microsoft. We were almost alone in promoting a DDS-based IoT solution for consumer devices and didn't have enough backing for it, even though many people still think that we have/had a superior solution.
I wouldn't mind helping out personally, of course, but it would, by necessity, have to be outside of the scope of the OMG :-)
@ClarkTucker Thanks for your input regarding the CDR encapsulation offset. I guess that means that all implementations have the same behavior now ... :-)
I'll add a loopback receive socket just as CoreDX DDS has (as explained in a previous post) for interop with RTI.
@vmayoral I'll let you know when I have something ready ..
@vmayoral Using this loopback socket mechanism is not so simple to do, tbh. I managed to get something working on a device without IP addresses configured (all interfaces disabled) when only the Multicast destination addresses are used, but this is clearly not a nice solution.
Learning the reply locators doesn't seem to work in this case -- the combination of using send_udp locators for sending and separate receive locators for receiving, as used on the embedded board, currently precludes learning a correct source port. The send_udp locator uses a random source port (since no bind() is done, as it is used for any destination, user or meta), which can't be correlated to any proper participant locator, since there are none!
The alternative, i.e. using the receive locators as sending sockets directly would lead to issues on NuttX, as the sockets can't be bidirectional there because they are used in different threads.
Another alternative, which I haven't explored fully, would be to have two sending UDPv4 locators (and two for UDPv6) per domain, one for user data and one for meta data. This might work, by binding them to the correct source port, but it requires a lot more work to set things up, and I'm not sure whether it will really work. Since there would then be two locators on the same port, i.e. one with a wildcard IP on the sending side and one bound completely on the receiving side, this will clearly cause issues. It will be OS-specific if and how this would work. So I'm not eager to go in this direction, unless we abandon separate send/receive locators altogether, which would clearly be an issue for you.
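For reference, binding the sending socket to a fixed source port (so that the source port seen by the peer matches an announced locator) would look roughly like the plain BSD sockets sketch below; whether such a socket can coexist with a fully bound receive socket on the same port is exactly the OS-specific question raised above:

```c
/* Sketch: a sending UDP socket bound to a fixed source port, so the source
 * port seen by the peer can be correlated with an announced locator.
 * Plain BSD sockets, not Tinq code. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int open_bound_tx_socket(unsigned short src_port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);   /* wildcard IP, fixed port */
    addr.sin_port = htons(src_port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* use for sendto() only; receives stay on the RX locators */
}
```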
Both requirements thus seem to be mutually exclusive: separate sockets for send/receive (the NuttX select() limitation) and multiple DDS instances on the same host without IP addresses assigned.
Once a valid IP address is assigned, the correlation can be done, of course. If this is good enough for you, I could send you a patch.
When testing with RTI Connext, it seems that discovery is not completing properly (using the Desktop implementation):
Seems like SPDP does its job but not SEDP.