ros2 / ros2_embedded_nuttx

This repository isn't actively being worked on. If you would like to take over maintainership please open a ticket on https://github.com/ros2/ros2

Tinq Desktop <-> OpenSplice interoperability #25

Closed vmayoral closed 9 years ago

vmayoral commented 9 years ago

It doesn't seem straightforward to interoperate between the different DDS implementations.

The types in OpenSplice are of the kind: simple_msgs::dds_::Vector3_

When trying to set up this type in Tinq, I get a segmentation fault:

!!sdisca
...
            UDP:239.255.0.1:7400(4) {MD,MC} H:4
        Default Unicast: 
            UDP:172.23.1.215:7701 {UD,UC}
        Default Multicast: 
            UDP:239.255.0.1:7401(2) {UD,MC} H:2
        Manual Liveliness: 0
        Lease duration: 50.000000000s
        Endpoints: 8 entries (4 readers, 4 writers).
Segmentation fault (core dumped)

There's a null pointer somewhere and an issue with this kind of type name.

vmayoral commented 9 years ago

Issue seems to be at https://github.com/ros2/ros2_embedded_nuttx/blob/master/dds/src/dds/domain.c#L2238.

vmayoral commented 9 years ago

The publisher also stops with a segmentation fault (about 10s after the subscriber stopped; it seems related to the discovery packets):

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff630a700 (LWP 22729)]
0x0000000000472f04 in disc_publication_remove (pp=0x7ffff0000dc0, 
    wp=0x7ffff0001ac0) at ../../src/disc/disc_pub.c:403
403     Domain_t    *dp = tp->domain;

When debugging the subscriber:

$2 = {entity = {flags = 16, type = 5, handle = 33}, u = {participant = 0x7ffff0000dc0, publisher = 0x7ffff0000dc0, subscriber = 0x7ffff0000dc0}, entity_id = {
    id = "\000\000\001\003", w = 50397184}, topic = 0x0, qos = 0x0, ucast = 0x0, mcast = 0x0, next = 0x0, rtps = 0x0}

The topic is a null pointer (it doesn't seem to be processed appropriately). This behavior is also observed when using a Tinq <-> Tinq pub/sub example. Refer to:

vmayoral commented 9 years ago

Using only a single "topic name" and "type" (previously I was mixing several "types" with the same "topic name") seems to have fixed the issue. This might be something to look into in the future.

vmayoral commented 9 years ago

I made sure that both ends (OpenSplice and Tinq) are using doubles in the Vector3 definition (Tinq represents them as DDS_FLOAT_64_TYPE).

Still, Tinq and OpenSplice do not interoperate:

Tinq publishing - OpenSplice subscribed
Tinq output

sending messages of the kind x=1+i, y=2+i, z=3+i with i incrementing on every iteration

IMU accel message: x=1.000000, y=2.000000 and z=3.000000
IMU accel message: x=2.000000, y=3.000000 and z=4.000000
IMU accel message: x=3.000000, y=4.000000 and z=5.000000
IMU accel message: x=4.000000, y=5.000000 and z=6.000000
IMU accel message: x=5.000000, y=6.000000 and z=7.000000
IMU accel message: x=6.000000, y=7.000000 and z=8.000000
IMU accel message: x=7.000000, y=8.000000 and z=9.000000
IMU accel message: x=8.000000, y=9.000000 and z=10.000000
IMU accel message: x=9.000000, y=10.000000 and z=11.000000
IMU accel message: x=10.000000, y=11.000000 and z=12.000000
IMU accel message: x=11.000000, y=12.000000 and z=13.000000
IMU accel message: x=12.000000, y=13.000000 and z=14.000000
IMU accel message: x=13.000000, y=14.000000 and z=15.000000
...
OpenSplice output
-------------------------
Got accel x=0
Got accel y=5.31794e-315
Got accel z=5.31665e-315
-------------------------
-------------------------
Got accel x=0
Got accel y=5.31859e-315
Got accel z=5.31729e-315
-------------------------
-------------------------
Got accel x=0
Got accel y=5.31924e-315
Got accel z=5.31794e-315
-------------------------
-------------------------
Got accel x=0
Got accel y=5.31988e-315
Got accel z=5.31859e-315
-------------------------
-------------------------
Got accel x=0
Got accel y=5.32053e-315
Got accel z=5.31924e-315
-------------------------
-------------------------
Got accel x=0
Got accel y=5.32086e-315
Got accel z=5.31988e-315
-------------------------
-------------------------
Got accel x=0
Got accel y=5.32118e-315
Got accel z=5.32053e-315
-------------------------
-------------------------
Got accel x=0
Got accel y=5.3215e-315
Got accel z=5.32086e-315
-------------------------
-------------------------
Got accel x=0
Got accel y=5.32183e-315
Got accel z=5.32118e-315
-------------------------
...
OpenSplice publishing - Tinq subscribed
Tinq output
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
IMU accel message: x=0.000000, y=0.000000 and z=0.000000
...
OpenSplice output

sending messages of the kind x=1, y=2, z=3 all the time

published Vector3 ROS msg #1
published Vector3 ROS msg #2
published Vector3 ROS msg #3
published Vector3 ROS msg #4
published Vector3 ROS msg #5
published Vector3 ROS msg #6
published Vector3 ROS msg #7
published Vector3 ROS msg #8
published Vector3 ROS msg #9
published Vector3 ROS msg #10
published Vector3 ROS msg #11
published Vector3 ROS msg #12
...
jvoe commented 9 years ago

@vmayoral We did interop testing with RTI a long time ago using the freely downloadable RTI Shapes demo as well as with the TwinOaks CoreDX DDS implementation. Both worked well at the time.

Be aware that this was before we implemented the X-types typecode library. Once we used that, we couldn't do interoperability testing anymore, since neither TwinOaks nor RTI had an X-types compatible DDS at the time, especially since Qeo mandated the use of mutable types for communication, and that was only available in X-types.

We never did interoperability tests with PrismTech since we didn't have their software.

The results of your tests seem to clearly indicate that there is indeed a problem which needs to be fixed. Can you give a Wireshark trace of all the erroneous cases? I'll be glad to take a look at these issues. It would be helpful to have a publisher on both sides as well, to see what PrismTech actually expects.

vmayoral commented 9 years ago

@jvoe please find here a description of how to reproduce my current scenario as well as two wireshark captures.

jvoe commented 9 years ago

@vmayoral I looked at the PrismTech RTPS packets. Didn't know those guys sent so much garbage on the wire, TBH ;-) So many weird/non-standard topics and lots of vendor-specific PIDs as well, wow.

Anyway, looking at the output, I see the same data being sent continuously. I thought you would send incrementing floating point numbers from the PrismTech source?

This actual user data is constantly being sent after the RTPS header info, representing a sequence of 3 doubles in little-endian notation: 00:00:00:00:00:00:f0:3f:00:00:00:00:00:00:00:40:00:00:00:00:00:00:08:40

I'm not sure this is correct, since this doesn't really look like IEEE 754 encoded numbers.

Also, 196 bytes to send 3 doubles? Quite a bit of overhead, I must say.
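
For reference, those 24 bytes do parse as three consecutive little-endian IEEE 754 doubles with values 1.0, 2.0 and 3.0, matching the x=1, y=2, z=3 values PrismTech reports publishing further down in the thread (the misinterpretation turns out to come from alignment, as discussed below). A minimal, self-contained C sketch of the decode, assuming a little-endian host:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The 24 payload bytes quoted above (after the RTPS and encapsulation headers). */
    static const uint8_t payload[24] = {
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf0, 0x3f,   /* 1.0 */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40,   /* 2.0 */
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x40    /* 3.0 */
    };
    double v[3];

    /* Reinterpret the bytes as doubles; valid on a little-endian host,
       matching the CDR_LE encapsulation used on the wire. */
    memcpy(v, payload, sizeof v);
    printf("x=%f y=%f z=%f\n", v[0], v[1], v[2]);   /* prints x=1.000000 y=2.000000 z=3.000000 */
    return 0;
}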

vmayoral commented 9 years ago

@jvoe thanks a lot for taking the time to look into it.

Anyway, looking at the output, I see the same data being sent continuously. I thought you would send incrementing floating point numbers from the PrismTech source?

From PrismTech OpenSplice I'm always sending the same 3 values as described here.

Also, 196 bytes to send 3 doubles? Quite a bit of overhead, I must say.

It does indeed sound like too much. We'll keep looking into it.

jvoe commented 9 years ago

@vmayoral Hmm... I might have been a bit hasty in my conclusions. It looks like a natural alignment issue which is handled differently in Tinq and PrismTech. According to the official CORBA spec, 8-byte quantities such as doubles need to be aligned on an 8-byte boundary, starting from the beginning of the payload data, which is defined to start at the beginning of the serializedData field, i.e. at the Encapsulation kind field (00 01). That's why Tinq inserts 4 padding bytes before the first double, after the Encapsulation options (00 00), since the packet offset is at that moment 4. PrismTech doesn't seem to do that. The result is that the CDR interpretation between the two is different, leading to severe misinterpretations of the data. It might be worth rereading the CORBA v3.1.1 - Part 2: CDR specification (chapters 9.4.1 and 9.4.1.1 are quite clear on this). So, I don't think that Tinq is wrong here.
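
To make the difference concrete, here is a layout sketch of the serializedData field for a Vector3 of three doubles under both conventions (illustrative offsets only, not taken from an actual capture):

    Alignment counted from the start of serializedData (Tinq, as described above):
        offset  0-1    encapsulation kind     00 01  (CDR_LE)
        offset  2-3    encapsulation options  00 00
        offset  4-7    padding                00 00 00 00
        offset  8-15   x (double)
        offset 16-23   y (double)
        offset 24-31   z (double)

    Alignment counted from just after the encapsulation header (PrismTech, as observed):
        offset  0-3    encapsulation header   00 01 00 00
        offset  4-11   x (double)
        offset 12-19   y (double)
        offset 20-27   z (double)

A reader that assumes the other convention reads every double shifted by 4 bytes, which is consistent with the near-zero and denormal values shown in the outputs above.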

jvoe commented 9 years ago

@vmayoral As a workaround, you might consider using single-precision floating point numbers instead of double-precision floating point. Not only are those handled in hardware by the Cortex-M4 floating point unit (contrary to double-precision numbers, which are handled in software), but you won't have alignment issues in DDS packets, making interoperability possible.
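
The reason this sidesteps the problem: 4-byte floats only need 4-byte alignment, and the stream offset right after the 4-byte encapsulation header is a multiple of 4 under either convention, so both vendors serialize the type identically. A hypothetical single-precision Vector3, for illustration only:

struct Vector3f {   /* hypothetical float-based variant, not from the repository */
    float x;        /* stream offset 4 (or 0 after the header): 4-byte aligned either way, no padding */
    float y;        /* offset 8 (or 4) */
    float z;        /* offset 12 (or 8) */
};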

As far as I know (I was present at the OMG conferences in Paris and Berlin), the interop tests that are/were done at those conferences only use the shapes demo, which has a very simple data topic:

struct ShapeType { string<128> color; int x; int y; int shapesize; };

Of course, this data type has no issues with 8-byte alignments.

I'm not sure about RTI DDS since I only used their Shapes demo, but I recall that we once had an interop problem with TwinOaks CoreDX DDS, where a similar 8-byte alignment issue clearly showed that something was different in the TwinOaks and Technicolor DDS implementations. At the time we used the same alignment as PrismTech, i.e. aligning from the start of the CDR data instead of from the Encapsulation kind field. After discussing this with the developer of TwinOaks, who referred us to the CORBA documentation, we agreed that we were wrong and changed the alignment offset to what it is now. As a result, we had no more interop issues with TwinOaks.

vmayoral commented 9 years ago

@jvoe thanks a lot for your time and for helping us understand the problem. You were right: using a message consisting of single-precision floating point numbers allowed Tinq and OpenSplice to interoperate.

In order to avoid possible communication issues between different DDS vendors in the future, we'll try to ping RTI and PrismTech about this matter.

In the meantime, could you provide a pointer to the place in the code where Tinq inserts the 4 padding bytes at the beginning of the serializedData field? We'd like to set up a compiler flag to selectively include this padding or not, depending on the user's needs.

jvoe commented 9 years ago

@vmayoral This is something that will need to be changed in lots of places, and is not a trivial thing to do, but be my guest if you want to have a try :-) ...

I would personally be very hesitant to do this exercise, knowing full well from first-hand experience how much effort it took to validate the correctness of the typecode in so many complex and different cases (check out the test/cdr directory for a test program to test correct typecode operation). Correct X-types CDR marshalling and unmarshalling is really hard to do, I'm afraid. Also, remember Edsger Dijkstra's quote: "Testing only proves the presence of bugs, never the absence."

The padding thing is not only at the beginning of the serializedData field, btw. Padding is a requirement in CDR encoding for all type sizes > 1. So a single byte followed by a short (2 bytes) introduces 1 byte of padding. A single byte followed by a long (4 bytes) results in 3 bytes of padding. And we're not talking about embedded substructure paddings yet, which makes things even more complex. Check out the ALIGN() and CDR_ALIGN() macros, for example in src/xtypes/xcdr.c, and count how many times this occurs. Don't try to fix the PrismTech interop by emptying the macros, since you'll break the interop with all vendors for all but the most simple types!
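
For readers unfamiliar with CDR padding rules, here is a generic alignment helper of the kind those macros implement; this is only a sketch, not the actual ALIGN()/CDR_ALIGN() definitions from xcdr.c:

#include <stddef.h>
#include <stdio.h>

/* Round a stream offset up to the next multiple of 'align' (a power of two).
   Sketch only -- not the real Tinq macros. */
#define CDR_ALIGN_UP(ofs, align)  (((ofs) + (align) - 1) & ~((size_t) (align) - 1))

int main (void)
{
    size_t ofs;

    /* Single byte followed by a short: 1 byte of padding. */
    ofs = 1;                        /* after the octet           */
    ofs = CDR_ALIGN_UP (ofs, 2);    /* -> 2 (1 padding byte)     */
    ofs += 2;                       /* after the short -> 4      */

    /* Single byte followed by a long: 3 bytes of padding. */
    ofs = 1;                        /* after the octet           */
    ofs = CDR_ALIGN_UP (ofs, 4);    /* -> 4 (3 padding bytes)    */
    ofs += 4;                       /* after the long -> 8       */

    printf ("final offset: %zu\n", ofs);
    return 0;
}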

This, for example, is the first occurrence I noticed while quickly browsing through xtypecode.c for the initial 4-byte extra offset specification, but it appears in lots of places, as you'll notice from the dp + 4 and 4 arguments in the many xcdr and pl_cdr function calls in that file alone.

We really didn't do this for fun. There was an actual interoperability problem that was only solved by adding this extra complexity to the code. And this was done only after days of discussions and spec-reading. Please read the CORBA specifications on CDR encoding, and you'll understand the complexities much better.

Be aware that the PrismTech code doesn't use RTPS by default. They still use an alternative/proprietary protocol when talking to other PrismTech DDS nodes and only use RTPS when another vendor's DDS wants to communicate. I'm not saying that it is 100% certain that Tinq is right and PrismTech is wrong because of this, but I've seen plenty of spec misunderstandings so far to understand that it is easy to make mistakes with these things.

GerardoPardo commented 9 years ago

Hi,

In CDR, primitive types should definitely be aligned to their length (8 in the case of double). The issue, as you stated, is where to start counting the offset used for the alignment: before or after the CDR Encapsulation header.

Earlier versions of RTI Connext DDS (4.2 and before) started counting the offset before the CDR Encapsulation header. This was changed more than 5 years ago with 4.3. However, we can still generate the old alignment by calling rtiddsgen with the option -user42aligment.

Which is correct? The answer in the case of the "Parameter List Encapsulation" can be found in the RTPS spec. RTPS states that the CDR stream is logically reset after each Parameter, so that the data will not have initial padding after the parameter (see section 9.4.2.11 of RTPS version 2.2, http://www.omg.org/spec/DDSI-RTPS/2.2).

Since the mutable-type encapsulation reuses the ParameterList encapsulation, this is how mutable types should compute alignments.

In the case of the "Extensible" and "Final" encapsulations, the corresponding text in section 9.4.2.12 is not so clear. It just says: "Note that when using CDR, primitive types must be aligned to their length. For example, a long must start on a 4-byte boundary. The boundaries are counted from the start of the CDR stream."

But it is not clear where the stream begins. Is it the beginning of the RTPS message, the DATA submessage, the SerializedPayload (before the encapsulation header), or the actual stream where the serialized data is put (after the encapsulation header)?

We discovered this problem in some of the first interoperability demos between RTI and PrismTech and came to the conclusion that "resetting" the stream after the encapsulation header was the technically superior approach, so we changed it to that. Unfortunately it seems we did not update the RTPS specification to clarify this...

Resetting after the encapsulation header was considered better because it is consistent with the way C compilers operate. C compilers return aligned memory on calls to malloc and calloc, so they never pad the first member, even if it is a double. Assuming that people who care about these things design their types to minimize padding, the serialization would then also not pad if we "reset" the alignment after the encapsulation header.
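
The compiler analogy in concrete terms; a small illustration, assuming a typical ABI where doubles are 8-byte aligned:

#include <stddef.h>
#include <stdio.h>

/* The in-memory layout has no leading padding: the first member sits at
   offset 0 because allocated memory is already suitably aligned. */
struct Vector3 {
    double x;
    double y;
    double z;
};

int main (void)
{
    printf ("x at %zu, y at %zu, z at %zu\n",
            offsetof (struct Vector3, x),    /* 0  */
            offsetof (struct Vector3, y),    /* 8  */
            offsetof (struct Vector3, z));   /* 16 */
    return 0;
}

Resetting the CDR stream after the encapsulation header reproduces exactly these offsets in the serialized data.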

In addition, it makes the approach of the "Extensible" and "Final" encapsulations similar to the "mutable" encapsulation.

I will file an issue with the OMG against the RTPS spec to have this clarified in the next version.

Gerardo

jvoe commented 9 years ago

@GerardoPardo @brunodebus @bramstes @vmayoral Thanks for clearing this up, Gerardo. It looks like the TwinOaks CoreDX DDS implementation we used at Technicolor had it wrong too then, since we changed our implementation to be compatible with them, although our previous implementation did it exactly as you just explained :-)

Are you aware of this still being the case in the more recent CoreDX DDS versions? I haven't spoken to Clark in a long time, so I wouldn't know.

I guess I need to make some changes in the typecode part. Of course this will break backwards compatibility, so it will have to be done with a selectable parameter. Tuning this dynamically, based on the VendorId and Version, seems a little too complex to do...

jvoe commented 9 years ago

@vmayoral I sent you a patch to fix the interop issue. You can now decide at compile time whether the initial offset of the CDR-encoded data should be either 0 (RTI > 4.2, PrismTech) or 4 (Tinq, CoreDX) with -DCDR_DOFS=x, where x = 0 or 4. I hope this helps...
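
A sketch of how such a compile-time switch could be wired in and passed to the build; the -DCDR_DOFS flag is quoted from the comment above, but the default-selection code and the make invocation below are only assumptions, not the actual patch:

/* Assumed wiring, illustration only: pick the initial CDR stream offset,
   relative to the start of the serializedData field, at compile time. */
#ifndef CDR_DOFS
#define CDR_DOFS 4      /* 4 = previous Tinq/CoreDX behaviour, 0 = RTI (>4.2)/PrismTech */
#endif

For a make-based build this would typically be passed as something like CFLAGS += -DCDR_DOFS=0.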

vmayoral commented 9 years ago

@jvoe received and applied. Communication with OpenSplice now works reliably. Thanks a lot.

RTI's Connext and Tinq are still not able to interoperate. Moving the discussion to https://github.com/ros2/ros2_embedded_nuttx/issues/26#issuecomment-64325497.

ClarkTucker commented 9 years ago

@jvoe @GerardoPardo Earlier versions of CoreDX DDS (3.4 and before) performed alignment with respect to the beginning of the CDR Encapsulation header [this was compatible with early versions of RTI]. This behavior was changed 2+ years ago (in CoreDX DDS v3.5) to align with respect to the first byte following the header. However (similar to RTI), we can still operate with the "old" alignment by calling coredx_ddl with the option "-a 1".

Hope that helps...