ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0

Data loss if Data arrives before first Heartbeat #367

Closed: Greek64 closed this issue 2 years ago

Greek64 commented 2 years ago

I'm trying to communicate with a Cyclone DDS service client, and I seem to have stumbled into a weird behaviour where, if a DATA message arrives before the very first HEARTBEAT of the RTPS writer, the arrival of the DATA message is acknowledged, but the data itself is never forwarded to the application.

I'm using the service tutorial (AddTwoInts.srv) as base.
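
For context, a minimal version of that tutorial client, roughly what is being run here (a sketch following the standard example_interfaces tutorial, not my exact code):

```cpp
// Minimal AddTwoInts client following the standard ROS 2 tutorial.
#include <memory>

#include "example_interfaces/srv/add_two_ints.hpp"
#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("add_two_ints_client");
  auto client =
    node->create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");

  auto request = std::make_shared<example_interfaces::srv::AddTwoInts::Request>();
  request->a = 2;  // matches the {2,3} payload visible in the write_sample trace below
  request->b = 3;

  client->wait_for_service();
  auto future = client->async_send_request(request);

  // In the failing case the client hangs here: the response DATA was
  // received and ACKed at the RTPS level but never handed to the reader.
  if (rclcpp::spin_until_future_complete(node, future) ==
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(node->get_logger(), "Sum: %ld", (long)future.get()->sum);
  }
  rclcpp::shutdown();
  return 0;
}
```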

Looking through the cyclonedds.log at "finest" verbosity, we can see how the Cyclone Request Writer and Response Reader are created.
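
(For reproducibility: a trace like this can be enabled by pointing the CYCLONEDDS_URI environment variable at a configuration along these lines; Verbosity and OutputFile are standard Cyclone DDS Tracing options:)

```xml
<CycloneDDS>
  <Domain id="any">
    <Tracing>
      <!-- "finest" produces the per-message trace shown below -->
      <Verbosity>finest</Verbosity>
      <OutputFile>cyclonedds.log</OutputFile>
    </Tracing>
  </Domain>
</CycloneDDS>
```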

1643368747.787843 [0]     client: new_writer(guid 9f6a1001:6da1b6c8:bb56f65c:1503, (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_)
1643368747.787864 [0]     client: ref_participant(9f6a1001:6da1b6c8:bb56f65c:1c1 @ 0x55e7e1b07e70 <- 9f6a1001:6da1b6c8:bb56f65c:1503 @ 0x55e7e1b66114) user 22 builtin 8
1643368747.787900 [0]     client: WRITER 9f6a1001:6da1b6c8:bb56f65c:1503 QOS={user_data=52<"clientid=9f.6a.10.1.6d.a1.b6.c8.bb.56.f6.5c.0.0.0.7;">,topic_name="rq/add_two_intsRequest",type_name="example_interfaces::srv::dds_::AddTwoInts_Request_",topic_data=0<>,group_data=0<>,durability=0,durability_service=0:0:1:-1:-1:-1,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=1:1000000000,lifespan=9223372036854775807,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,ownership=0,ownership_strength=0,presentation=0:0:0,partition={},transport_priority=0,adlink_entity_factory=1,adlink_writer_data_lifecycle=1}
[...]
1643368747.788147 [0]     client: new_reader(guid 9f6a1001:6da1b6c8:bb56f65c:1604, (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_)
1643368747.788168 [0]     client: ref_participant(9f6a1001:6da1b6c8:bb56f65c:1c1 @ 0x55e7e1b07e70 <- 9f6a1001:6da1b6c8:bb56f65c:1604 @ 0x55e7e1b67894) user 23 builtin 8
1643368747.788198 [0]     client: READER 9f6a1001:6da1b6c8:bb56f65c:1604 QOS={user_data=52<"clientid=9f.6a.10.1.6d.a1.b6.c8.bb.56.f6.5c.0.0.0.7;">,topic_name="rr/add_two_intsReply",type_name="example_interfaces::srv::dds_::AddTwoInts_Response_",topic_data=0<>,group_data=0<>,durability=0,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=1:1000000000,destination_order=0,history=1:-1,resource_limits=-1:-1:-1,ownership=0,presentation=0:0:0,partition={},time_based_filter=0,transport_priority=0,type_consistency_enforcement=1:00000,adlink_entity_factory=1,adlink_reader_lifespan=0:9223372036854775807,adlink_reader_data_lifecycle=9223372036854775807:9223372036854775807,adlink_subscription_keys=0:{}}

Through SEDP, the remote Request Reader and Response Writer are discovered and "matched" to the local endpoints:

1643368772.528995 [0] dq.builtin: SEDP ST0 1d4e:ec915c7c:0:103 reliable volatile writer: (default).rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_ p(open) NEW (as udp/192.168.0.90:7401@2 udp/192.168.0.90:7411@2 ssm=0) QOS={user_data=0<>,topic_name="rr/add_two_intsReply",type_name="example_interfaces::srv::dds_::AddTwoInts_Response_",topic_data=0<>,group_data=0<>,durability=0,durability_service=0:0:1:-1:-1:-1,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=1:100000000,lifespan=9223372036854775807,destination_order=0,history=0:1,resource_limits=-1:-1:-1,ownership=0,ownership_strength=0,presentation=0:0:0,partition={},transport_priority=0,adlink_writer_data_lifecycle=0}
1643368772.529239 [0] dq.builtin: match_proxy_writer_with_readers(pwr 1d4e:ec915c7c:0:103) scanning all rds of topic rr/add_two_intsReply
1643368772.529274 [0] dq.builtin:   reader 9f6a1001:6da1b6c8:bb56f65c:1604 init_acknack_count = 1
1643368772.529301 [0] dq.builtin:   reader_add_connection(pwr 1d4e:ec915c7c:0:103 rd 9f6a1001:6da1b6c8:bb56f65c:1604)
1643368772.529329 [0] dq.builtin:   proxy_writer_add_connection(pwr 1d4e:ec915c7c:0:103 rd 9f6a1001:6da1b6c8:bb56f65c:1604)
[...]
1643368772.529594 [0] dq.builtin: SEDP ST0 1d4e:ec915c7c:0:4 reliable volatile reader: (default).rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_ p(open) NEW (as udp/192.168.0.90:7401@2 udp/192.168.0.90:7411@2 ssm=0) QOS={user_data=0<>,topic_name="rq/add_two_intsRequest",type_name="example_interfaces::srv::dds_::AddTwoInts_Request_",topic_data=0<>,group_data=0<>,durability=0,deadline=9223372036854775807,latency_budget=0,liveliness=0:9223372036854775807,reliability=1:100000000,destination_order=0,history=0:1,resource_limits=-1:-1:-1,ownership=0,presentation=0:0:0,partition={},time_based_filter=0,transport_priority=0,type_consistency_enforcement=1:00000,adlink_reader_lifespan=0:9223372036854775807,adlink_reader_data_lifecycle=9223372036854775807:9223372036854775807,adlink_subscription_keys=0:{}}
1643368772.529705 [0] dq.builtin: match_proxy_reader_with_writers(prd 1d4e:ec915c7c:0:4) scanning all wrs of topic rq/add_two_intsRequest
1643368772.529736 [0] dq.builtin:   proxy_reader_add_connection(wr 9f6a1001:6da1b6c8:bb56f65c:1503 prd 1d4e:ec915c7c:0:4)
1643368772.529764 [0] dq.builtin:   writer_add_connection(wr 9f6a1001:6da1b6c8:bb56f65c:1503 prd 1d4e:ec915c7c:0:4) - ack seq 0

Afterwards, the local Request Writer sends the cached service request over the wire:

1643368772.630328 [0]     client: write_sample 9f6a1001:6da1b6c8:bb56f65c:1503 #1: ST0 rq/add_two_intsRequest/example_interfaces::srv::dds_::AddTwoInts_Request_:{17752991966044001665,1,{2,3}}
1643368772.630543 [0]     client: xpack_addmsg 0x55e7e1b5cb20 0x55e7e1b32720 0(data(9f6a1001:6da1b6c8:bb56f65c:1503:#1/1)): niov 0 sz 0 => now niov 3 sz 92

And gets a service response

1643368772.631828 [0]       recv: DATA(1d4e:ec915c7c:0:103 -> 9f6a1001:6da1b6c8:bb56f65c:0 #1 L(:1c1 99625.171990) 1d4e:ec915c7c:0:103 -> 9f6a1001:6da1b6c8:bb56f65c:0: no heartbeat seen yet)

After it gets the first HEARTBEAT from the Response Writer, it acknowledges having received the sample, but the user application never gets the actual data (the client hangs waiting):

1643368773.027827 [0]       recv: HEARTBEAT(#61:1..1  L(:1c1 99625.567966)1d4e:ec915c7c:0:103 -> 9f6a1001:6da1b6c8:bb56f65c:0: 9f6a1001:6da1b6c8:bb56f65c:1604@1(sync))
1643368773.027990 [0]        tev: acknack 9f6a1001:6da1b6c8:bb56f65c:1604 -> 1d4e:ec915c7c:0:103: F#1:2/0:
1643368773.028085 [0]        tev: send acknack(rd 9f6a1001:6da1b6c8:bb56f65c:1604 -> pwr 1d4e:ec915c7c:0:103)

If, on the other hand, the Response Writer only sends the DATA message in response to ACKNACKs (i.e. RTPS pushMode=FALSE), the data is received and returned to the user application as expected:

1643372421.893738 [0]       recv: HEARTBEAT(#62:1..1  L(:1c1 103274.433855)1d4e:ec915c7c:0:103 -> ab691001:4a4536f2:42041845:0: ab691001:4a4536f2:42041845:1604@0(sync))
1643372421.893865 [0]        tev: xpack_addmsg 0x7f6e04001380 0x7f6e08003fd0 0(control): niov 0 sz 0 => now niov 2 sz 64
1643372421.894184 [0]        tev: acknack ab691001:4a4536f2:42041845:1604 -> 1d4e:ec915c7c:0:103: F#2:1/1:1
1643372421.894259 [0]        tev: send acknack(rd ab691001:4a4536f2:42041845:1604 -> pwr 1d4e:ec915c7c:0:103)
[...]
1643372422.095025 [0]       recv: DATA(1d4e:ec915c7c:0:103 -> ab691001:4a4536f2:42041845:0 #1 L(:1c1 103274.635164) => EVERYONE
1643372422.095318 [0]       recv: data(application, vendor 0.0): 1d4e:ec915c7c:0:103 #1: ST0 rr/add_two_intsReply/example_interfaces::srv::dds_::AddTwoInts_Response_:{16168460377599915105,1,{5}})

Additional Information

Version Information:

ros-galactic-rmw-cyclonedds-cpp/focal,now 0.22.4-1focal.20211222.231803 amd64 [installed,automatic]
ros-galactic-rmw-dds-common/focal,now 1.2.1-2focal.20211011.181136 amd64 [installed,automatic]
ros-galactic-rmw-implementation-cmake/focal,now 3.3.1-1focal.20210507.022500 amd64 [installed,automatic]
ros-galactic-rmw-implementation/focal,now 2.4.1-3focal.20211222.232048 amd64 [installed,automatic]
ros-galactic-rmw/focal,now 3.3.1-1focal.20210630.221547 amd64 [installed,automatic]
ros-galactic-ros-base/focal,now 0.9.3-2focal.20211223.041146 amd64 [installed]
ros-galactic-ros-core/focal,now 0.9.3-2focal.20211223.004907 amd64 [installed]

cap2.log contains the log for the first case (DATA before HEARTBEAT), cap3.log contains the log for the second case (DATA after HEARTBEAT/ACKNACK), and PCAP.zip contains network captures for both cases. Attachments: cap2.log, cap3.log, PCAP.zip.

clalancette commented 2 years ago

@eboasson Friendly ping here.

eboasson commented 2 years ago

Hi @Greek64, what DDS implementation are you using for the service? It advertises itself as vendor 0.0, which is undefined/unknown (not illegal, as far as I know, but this is definitely not an unpatched Cyclone DDS). It matters because the DDSI specification is missing a proper handshake between reader and writer, one that is needed for correct behaviour with volatile data: which data is guaranteed to arrive, and under what circumstances a sample-lost event may need to be raised.

Cyclone implements that handshake using standard messages, but for it to work the reader needs to know a bit about the behaviour of the writer. It therefore skips the handshake if the other side is a non-Cyclone DDS implementation, and that can result in exactly this kind of data loss.

Greek64 commented 2 years ago

Hi @Greek64, what DDS implementation are you using for the service?

It is my own RTPS/DDS implementation, which I have not released publicly yet. It is basically an RTPS/DDS implementation written in VHDL for FPGA boards. That is also why you may see some unconventional behaviour, like always NACKing the next 32 sequence numbers regardless of what the HEARTBEAT said (which should theoretically not be illegal, as I read the spec). My end goal is to build some kind of ROS 2 VHDL API, and to that end I'm testing my implementation's interoperability with Cyclone DDS, since it is the current default RMW implementation.
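
For reference, this is what that ACKNACK content looks like (a sketch following the DDSI-RTPS 2.2 PSM field layout; the fixed 32-SN window is my implementation's choice, not something the spec mandates):

```cpp
#include <cstdint>

// AckNack submessage content as defined by the DDSI-RTPS 2.2 PSM.
struct SequenceNumberSet {
  int64_t  bitmapBase;  // first sequence number covered by the bitmap
  uint32_t numBits;     // valid bits in the bitmap (at most 256)
  uint32_t bitmap[8];   // bit i set => bitmapBase + i is requested (NACKed);
                        // only (numBits + 31) / 32 words go on the wire
};

struct AckNack {
  uint32_t          readerId;       // EntityId of the acknowledging reader
  uint32_t          writerId;       // EntityId of the addressed writer
  SequenceNumberSet readerSNState;  // everything below bitmapBase is ACKed
  uint32_t          count;          // incremented for every AckNack sent
};

// The FPGA reader always fills readerSNState with bitmapBase = (highest SN
// received so far) + 1, numBits = 32 and an all-ones bitmap, i.e. it NACKs
// the next 32 sequence numbers regardless of the firstSN..lastSN range the
// HEARTBEAT advertised.
```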

It matters because the DDSI specification is missing a proper handshake between reader and writer, one that is needed for correct behaviour with volatile data: which data is guaranteed to arrive, and under what circumstances a sample-lost event may need to be raised.

Ah, yes. For how detailed the DDSI-RTPS specification is, it is missing a lot. (They haven't even defined what a serialized key payload is, and every vendor somehow winged it.)

My implementation handles volatile as follows:

Cyclone implements that handshake using standard messages, but for it to work the reader needs to know a bit about the behaviour of the writer.

So, I guess the question comes down to this: what behaviour does Cyclone DDS expect, so that I can mimic it?

eboasson commented 2 years ago

It is my own RTPS/DDS implementation, which I have not released publicly yet. It is basically an RTPS/DDS implementation written in VHDL for FPGA boards.

Nice! We discussed one some time ago, and indeed some friends and I thought of doing something of the sort around 2005 (so pre-DDSI, almost pre-DDS), but nothing ever came of it. This is really cool.

every vendor somehow winged it

Not sure about that ... try disposing an instance with RTI. All you get is a DDSI keyhash despite what it says in the spec — that they wrote! — and if the maximum size of the serialized key doesn't fit in 16 bytes, they expect you to make do with an MD5 hash. For an implementation like Cyclone DDS that doesn't organise things by MD5 hash ...

So, I guess the question comes down to this: what behaviour does Cyclone DDS expect, so that I can mimic it?

Don't retransmit any historical data to a volatile reader, and get whitelisted by Cyclone (by vendor id, so that's going to be a bit tricky if you stick with 0.0 [1]) so that the reader sends a retransmit request for everything. The whitelisting is needed because the spec doesn't say what will happen, so Cyclone has to depend on implementation-specific behaviour.
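
In pseudologic, the reader-side decision looks roughly like this (illustrative only, not actual Cyclone DDS source; all names are invented):

```cpp
// Illustrative pseudologic only, not Cyclone DDS source; names invented.
enum class SyncMode { RequestAll, WaitForHeartbeat };

// Decision taken when a reliable, volatile proxy writer is matched.
SyncMode on_proxy_writer_match(bool vendor_whitelisted) {
  if (vendor_whitelisted) {
    // The writer is known never to resend historical data to volatile
    // readers, so the reader can safely NACK everything from SN 1 onward.
    return SyncMode::RequestAll;
  }
  // Unknown writer: sync to the range advertised by the first HEARTBEAT.
  // Any DATA arriving before that HEARTBEAT cannot be classified and is
  // treated as historical -- which is exactly the loss seen in this issue.
  return SyncMode::WaitForHeartbeat;
}
```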

Usually, a volatile writer won't have much historical data around (but it may), and all that happens if you do send historical data is that it shows up in the reader. So you might want to try it out [2].

Also, I would suggest opening any new issues that pop up at https://github.com/eclipse-cyclonedds/cyclonedds, as long as they don't have anything to do with ROS 2 or the RMW layer. That makes the most sense to me, and would be (I think) the most polite towards those who monitor ROS 2 related things but not Cyclone DDS related things.

[1] Getting a vendor id is easy, by the way (see https://www.dds-foundation.org/dds-rtps-vendor-and-product-ids/); or you could just pick one and change to an "official" one later.

[2] If you want to try it out without worrying about vendor ids, see (I think those are the only places where it matters):

Greek64 commented 2 years ago

Also, I would suggest opening any new issues that pop up at https://github.com/eclipse-cyclonedds/cyclonedds, as long as they don't have anything to do with ROS 2 or the RMW layer. That makes the most sense to me, and would be (I think) the most polite towards those who monitor ROS 2 related things but not Cyclone DDS related things.

Yeah, I was unsure whether the "issue" was in the RMW layer or in the underlying DDS implementation, so I thought of opening the issue here (since I'm effectively testing ROS service interoperability).

Not sure about that ... try disposing an instance with RTI. All you get is a DDSI keyhash despite what it says in the spec

Yeah, isn't it also fast-rtps that assumes a PID_KEY_HASH with every instance state change (aka PID_STATUS_INFO)?

Getting a vendor id is easy, by the way (see https://www.dds-foundation.org/dds-rtps-vendor-and-product-ids/);

That was my plan for later, when I release the implementation in a public GitHub repo.

or you could just pick one and change to an "official" one later.

I was unsure about the "legality" of spoofing other vendor IDs, but I guess for testing purposes I could do that.

Don't retransmit any historical data to a volatile reader

Well, from the perspective of my implementation it is not sending historical data; it just so happens that the data arrives before the HEARTBEAT timeout (HEARTBEATs are triggered by a separate fixed timer) and is sent before the first HEARTBEAT. So I guess the correct thing to do is to trigger a HEARTBEAT immediately on a remote reader match; see the sketch below. That should inform the reader of what is historical data.
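
Something along these lines (a sketch with invented helper names; the Heartbeat fields follow the DDSI-RTPS 2.2 PSM):

```cpp
#include <cstdint>

// Heartbeat submessage content as defined by the DDSI-RTPS 2.2 PSM.
struct Heartbeat {
  uint32_t readerId;  // EntityId of the addressed reader
  uint32_t writerId;  // EntityId of the sending writer
  int64_t  firstSN;   // oldest sequence number still available
  int64_t  lastSN;    // newest sequence number written so far
  uint32_t count;     // incremented for every Heartbeat sent
};

// Minimal writer state for the sketch.
struct Writer {
  uint32_t entity_id;
  int64_t  first_available_sn;
  int64_t  last_written_sn;
  uint32_t heartbeat_count;
};

// Invented transport hook, assumed to serialize and send the submessage.
void send_submessage(uint32_t reader_entity_id, const Heartbeat & hb);

// The fix: emit a Heartbeat the moment SEDP matches a remote reader,
// instead of waiting for the periodic timer. The reader then knows the
// writer's firstSN..lastSN range before any DATA can arrive, so a fresh
// sample is never misclassified as historical.
void on_remote_reader_matched(Writer & w, uint32_t reader_entity_id)
{
  Heartbeat hb{reader_entity_id, w.entity_id, w.first_available_sn,
    w.last_written_sn, ++w.heartbeat_count};
  send_submessage(reader_entity_id, hb);
}
```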

And since the DATA is not actually "lost" but regarded as historical data by design, this issue has resolved itself.

Thanks for the info and help!

vmayoral commented 1 year ago

@Greek64, I was kindly referred by @eboasson to this thread.

I believe we share common interests and are looking at similar things. I work at Acceleration Robotics and an FPGA-native RTPS implementation is something that we've been considering. Do you think we could chat and exchange some thoughts?

Greek64 commented 1 year ago

@vmayoral Sure thing.

vmayoral commented 1 year ago

Sure thing.

Fantastic! Do you think you could drop me an e-mail to victor at accelerationrobotics.com? We'll take it from there.