Bug related to Reliable resend of old sample after emission of a new one

Summary

Bug discovered in tagged version 0.9.0

In Reliable mode, if sample A is sent first but not received, then sample B is sent, the middleware will drop sample B, waiting for sample A, to make sure to have a proper history. Then, sample B should be re-emitted and received by the DataReader. This doesn't seem to be the case.

It seems that sample A is never received, even if it is re-sent.

Steps to reproduce

Test setup configuration

Tested on Ubuntu 22.04. The test uses iptables to drop packets.

We give iptables rights to the user executing the test:

/etc/sudoers

test_runner ALL=(ALL) NOPASSWD: /usr/sbin/iptables

Minimal reproducible example

minimal_reproducible_example.zip

The following diagram explains the minimal reproducible example:

On the network capture, we see that samples are re-sent correctly, and ACK packets are good too, but we don't see any logs about receiving Sample 2 in the application.

What is the expected correct behavior?

Receiving application receives all samples.

Relevant logs and/or screenshots

Minimal reproducible example logs:

2024-02-20 10:37:55.667223 | main       | INFO  | rdds_xxx     | Creating DomainParticipant
2024-02-20 10:37:55.671392 | main       | INFO  | rdds_xxx     | Creating Subscriber
2024-02-20 10:37:55.671535 | main       | INFO  | rdds_xxx     | Creating Data Writer
2024-02-20 10:37:55.671840 | main       | INFO  | rdds_xxx     | Starting Subscriber thread
2024-02-20 10:37:55.671864 | main       | INFO  | rdds_xxx     | Starting Publishing
2024-02-20 10:37:55.671956 | main       | INFO  | rdds_xxx     | Wait 2 seconds to make sure the DataWriter and DataReader matched themselves.
2024-02-20 10:37:55.672112 | ReceiverTh | INFO  | rdds_xxx     | DDS receiver thread started
2024-02-20 10:37:55.673280 | ReceiverTh | INFO  | rdds_xxx     | DataReader event: SubscriptionMatched { total: CountWithChange { count: 1, count_change: 1 }, current: CountWithChange { count: 1, count_change: 1 }, writer: GUID {01125c4b32
55512c8dd3551a EntityId {[0, 0, 1] EntityKind::WRITER_WITH_KEY_USER_DEFINED}} }
2024-02-20 10:37:57.672216 | main       | INFO  | rdds_xxx     | Publishing 1
2024-02-20 10:37:57.674486 | ReceiverTh | INFO  | rdds_xxx     | Received 1
2024-02-20 10:37:57.735232 | main       | INFO  | rdds_xxx     | Start dropping UDP packets with payload 'DropMe'
2024-02-20 10:37:57.735293 | main       | INFO  | rdds_xxx     | Wait 1 sec
2024-02-20 10:37:58.735612 | main       | INFO  | rdds_xxx     | Publishing 2
2024-02-20 10:37:58.735784 | main       | INFO  | rdds_xxx     | Wait 1 sec
2024-02-20 10:37:58.736114 | RustDDS Pa | WARN  | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-02-20 10:37:58.736242 | RustDDS Pa | WARN  | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-02-20 10:37:58.736298 | RustDDS Pa | WARN  | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-02-20 10:37:58.736348 | RustDDS Pa | WARN  | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-02-20 10:37:58.736398 | RustDDS Pa | WARN  | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-02-20 10:37:59.751278 | main       | INFO  | rdds_xxx     | Stop dropping UDP packets with payload 'DropMe'
2024-02-20 10:37:59.751340 | main       | INFO  | rdds_xxx     | Publishing 3
2024-02-20 10:37:59.801671 | main       | INFO  | rdds_xxx     | Wait 5 secs
2024-02-20 10:37:59.877035 | ReceiverTh | INFO  | rdds_xxx     | Received 3
2024-02-20 10:38:04.802077 | RustDDS Pa | ERROR | rustdds::rtp | Event for unknown writer EntityId {[0, 0, 1] EntityKind::WRITER_WITH_KEY_USER_DEFINED}

Minimal reproducible example commented network capture

Complete test-plan sent by e-mail (Test Reliability_QOS_3).

Analysis

There must be a problem in the user API of RustDDS, between the RTPS reception and the add to the sample queue. We may want to take a look at src/dds/with_key/datasample_cache.rs.

Thank you for such a detailed report and a brilliant reproduction example!

This does indeed look like a bug in RustDDS's Reliable Reader logic, so we'll investigate.

Initial findings so far:

...
2024-07-30 12:59:39.927991 | main       | INFO  | rdds_344     | Publishing 1
2024-07-30 12:59:39.928985 | ReceiverTh | INFO  | rdds_344     | Received 1
2024-07-30 12:59:39.997281 | main       | INFO  | rdds_344     | Start dropping UDP packets with payload 'DropMe'
2024-07-30 12:59:39.997394 | main       | INFO  | rdds_344     | Wait 1 sec
2024-07-30 12:59:40.998122 | main       | INFO  | rdds_344     | Publishing 2
2024-07-30 12:59:40.998338 | main       | INFO  | rdds_344     | Wait 1 sec
2024-07-30 12:59:40.998598 | RustDDS Pa | WARN  | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-07-30 12:59:40.998713 | RustDDS Pa | WARN  | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-07-30 12:59:42.018813 | main       | INFO  | rdds_344     | Stop dropping UDP packets with payload 'DropMe'
2024-07-30 12:59:42.018950 | main       | INFO  | rdds_344     | Publishing 3
2024-07-30 12:59:42.019101 | main       | INFO  | rdds_344     | Wait 5 secs
2024-07-30 12:59:42.133660 | ReceiverTh | INFO  | rdds_344     | Received 2
2024-07-30 12:59:47.020708 | RustDDS Pa | ERROR | rustdds::rtp | Event for unknown writer EntityId {[0, 0, 1] EntityKind::WRITER_WITH_KEY_USER_DEFINED}

I am able to reproduce the issue, but it is not the same!

Now receiver reports "Received 2" instead of "Received 3", like you were getting. And sample number 3 is not received.

We will continue looking into this.

This was indeed a bug in DataReader's caching and metadata generation logic. It was possible for a Reliable DataReader to deliver samples to the application in the wrong order, if they are received out-of-order, which is what you are testing here.

I just released RustDDS 0.10.2, where this is fixed.

However, the result that you are receiving only sample 2, and sample 3 is lost, is because in your test code the DataReader has QoS History::KeepLast depth =1. It is not specified in the code, but DDS spec says this is the default. This causes the DataReader's receive buffer have effectively only one slot, and therefore no space to reorder 3 and 2when they are received out of order. This can be fixed simply by adding .history(History::KeepLast{ depth: 2}) to the QoS at receiving side.

My test results with History depth=2 are as follows:

RustDDS 0.9.0: received 1,2,3 - ok
RustDDS 0.10.0: received 1,3,2 - incorrect order
RustDDS 0.10.1: received 1,3,2 - incorrect order
RustDDS 0.10.2: received 1,2,3 - ok

Naturally, in order to cope with more severe data reordering, the History depth would have to be increased more.

I am still uncertain if this is the intention of the DDS spec. Is a DDS implementation allowed to lose samples with QoS Reliability::Reliable, if the QoS History does not provide a buffer large enough to reorder the samples?

If History is really necessary to get reordering to work, then that seems a bit unintuitive.
If reordering should work regardless of History, then how long reorder buffer should be provided?

If you have any thoughts on this, or manage to find a part of the DDS Spec to clarify this, please leave a comment.

I think the DDS Spec part 2.2.3.14 RELIABILITY is answering that question:

The setting of this policy has a dependency on the setting of the RESOURCE_LIMITS policy. In case the RELIABILITY kind is set to RELIABLE the write operation on the DataWriter may block if the modification would cause data to be lost or else cause one of the limits in specified in the RESOURCE_LIMITS to be exceeded. Under these circumstances, the RELIABILITY max_blocking_time configures the maximum duration the write operation may block. If the RELIABILITY kind is set to RELIABLE, data-samples originating from a single DataWriter cannot be made available to the DataReader if there are previous data-samples that have not been received yet due to a communication error. In other words, the service will repair the error and re-transmit data-samples as needed in order to re-construct a correct snapshot of the DataWriter history before it is accessible by the DataReader.

IMO:

Since ResourceLimits is not implemented yet in RustDDS, we assume its default value (Infinite).
Even if we have infinite resources, the 1st sample would be lost if we send a 2nd before every reader received it, so the writer should block. I think that in that case, the History size doesn't matter, and thus, the ResourceLimits neither.

Thanks a lot for your answer and for the fix, I'm grateful to be able to participate to this open-source project !

IMO:

Since ResourceLimits is not implemented yet in RustDDS, we assume its default value (Infinite).

Even if we have infinite resources, the 1st sample would be lost if we send a 2nd before every reader received it, so the writer should block. I think that in that case, the History size doesn't matter, and thus, the ResourceLimits neither.

Sorry, but I cannot follow what is your logic here.

The term "send" here is a bit ambiguous. Also the word "it" after "receive" is ambiguous, but I assume it refers to the "1st sample".

If we send samples 1,2,3,.., we can send them in order, with Reliability=Reliable, without waiting for an acknowledgement from all readers in between, provided that we can buffer them in the Writer for possible retransmit.

In my reading, the two spec paragraphs have a different meaning.

First paragraph says that a Reliable DataWriter may (or even should) block until timeout, if the write operation would cause data to be lost. This is a mechanism to eventually throttle writing, if a DataWriter tries to write faster than the matched Reliable DataReaders can acknowledge that they have the data. If the write call results in a timeout, then that indicates the write did not succeed.

The second paragraph states that a Reliable Datareader should receive the samples in order. If a RTPS Reader receives 1 and 3, it should deliver only 1 to DataReader and request retransmit of 2 from Writer. Note that a Writer may respond either by sending 2, or by a Gap message indicating that 2 is no longer available. In the latter case DataReader will deliver 3, and 2 is never delivered. There is no explicit indication to the application about this, but it can be detected by e.g. inspecting sample lost status in a DataReader or DomainParticipant. This is where DDS reliability is different from e.g. TCP, which will absolutely refuse to continue while there is a gap in the received data.

jhelovuo / RustDDS