Thank you for such a detailed report and a brilliant reproduction example!
This does indeed look like a bug in RustDDS's Reliable Reader logic, so we'll investigate.
Initial findings so far:
...
2024-07-30 12:59:39.927991 | main | INFO | rdds_344 | Publishing 1
2024-07-30 12:59:39.928985 | ReceiverTh | INFO | rdds_344 | Received 1
2024-07-30 12:59:39.997281 | main | INFO | rdds_344 | Start dropping UDP packets with payload 'DropMe'
2024-07-30 12:59:39.997394 | main | INFO | rdds_344 | Wait 1 sec
2024-07-30 12:59:40.998122 | main | INFO | rdds_344 | Publishing 2
2024-07-30 12:59:40.998338 | main | INFO | rdds_344 | Wait 1 sec
2024-07-30 12:59:40.998598 | RustDDS Pa | WARN | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-07-30 12:59:40.998713 | RustDDS Pa | WARN | rustdds::net | send_to_udp_socket - send_to 239.255.0.1:7401 : Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" } len=99
2024-07-30 12:59:42.018813 | main | INFO | rdds_344 | Stop dropping UDP packets with payload 'DropMe'
2024-07-30 12:59:42.018950 | main | INFO | rdds_344 | Publishing 3
2024-07-30 12:59:42.019101 | main | INFO | rdds_344 | Wait 5 secs
2024-07-30 12:59:42.133660 | ReceiverTh | INFO | rdds_344 | Received 2
2024-07-30 12:59:47.020708 | RustDDS Pa | ERROR | rustdds::rtp | Event for unknown writer EntityId {[0, 0, 1] EntityKind::WRITER_WITH_KEY_USER_DEFINED}
I am able to reproduce the issue, but it is not the same!
Now the receiver reports "Received 2" instead of the "Received 3" that you were getting, and sample number 3 is not received.
We will continue looking into this.
This was indeed a bug in the DataReader's caching and metadata generation logic. It was possible for a Reliable DataReader to deliver samples to the application in the wrong order if they were received out of order, which is what you are testing here.
I just released RustDDS 0.10.2, where this is fixed.
However, the result that you are receiving only sample 2, and sample 3 is lost, is because in your test code the DataReader has QoS History::KeepLast with depth = 1. It is not specified in the code, but the DDS spec says this is the default. This gives the DataReader's receive buffer effectively only one slot, and therefore no space to reorder 3 and 2 when they are received out of order. This can be fixed simply by adding .history(History::KeepLast { depth: 2 }) to the QoS on the receiving side, for example as sketched below.
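For illustration, a minimal sketch of what the receiving-side QoS could look like with the deeper history, assuming the usual QosPolicyBuilder pattern (the rest of your QoS, e.g. the Reliability setting, would be chained onto the same builder):

```rust
use rustdds::{policy::History, QosPolicyBuilder};

// Reader-side QoS sketch: a two-deep history gives the DataReader's
// receive buffer room to hold sample 3 while it waits for sample 2.
let reader_qos = QosPolicyBuilder::new()
    .history(History::KeepLast { depth: 2 })
    .build();
```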
My test results with History depth=2 are as follows:
Naturally, in order to cope with more severe data reordering, the History depth would have to be increased further.
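To make the relationship between History depth and reordering concrete, here is a toy, self-contained sketch (plain Rust, not RustDDS internals) of a KeepLast-style receive buffer: a gap can only be repaired if the out-of-order samples fit inside its `depth` slots, otherwise something falls out of the window and is never delivered.

```rust
use std::collections::BTreeMap;

/// Toy KeepLast-style receive buffer (illustration only, not RustDDS code).
/// Samples are keyed by sequence number; at most `depth` undelivered
/// samples are kept, and delivery to the application is strictly in order.
struct KeepLastBuffer {
    depth: usize,
    slots: BTreeMap<u64, String>,
    next_to_deliver: u64,
}

impl KeepLastBuffer {
    fn new(depth: usize) -> Self {
        Self { depth, slots: BTreeMap::new(), next_to_deliver: 1 }
    }

    /// Store an arriving sample, evict beyond `depth`, then hand out
    /// whatever has become deliverable in order.
    fn receive(&mut self, seq: u64, payload: &str) -> Vec<String> {
        self.slots.insert(seq, payload.to_string());
        while self.slots.len() > self.depth {
            // The buffer is full: the oldest buffered sample is pushed out
            // and can never be delivered. (Which sample an implementation
            // evicts is a policy detail; the point is that one is lost.)
            let oldest = *self.slots.keys().next().unwrap();
            self.slots.remove(&oldest);
            self.next_to_deliver = self.next_to_deliver.max(oldest + 1);
        }
        let mut delivered = Vec::new();
        while let Some(p) = self.slots.remove(&self.next_to_deliver) {
            delivered.push(p);
            self.next_to_deliver += 1;
        }
        delivered
    }
}

fn main() {
    // Scenario from the test: 2 is dropped on the wire, so 3 arrives
    // before the retransmitted 2.
    let mut depth1 = KeepLastBuffer::new(1);
    assert_eq!(depth1.receive(1, "one"), ["one"]);
    assert!(depth1.receive(3, "three").is_empty()); // held back, waiting for 2
    assert_eq!(depth1.receive(2, "two"), ["three"]); // only one slot: a sample is lost

    let mut depth2 = KeepLastBuffer::new(2);
    assert_eq!(depth2.receive(1, "one"), ["one"]);
    assert!(depth2.receive(3, "three").is_empty());
    assert_eq!(depth2.receive(2, "two"), ["two", "three"]); // depth 2 can reorder
}
```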
I am still uncertain if this is the intention of the DDS spec. Is a DDS implementation allowed to lose samples with QoS Reliability::Reliable, if the QoS History does not provide a buffer large enough to reorder the samples?
If History is really necessary to get reordering to work, then that seems a bit unintuitive.
If reordering should work regardless of History, then how large a reorder buffer should be provided?
If you have any thoughts on this, or manage to find a part of the DDS Spec to clarify this, please leave a comment.
I think the DDS Spec part 2.2.3.14 RELIABILITY is answering that question:
The setting of this policy has a dependency on the setting of the RESOURCE_LIMITS policy. In case the RELIABILITY kind is set to RELIABLE the write operation on the DataWriter may block if the modification would cause data to be lost or else cause one of the limits specified in the RESOURCE_LIMITS to be exceeded. Under these circumstances, the RELIABILITY max_blocking_time configures the maximum duration the write operation may block. If the RELIABILITY kind is set to RELIABLE, data-samples originating from a single DataWriter cannot be made available to the DataReader if there are previous data-samples that have not been received yet due to a communication error. In other words, the service will repair the error and re-transmit data-samples as needed in order to re-construct a correct snapshot of the DataWriter history before it is accessible by the DataReader.
Thanks a lot for your answer and for the fix, I'm grateful to be able to participate in this open-source project!
IMO:
- Since ResourceLimits is not implemented yet in RustDDS, we assume its default value (Infinite).
- Even if we have infinite resources, the 1st sample would be lost if we send a 2nd before every reader received it, so the writer should block. I think that in that case the History size doesn't matter, and thus neither does the ResourceLimits.
Sorry, but I cannot follow your logic here.
The term "send" here is a bit ambiguous. Also the word "it" after "receive" is ambiguous, but I assume it refers to the "1st sample".
If we send samples 1,2,3,.., we can send them in order, with Reliability=Reliable, without waiting for an acknowledgement from all readers in between, provided that we can buffer them in the Writer for possible retransmit.
In my reading, the two spec paragraphs have a different meaning.
First paragraph says that a Reliable DataWriter may (or even should) block until timeout, if the write operation would cause data to be lost. This is a mechanism to eventually throttle writing, if a DataWriter tries to write faster than the matched Reliable DataReaders can acknowledge that they have the data. If the write call results in a timeout, then that indicates the write did not succeed.
The second paragraph states that a Reliable DataReader should receive the samples in order. If an RTPS Reader receives 1 and 3, it should deliver only 1 to the DataReader and request retransmission of 2 from the Writer. Note that a Writer may respond either by sending 2, or by a Gap message indicating that 2 is no longer available. In the latter case the DataReader will deliver 3, and 2 is never delivered. There is no explicit indication to the application about this, but it can be detected by e.g. inspecting the sample-lost status in a DataReader or DomainParticipant. This is where DDS reliability differs from e.g. TCP, which will absolutely refuse to continue while there is a gap in the received data.
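For what it's worth, here is a small application-level sketch (plain Rust with a hypothetical helper name, not a RustDDS API) of how a receiver that embeds a running index in its payload, as the reproduction example does, can notice such a silently skipped sample:

```rust
/// Hypothetical application-level helper: if the payload carries a
/// monotonically increasing index, a jump in that index reveals that
/// the middleware skipped a sample (for example because the Writer
/// answered a retransmit request with a Gap).
fn check_for_gap(last_seen: &mut Option<u64>, received: u64) {
    if let Some(prev) = *last_seen {
        if received > prev + 1 {
            eprintln!("samples {}..={} were never delivered", prev + 1, received - 1);
        }
    }
    *last_seen = Some(received);
}

fn main() {
    let mut last_seen = None;
    // Suppose the application only ever sees payloads 1 and 3.
    for index in [1u64, 3] {
        check_for_gap(&mut last_seen, index);
        println!("Received {index}");
    }
}
```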
Summary
In Reliable mode, if sample A is sent first but not received, and sample B is then sent, the middleware will drop sample B, waiting for sample A, to make sure to have a proper history. Then, sample B should be re-emitted and received by the DataReader. This doesn't seem to be the case. It seems that sample A is never received, even if it is re-sent.
Steps to reproduce
Test setup configuration
Tested on Ubuntu 22.04. The test uses iptables to drop packets. We give iptables rights to the user executing the test:
/etc/sudoers
Minimal reproducible example
minimal_reproducible_example.zip
The following diagram explains the minimal reproducible example:
On the network capture, we see that samples are re-sent correctly and the ACK packets look correct too, but we don't see any log about Sample 2 being received in the application.
What is the expected correct behavior?
The receiving application receives all samples.
Relevant logs and/or screenshots
Minimal reproducible example logs:
Minimal reproducible example commented network capture
Complete test-plan sent by e-mail (Test Reliability_QOS_3).
Analysis
There must be a problem in the user API of RustDDS, between the RTPS reception and the addition to the sample queue. We may want to take a look at src/dds/with_key/datasample_cache.rs.