stfl opened this issue 6 years ago
If I remember correctly, when using SCTP_PR_SCTP_TTL, each message is sent at least once. It should be abandoned if the lifetime has expired before it is sent, but this is not implemented yet. Can you verify that the messages with the long delay are only transmitted once? That would back up my thinking above. If they are retransmitted, then there is a different bug in addition to the above.
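For readers following along, this is the kind of per-message lifetime being discussed; a minimal sketch using usrsctp's `usrsctp_sendv()` with `SCTP_SENDV_PRINFO`, where `sock` is an assumed, already-connected one-to-one socket:

```c
#include <string.h>
#include <usrsctp.h>

/* Sketch: send one message with a 140 ms lifetime (timed reliability).
 * Per the discussion above, the current implementation still transmits
 * the message at least once, even if the lifetime expires while it is
 * waiting in the send queue. */
static void send_with_ttl(struct socket *sock, const void *buf, size_t len)
{
    struct sctp_prinfo pri;

    memset(&pri, 0, sizeof(pri));
    pri.pr_policy = SCTP_PR_SCTP_TTL;
    pri.pr_value  = 140; /* lifetime in milliseconds */
    usrsctp_sendv(sock, buf, len, NULL, 0,
                  &pri, (socklen_t)sizeof(pri), SCTP_SENDV_PRINFO, 0);
}
```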
I am measuring the timestamps outside of usrsctp, so whatever I get is already only the payload. I can't directly see whether it's the first transmission or a retransmission. I don't have a pcap of such a test run at hand, but if further investigation is required, I will capture a pcap when I do another test run and analyze it.
btw: TTD stands for time to delivery, which is the term I use in my thesis.. ;)
Is there a plan to fix the issue or any work in progress regarding first transmission abandonment?
My issue is that I can't really use the results obtained from the current implementation for my thesis. A fix for this would help me out a lot.
Are you aware of anybody who has implemented a fix for this, even if it is not a clean fix?
Thank you, best regards, Stefan
Yepp, there is a plan. I'm not aware of anyone having an unofficial patch. I can see if @msvoelker (who is working in my lab) can have a look into this... You could help us by testing this way... I'll drop a note tomorrow.
Hi Stefan, I will have a look at this issue. I'm going to try to reproduce it with packetdrill first (probably next week). I'll get back to you once I've created a first patch.
Timo
The last couple of days, I've been observing the same problem with partial reliability, even with zero RTX or TTL set to 1. Messages are transmitted too slowly compared to other network transports, where ordered unreliable messages are delivered up to 20-30x faster at 100-200 ms RTT with 5-10 % packet loss, respectively...
Which congestion control is the "other" transport protocol using? Can you compare the throughput between reliable and unreliable transfer? Is that different? @msvoelker Can you test this in the lab?
> Which congestion control is the "other" transport protocol using?
One is a re-implementation of DCCP as described in RFC 4340, with almost the same reliability strategies, based on an ACK vector. The other is similar to CUBIC TCP, but based on a more traditional sliding window. Both are encapsulated in UDP.
> Can you compare the throughput between reliable and unreliable transfer? Is that different?
Reliable transmission is quite good and consistent in comparison to the transports I'm using, while the DCCP-like one shows better latencies under bad network conditions.
The problem is that in SCTP unreliable messages are highly affected by congestion, and as a result, I'm getting very outdated, delayed data that the application no longer needs.
And which CC are you using for DCCP? The one in RFC 4341? I just want to figure out which effect the CC you are using has, and which effect is based on the way SCTP (and possibly the stack) implemented unreliability.
Right now the default CC for SCTP is New Reno, and it is known for sub-optimal performance in high-RTT, high-loss scenarios... One could add a CUBIC implementation...
> And which CC are you using for DCCP?
I'm not sure, but it seems that it's based on CCID 2.
> Right now the default CC for SCTP is New Reno, and it is known for sub-optimal performance in high-RTT, high-loss scenarios...
Yes, I've tried HSTCP, and it shows better results than the default CC, but it's still not really suitable for real-time data transfer...
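(For reference, switching between the built-in algorithms doesn't require touching the source; a minimal sketch via the `SCTP_PLUGGABLE_CC` socket option, assuming the `SCTP_CC_*` constants from usrsctp.h and an assumed socket `sock`:)

```c
#include <string.h>
#include <usrsctp.h>

/* Sketch: select HSTCP congestion control for the association.
 * SCTP_CC_RFC2581 is the New Reno default; SCTP_CC_HTCP is another
 * built-in alternative. */
static void select_hstcp(struct socket *sock)
{
    struct sctp_assoc_value av;

    memset(&av, 0, sizeof(av));
    av.assoc_id    = 0; /* ignored on one-to-one sockets */
    av.assoc_value = SCTP_CC_HSTCP;
    usrsctp_setsockopt(sock, IPPROTO_SCTP, SCTP_PLUGGABLE_CC,
                       &av, (socklen_t)sizeof(av));
}
```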
@tuexen I dunno, is it possible to quickly disable any built-in CC, without side effects, to implement a custom algorithm?
@tuexen I've eliminated the CC from the source code and set `cwnd` to a constant value, but the problem is still there: messages are delayed for very long intervals for some reason.
@nxrighthere For ordered or unordered messages? What is the message size?
I've tried both ordered/unordered, and it doesn't make any difference.
Here's captured traffic between two machines connected over the wireless network with around 200-250 ms RTT: sctp_wifi.zip
This one is with HSTCP enabled.
> The last couple of days, I've been observing the same problem with partial reliability, even with zero RTX or TTL set to 1. Messages are transmitted too slowly compared to other network transports, where ordered unreliable messages are delivered up to 20-30x faster at 100-200 ms RTT with 5-10 % packet loss, respectively...
I'm trying to reproduce this. Can you provide concrete numbers for a concrete environment? Let's say for 200 ms RTT and 5 % packet loss rate. What is the size of the messages you are sending? How long are you running a test?
I did only small tests so far. My setup is:

- RTT = 200 ms
- Packet loss rate = 5 %
- Message size = 20 bytes

My test application was able to send 52354 messages reliably and 57541 messages unreliably in about 21 seconds.
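For reference, that works out to roughly 52354 / 21 ≈ 2490 messages/s reliable versus 57541 / 21 ≈ 2740 messages/s unreliable, i.e. only about a 10 % difference in raw message rate.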
The problem is not how many messages the transport is able to transmit; the problem is that messages are highly delayed, as explained by @stfl.
My issue is the same:
> The problem is that in SCTP unreliable messages are highly affected by congestion, and as a result, I'm getting very outdated, delayed data that the application no longer needs.
Partial reliability doesn't work properly. To get an idea of what's going wrong, you need to compare it to other popular semi-reliable transports, such as ENet, which transmits packets as efficiently as possible but without huge delays.
What I've tried, and it doesn't help (a configuration sketch follows the list):

- `SCTP_PR_SCTP_RTX` set to 0
- `SCTP_PR_SCTP_TTL` set to 1
- `SCTP_UNORDERED`
- `cwnd` set to a constant value
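A minimal sketch of the kind of configuration meant above, using the `SCTP_DEFAULT_PRINFO` socket option from RFC 6458 (the `cwnd` change was a source-level hack and has no socket-option equivalent); `sock` is an assumed socket:

```c
#include <string.h>
#include <usrsctp.h>

/* Sketch: make "abandon after 0 retransmissions" the association
 * default, matching the SCTP_PR_SCTP_RTX = 0 case from the list. */
static void set_default_rtx0(struct socket *sock)
{
    struct sctp_default_prinfo info;

    memset(&info, 0, sizeof(info));
    info.pr_policy   = SCTP_PR_SCTP_RTX;
    info.pr_value    = 0; /* abandon after 0 retransmissions */
    info.pr_assoc_id = 0; /* ignored on one-to-one sockets */
    usrsctp_setsockopt(sock, IPPROTO_SCTP, SCTP_DEFAULT_PRINFO,
                       &info, (socklen_t)sizeof(info));
}
```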
I identified @stfl's problem as related to the bug that sends messages (for the first time) even if the TTL of the message has already expired. This is still an open bug. Since you are also using the RTX policy, there seems to be something else.
Have you measured the app-to-app message delay? If so, do you see a high average, or peaks for single messages? What is the size of the messages you send?
> Have you measured the app-to-app message delay?
Yes, it's 16 ms (the application's framerate is locked to 60 frames per second).
> If so, do you see a high average, or peaks for single messages?
For a single message, delays between packets under congestion cause stalls and unresponsiveness (as you can observe in the video).
> What is the size of the messages you send?
I'm enqueuing many small messages (<30 bytes) in tight loops, which are aggregated into a single packet below the MTU (`SCTP_NODELAY` is not used).
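(For contrast, disabling the bundling would look like this minimal sketch, assuming the standard `SCTP_NODELAY` option and an assumed socket `sock`:)

```c
#include <stdint.h>
#include <usrsctp.h>

/* Sketch: disable Nagle-style bundling so each small message is sent
 * immediately instead of being aggregated up to the MTU. */
static void disable_bundling(struct socket *sock)
{
    uint32_t on = 1;

    usrsctp_setsockopt(sock, IPPROTO_SCTP, SCTP_NODELAY,
                       &on, (socklen_t)sizeof(on));
}
```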
You mean 16 ms is the average before any packet loss; once a packet is lost, congestion control adds some delay, correct?
You wrote that you found a way to disable the congestion control for your test. Do you have a total average of 16 ms then?
> You mean 16 ms is the average before any packet loss; once a packet is lost, congestion control adds some delay, correct?
Yes, 16 ms without packet loss and any external latency. Under congestion, `sstat_primary.spinfo_srtt` rises up to 3,000 ms for some reason, while the actual RTT is ~100 ms.
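(The value above comes from the `SCTP_STATUS` socket option; a minimal sketch of how it can be read, assuming an established association on `sock`:)

```c
#include <string.h>
#include <usrsctp.h>

/* Sketch: read the smoothed RTT of the primary path, in milliseconds. */
static uint32_t primary_srtt_ms(struct socket *sock)
{
    struct sctp_status st;
    socklen_t len = (socklen_t)sizeof(st);

    memset(&st, 0, sizeof(st));
    if (usrsctp_getsockopt(sock, IPPROTO_SCTP, SCTP_STATUS,
                           &st, &len) < 0) {
        return 0; /* error: e.g. no association established yet */
    }
    return st.sstat_primary.spinfo_srtt;
}
```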
> You wrote that you found a way to disable the congestion control for your test. Do you have a total average of 16 ms then?
Nope, I didn't gather that, but it doesn't change much from what I see in my tests.
Is 16 ms a good number compared to ENet? With an RTT of 100 ms, I would assume an app-to-app delay of at least 50 ms. Do you add the link delay only on the way back to the sender?
Does ENet also use a congestion control? If so, which one? Do you see higher delays in case of packet loss with ENet as well?
> Is 16 ms a good number compared to ENet?
Yes, when congestion doesn't occur, both have pretty much the same latencies, and they are good enough.
> With an RTT of 100 ms, I would assume an app-to-app delay of at least 50 ms. Do you add the link delay only on the way back to the sender?
Yes, the actual app-to-app delay rises up to ~60 ms. Lag is simulated on the server side, at sending and receiving, through a virtual device that controls the traffic.
> Does ENet also use a congestion control? If so, which one?
Yes, it's CUBIC-like, I believe; the flow control is a fixed sliding window.
> Do you see higher delays in case of packet loss with ENet as well?
Yes, but they correspond to the expected numbers, unlike in the case with SCTP.
Found the reason why `sstat_primary.spinfo_srtt` was showing incorrect values like 3,000 ms while the actual RTT is 150 ms: explicit congestion notification, which is enabled by default (it also looks like the client connects faster when this option is disabled).
Still investigating why there's a huge delay between packet deliveries. The messages themselves arrive in the expected time, but they are held in the buffer for some reason rather than just being dispatched.
@nxrighthere Can you elaborate on how ECN affects the computation of the RTT? Does your network actually do ECN marking? How are you reading ECN markings?
> Does your network actually do ECN marking? How are you reading ECN markings?
None of that; it just affects the RTT calculation for some reason. If I set `usrsctp_sysctl_set_sctp_ecn_enable(0)`, then the RTT is calculated properly.
I don't think ECN should affect the correctness of RTT calculations. If it does, it is a bug I would like to fix. That is why I'm asking... Can you double check?
Never mind, it was just a coincidence; the problem is still there...
I am investigating SCTP using one-way delay measurements.
When I assign TTL partial reliability with a deadline of 140 ms, I still receive packets with a delay of >800 ms.
(Here I am using CMT, but the same results occur when using regular SCTP with only a backup path.)
Shouldn't usrsctp disallow the sending of packets that have already been in the queue too long? Is the PR deadline only evaluated during SACK processing? When is the deadline calculated? Does that happen right after the send() call, or is it calculated when the chunk moves from the send queue to the sent queue?
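Conceptually, what I would expect is a check like the following before a chunk leaves the send queue for the first time; this is a hypothetical sketch with made-up names (`queued_chunk`, `now_ms`), not the actual usrsctp code:

```c
#include <stdint.h>

/* Hypothetical sketch: the deadline is stamped when send() queues the
 * chunk and is checked again immediately before the first transmission,
 * not only during SACK processing. */
struct queued_chunk {
    uint64_t expires_at_ms; /* stamped at send() time from the TTL */
    /* ... payload, stream id, TSN, etc. ... */
};

/* Returns nonzero if the chunk should be abandoned instead of sent. */
static int expired_before_first_tx(const struct queued_chunk *ck,
                                   uint64_t now_ms)
{
    return now_ms >= ck->expires_at_ms;
}
```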
The SCTP PR extension (https://tools.ietf.org/html/rfc3758#page-15) states:
I use two GPS-time-synced Linux machines linked together with Gbit Ethernet links. On both links, in both directions, I use netem with a delay of 40 ms and a drop rate of 5 % (simple Gilbert model with probabilities good->bad: 5 % and bad->good: 75 %, so actually a PDR of about 6.5 %). There is an initial delay of ~42 ms, caused by gstreamer, between generating the timestamp and the send() call to the usrsctp socket. The minimum achievable delay is 82 ms...
I use the following socket options: