quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

RTT calculation of connection is pretty unreliable when _just_ using the library as is #1902

Closed. StarStarJ closed this issue 1 week ago

StarStarJ commented 1 week ago

Hello everyone,

I am upgrading my app to use QUIC, and I want to use the connection's network stats to replace the custom ping-pong RTT calculation I currently use. I saw that this wonderful library already has such network stats implemented, which is pretty neat, since it saves me quite some work.

However, the RTT calculation seems pretty unreliable. To make sure it's not a problem I caused, I changed this crate's example code and saw similar problems; it seems to be related to the number of packets sent/received:

Here is my small change; I basically just put a loop around the client code and printed the RTT information: https://github.com/StarStarJ/quinn/commit/18c16b6944b9546f141a45b1878bf1bd96ad4ce5?diff=unified&w=1

The important change is: https://github.com/StarStarJ/quinn/commit/18c16b6944b9546f141a45b1878bf1bd96ad4ce5?diff=unified&w=1#diff-c1480185b8a920b105ec923f9a63194786ac1688fe25fb07611c9a640ba9a194R153
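In spirit, the change boils down to something like the following (simplified sketch, not the exact diff; the request payload and error handling are placeholders):

```rust
use quinn::Connection;
use std::time::Duration;

// Simplified sketch of the change: loop the request on an existing
// connection, print the RTT estimate, and sleep between iterations.
// (Stream finish/shutdown is omitted for brevity.)
async fn rtt_probe_loop(connection: Connection) -> anyhow::Result<()> {
    loop {
        let (mut send, mut recv) = connection.open_bi().await?;
        send.write_all(b"GET /index.html\r\n").await?;
        let _response = recv.read_to_end(usize::MAX).await?;

        println!("rtt estimate: {:?}", connection.rtt());

        // this is the sleep whose value (500 vs. 20) changes the reported RTT
        std::thread::sleep(Duration::from_millis(500));
    }
}
```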

For example changing the sleep value from 500 to 20 significantly decreases the RTT value.

Now the question is: do I have to call some function on the connection to keep this value realistic, or is this a bug? Since the value only decreases the lower the timeout is, it kind of feels like I must call a function on the connection as soon as possible. Maybe quinn only sends acks if there is active polling on that connection?

In any case, it would be nice to have an example of how to get the best possible RTT estimate with this library, since from the docs + examples alone I couldn't figure out where my thinking goes wrong here.

Thanks in advance

Ralith commented 1 week ago

You are getting bad behavior because you are calling a blocking function, std::thread::sleep, in async code. Never do this. See https://ryhl.io/blog/async-what-is-blocking/.

StarStarJ commented 1 week ago

No, that is not the reason, and your conclusion doesn't hold:

  1. If I have heavy calculations that can take 500ms of CPU time (on this single thread at least), that is exactly what tokio's rt-multi-thread feature is for.
  2. Replacing std::thread::sleep(Duration::from_millis(500)); with tokio::time::sleep(Duration::from_millis(500)).await; still shows the exact same problem.

The question still stands: what is needed to get the right behavior? I also tried moving the sending part into a tokio task (tried delaying the sending process instead, tried waiting for the previous sending task after opening a new bidirectional stream, etc.) so that open_bi is called as soon as possible, which also didn't fix it.
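Roughly, the structure I tried looked like this (simplified sketch, not the exact code):

```rust
use quinn::Connection;
use std::time::Duration;

// Sketch: spawn the open_bi/send/receive round trip on its own task so it
// gets polled as soon as possible, and sample the RTT estimate separately.
async fn probe_once(connection: Connection) {
    let conn = connection.clone();
    let send_task = tokio::spawn(async move {
        let (mut send, mut recv) = conn.open_bi().await.unwrap();
        send.write_all(b"ping").await.unwrap();
        let _resp = recv.read_to_end(64 * 1024).await.unwrap();
    });

    // non-blocking sleep so the runtime keeps driving the connection
    tokio::time::sleep(Duration::from_millis(20)).await;
    println!("rtt estimate: {:?}", connection.rtt());
    send_task.await.unwrap();
}
```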

Ralith commented 1 week ago

If I have heavy calculations that can take 500ms of CPU time (on this single thread at least), that is exactly what tokio's rt-multi-thread feature is for.

No, it isn't. Please read the article I linked, which directly addresses this misconception.

Replacing std::thread::sleep(Duration::from_millis(500)); with tokio::time::sleep(Duration::from_millis(500)).await; still shows the exact same problem.

It seems to work okay to me. What about the behavior seems not "right" to you?

StarStarJ commented 1 week ago

For example changing the sleep value from 500 to 20 significantly decreases the RTT value.

It's about 1ms off just from changing the sleep value. My real app seems to suffer even more: it gets very inconsistent and jumps between a few microseconds and a few milliseconds. (The client sends packets irregularly, sometimes more, sometimes fewer; that's why I simulated the same thing with the sleep call, which at least showed similar problems.)

Ralith commented 1 week ago

tokio timer precision is 1ms, so variation on that scale is expected at the bare minimum, to say nothing of any variation in actual network path latency and peer load. OS timer precision is even worse (~15ms?) if you're on Windows and you don't make some special winapi calls.
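As a rough illustration of that granularity (exact numbers depend on OS and load, so treat this as a sketch rather than a benchmark), you can ask tokio for a sub-millisecond sleep and measure what you actually get:

```rust
use std::time::{Duration, Instant};

// Request a 100µs sleep a few times and print the actual elapsed time;
// with tokio's ~1ms timer granularity the result will be noticeably longer.
#[tokio::main]
async fn main() {
    for _ in 0..5 {
        let start = Instant::now();
        tokio::time::sleep(Duration::from_micros(100)).await;
        println!("asked for 100µs, slept {:?}", start.elapsed());
    }
}
```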

StarStarJ commented 1 week ago

Mh ok, I am on Linux.

If I do ping -i 1 localhost or ping -i 0.1 localhost I get very stable ping times, so I was surprised it differed so much depending on the packet load.

I guess I'll simply not rely on the rtt values being very precise (at least not to that extent), even though I have to say 15ms sounds like a huge bug to me. In my app I already see differences/jitter of up to ~8ms (on average 2-3ms) using the quinn rtt value, which is longer than a frame at my screen's refresh rate.

Anyway, thanks a lot for your time.

Ralith commented 1 week ago

I get very stable ping times, so I was surprised it differed so much depending on the packet load.

ping has different design priorities than tokio. For example, using limited timer precision allows for scheduling and resetting large numbers of concurrent timers to be very efficient, which can drastically improve server performance. If you really want you could drop in your own high-precision timer (and pay higher CPU cost to do so), but for most applications this is not worth the effort: network applications should be designed to tolerate quite a bit more latency jitter than you have reported; e.g. wifi alone can routinely delay packets by 100+ms.
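Purely to illustrate that trade-off (this is not a quinn API, just a generic sketch), a higher-precision wait typically means a coarse sleep followed by a busy-wait, which is exactly where the extra CPU cost comes from:

```rust
use std::time::{Duration, Instant};

// Sketch of a higher-precision wait: let the coarse (~1ms) timer cover most
// of the duration, then spin for the final stretch. The spin burns CPU on
// the current thread, which is why this is rarely worth it in practice.
async fn precise_sleep(duration: Duration) {
    let deadline = Instant::now() + duration;
    if duration > Duration::from_millis(2) {
        tokio::time::sleep(duration - Duration::from_millis(2)).await;
    }
    while Instant::now() < deadline {
        std::hint::spin_loop();
    }
}
```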

I have to say 15ms sounds like a huge bug to me

Yes, Windows is very idiosyncratic in this respect. See https://github.com/tokio-rs/tokio/issues/5021.

In my app I already see differences/jitter of up to ~8ms

Make sure that you are using a release build and not doing computationally demanding or otherwise blocking work on networking threads.
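For example, CPU-heavy work can be handed off to a blocking pool rather than run inline on the task driving the connection (sketch only; expensive_computation is a placeholder):

```rust
// Sketch: move CPU-bound work onto tokio's blocking pool so the async
// worker threads driving the QUIC connection are never stalled by it.
async fn handle_request(input: Vec<u8>) -> Vec<u8> {
    tokio::task::spawn_blocking(move || expensive_computation(&input))
        .await
        .expect("blocking task panicked")
}

fn expensive_computation(input: &[u8]) -> Vec<u8> {
    // placeholder for work that would otherwise block the runtime
    input.iter().rev().copied().collect()
}
```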