private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

[RTT] Miscalculation of rtt on server side #1601

Closed: fbussery closed this issue 6 months ago

fbussery commented 9 months ago

Picoquic version: 2023-09-20 8:58
Test environment: simulated ADSL network

The simulated ADSL network:

The observed problem:

Sometimes we notice that the min RTT is very low (around 500 microseconds). This is definitely not expected. The impact is that BBR misestimates the bandwidth.

I attached the log: qlog-05944aef04573529.server.log

On line 208 of the qlog, we have:

```
[12904429, "recovery", "metrics_updated", {"cwnd": 5760,"bytes_in_flight": 0,"smoothed_rtt": 105856,"min_rtt": 552,"latest_rtt": 552}],
```

Expected behaviour

We expect the RTT to be above 120000 microseconds (120 ms) at all times.

I could not reproduce the issue using picoquicdemo perf, which always returns a correct rtt_min (120 ms).
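To illustrate why such a low min_rtt matters, here is a back-of-the-envelope sketch (illustrative numbers and code, not picoquic's actual BBR implementation) of how a bandwidth-delay-product computation collapses when min_rtt is wrong by that much:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustration: a BBR-style congestion window is sized from the
 * estimated bottleneck bandwidth times the minimum RTT (the
 * bandwidth-delay product). The RTT values match the qlog above;
 * the bandwidth is an assumed figure for the sake of the example. */
int main(void)
{
    uint64_t bw_bytes_per_sec = 1000000;  /* assumed ~8 Mbps bottleneck */
    uint64_t real_min_rtt_us = 120000;    /* the true path RTT, 120 ms */
    uint64_t bogus_min_rtt_us = 552;      /* the value seen in the qlog */

    uint64_t good_bdp = bw_bytes_per_sec * real_min_rtt_us / 1000000;
    uint64_t bad_bdp = bw_bytes_per_sec * bogus_min_rtt_us / 1000000;

    printf("BDP with correct min_rtt: %llu bytes\n", (unsigned long long)good_bdp);
    printf("BDP with bogus min_rtt:   %llu bytes\n", (unsigned long long)bad_bdp);
    /* 120000 bytes vs 552 bytes: a window roughly 217x too small,
     * so the sender can never fill the pipe. */
    return 0;
}
```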

fbussery commented 9 months ago

Further tests show that the problem may be the accuracy of picoquic_current_time on my system. I also tested with CLOCK_MONOTONIC, but saw similar behaviour.
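For anyone who wants to run the same kind of check, here is a small self-contained sketch (test code, not part of picoquic) that reports the advertised resolution of CLOCK_MONOTONIC and the smallest step gettimeofday() actually produces:

```c
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void)
{
    /* Advertised resolution of the monotonic clock. */
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) == 0) {
        printf("CLOCK_MONOTONIC resolution: %ld ns\n", res.tv_nsec);
    }

    /* Smallest non-zero step observable from gettimeofday(). */
    struct timeval a, b;
    gettimeofday(&a, NULL);
    do {
        gettimeofday(&b, NULL);
    } while (a.tv_sec == b.tv_sec && a.tv_usec == b.tv_usec);
    long step_us = (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_usec - a.tv_usec);
    printf("gettimeofday smallest observed step: %ld us\n", step_us);
    return 0;
}
```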

fbussery commented 9 months ago

It seems that there is no bug in picoquic concerning the RTT. I checked the sequence in picoquic and it is correct. My simulation setup may be malfunctioning. I will close the ticket tomorrow.

huitema commented 9 months ago

On Linux and macOS, the code indeed uses "gettimeofday", which does provide a time in microseconds but does not exactly specify the resolution. I think it does work well, and has the advantage of being portable. There is a drawback, per the history section of the man page: "The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the system time)."

Given that, yes, using clock_gettime() might be safer. We just have to verify portability, and also the availability of absolute time so logs on several machines can be synchronized. If you have a PR, I will review it.
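For concreteness, the two options look roughly like this on POSIX systems (a simplified sketch, not the actual picoquic_current_time() code):

```c
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Roughly the current Unix approach: microsecond wall-clock time from
 * gettimeofday(), which an admin action or an NTP step can move. */
uint64_t wall_time_us(void)
{
    struct timeval tv;
    (void)gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000ull + (uint64_t)tv.tv_usec;
}

/* The candidate replacement: CLOCK_MONOTONIC never jumps, but its
 * epoch is arbitrary, so values cannot be compared across machines --
 * which is the log-synchronization concern mentioned above. */
uint64_t monotonic_time_us(void)
{
    struct timespec ts;
    (void)clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ull + (uint64_t)(ts.tv_nsec / 1000);
}
```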

fbussery commented 9 months ago

Just to clarify, there is no problem with the clock on my side. I finally found out that some of my nodes were starting traffic shaping late after boot. I will provide the merge request anyway. My only fear is that my "bad" setup gave my connection a very small rtt_min, since it was estimated before the traffic shaping kicked in. I fear this can induce erroneous evaluations in the congestion algorithm. I remember, for instance, that I had a problem with the NewReno algorithm, which wasn't able to converge correctly to the expected bitrate (I think this was the reason). I wonder if such behaviour could happen in real life (for instance on Wi-Fi or a cell phone)?

huitema commented 9 months ago

What kind of traffic shaping? We already have tests of BBR versus traffic shaping, assuming a shaper enforcing a max data rate with a leaky bucket algorithm. Should we add tests for your particular scenario, in which the shaper only kicks in after a while? Or is it enough to simulate something simpler, e.g., change the data rate of the connection midway?
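For reference, the shaper in those tests is conceptually a credit (leaky bucket) scheme along these lines (an illustrative sketch with made-up names, not the actual test code):

```c
#include <stdint.h>

typedef struct {
    uint64_t bytes_per_second;  /* shaped data rate */
    uint64_t bucket_max;        /* burst allowance, in bytes */
    uint64_t bucket_bytes;      /* current credit */
    uint64_t last_update_us;    /* time of the last refill */
} shaper_t;

/* Returns 1 if a packet of 'size' bytes may depart at 'now_us',
 * 0 if it must wait for more credit to accumulate. */
int shaper_try_send(shaper_t* s, uint64_t size, uint64_t now_us)
{
    /* Refill credit in proportion to elapsed time, capped at the
     * burst allowance. */
    uint64_t credit = (now_us - s->last_update_us) * s->bytes_per_second / 1000000;
    s->bucket_bytes += credit;
    if (s->bucket_bytes > s->bucket_max) {
        s->bucket_bytes = s->bucket_max;
    }
    s->last_update_us = now_us;

    if (s->bucket_bytes >= size) {
        s->bucket_bytes -= size;
        return 1;
    }
    return 0;
}
```

The scenario described in this thread would amount to bypassing shaper_try_send() entirely for the first few seconds, then enforcing it.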

fbussery commented 9 months ago

I need to check with the guy in charge of the simulator; he's on holiday now. My assumption is that our simulator starts the shaping after a couple of seconds, so we switch from an ultra-fast local network to a simulated ADSL line. This is definitely not a normal use case, so there is no reason for a specific test scenario for it. On our side, we plan to build scenarios with fluctuating bitrate and latency; I assume we may encounter this kind of scenario with Wi-Fi in real life. And my concern is that we use min_rtt in a lot of places, like https://github.com/private-octopus/picoquic/blob/master/picoquic/bbr.c#L780, which could in some cases be a very outdated value.

huitema commented 9 months ago

That points to an open research topic -- how to determine that the link state has changed. BBR does that with a 10 second timeout, which is not quite optimal. It would be great if we could find some reliable test, such as maybe checking some combination of RTT, measured throughput and packet loss (or packet marking rate).
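For concreteness, the 10-second behaviour amounts to a windowed minimum filter roughly like this (an illustrative sketch, not the bbr.c code):

```c
#include <stdint.h>

#define RTPROP_WINDOW_US 10000000ull  /* BBR's ~10 second window */

typedef struct {
    uint64_t min_rtt_us;        /* initialize to UINT64_MAX */
    uint64_t min_rtt_stamp_us;  /* time the current minimum was seen */
} rtprop_filter_t;

/* Accept a new RTT sample; when the stored minimum is older than the
 * window, take the new sample even if it is larger. The open question
 * is whether expiry alone is enough, or whether throughput and loss
 * signals should also trigger a reset. */
void rtprop_update(rtprop_filter_t* f, uint64_t rtt_us, uint64_t now_us)
{
    int expired = (now_us - f->min_rtt_stamp_us) > RTPROP_WINDOW_US;
    if (rtt_us <= f->min_rtt_us || expired) {
        f->min_rtt_us = rtt_us;
        f->min_rtt_stamp_us = now_us;
    }
}
```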

fbussery commented 9 months ago

I attached two small patches here: one for the time accuracy, and another very small one to avoid a warning with valgrind. patch-issue-1601.zip

huitema commented 9 months ago

I needed to take a closer look at time and clocks before using the "monotonic" clock.

Picoquic uses a single clock for two reasons:

1) Computing delays between network events, such as RTT, timers, etc.
2) Checking some variables against wall clock, e.g., checking the validity of TLS tickets, X.509 certificates, etc.

The monotonic clock is better than the wall clock for the first usage, because the wall clock can be set ahead or backward by the admin. (Both can be affected by adjtime() and by NTP.) But the monotonic clock is inadequate for the second usage -- it makes no guarantee about a fixed start point in time.

For now, the simple solution may well be to keep gettimeofday. Updated: the function picoquic_get_tls_time() is already defined and returns the time "as seen by TLS". This function should be used in the "ticket" and "token" code instead of "current_time". Once that is done, we can switch to CLOCK_MONOTONIC or maybe CLOCK_BOOTTIME, and thus protect the code against jumps in the clock.
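A minimal sketch of that split on POSIX (function names are illustrative, apart from the role that picoquic_get_tls_time() already plays):

```c
#include <stdint.h>
#include <time.h>

/* Network delays (RTT, timers): monotonic time, immune to admin clock
 * changes. CLOCK_BOOTTIME, where available (Linux), also keeps counting
 * across system suspend. */
uint64_t delay_clock_us(void)
{
    struct timespec ts;
#if defined(CLOCK_BOOTTIME)
    (void)clock_gettime(CLOCK_BOOTTIME, &ts);
#else
    (void)clock_gettime(CLOCK_MONOTONIC, &ts);
#endif
    return (uint64_t)ts.tv_sec * 1000000ull + (uint64_t)(ts.tv_nsec / 1000);
}

/* Wall-clock checks (tickets, tokens, certificates): real time, the
 * role picoquic_get_tls_time() fills, sketched with CLOCK_REALTIME. */
uint64_t tls_clock_us(void)
{
    struct timespec ts;
    (void)clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000000ull + (uint64_t)(ts.tv_nsec / 1000);
}
```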