lynxthecat / cake-autorate

Eliminate the excess latency and jitter terrorizing your 4G, 5G, LTE, Starlink or other variable rate connection!
https://www.bufferbloat.net
GNU General Public License v2.0
268 stars 24 forks source link

Improve controller #248

Closed lynxthecat closed 1 year ago

lynxthecat commented 1 year ago

This pull request is intended to capture improvements to the controller including detection of bufferbloat and adjustments to the shaper rates. Addresses https://github.com/lynxthecat/cake-autorate/issues/225.

lynxthecat commented 1 year ago

@moeller0 how about this as a first step? Since your proposal to introduce proportional shaper rate reductions on bufferbloat detection outlined here: https://github.com/lynxthecat/cake-autorate/commit/6a8c740b8ec53d14b5db0b807ab53744ee31887e#commitcomment-12290585 involves working with the average delays (surely taken across the bufferbloat detection window), it makes sense to me as a first step to calculate the average download and upload delays taken across the bufferbloat detection window and use those to detect bufferbloat in each direction.

So we:

Here is some sample debug output I generated using an download test so you can see what the average delay values look like relative to the compensated average thresholds during a download speed test:

https://gist.github.com/lynxthecat/bbd0c57ce906560a68cb3d8ee1b6ece7

From testing it seems to work at least as well as the previous mechanism of evaluating whether X out of the last Y samples were delayed. In fact, it looks to me like this change alone already offers a reasonable improvement:

lynxthecat/cake-autorate mater:

https://www.waveform.com/tools/bufferbloat?test-id=41b29500-2f18-4bf6-99e4-85fcc91a2dd4

lynxthecat/cake-autorate improve-controller:

https://www.waveform.com/tools/bufferbloat?test-id=fe78f7ca-fcb7-4d26-8d96-3386f4e4dd2d

moeller0 commented 1 year ago

@moeller0 how about this as a first step? Since your proposal to introduce proportional shaper rate reductions on bufferbloat detection outlined here: 6a8c740#commitcomment-12290585 involves working with the average delays (surely taken across the bufferbloat detection window), it makes sense to me as a first step to calculate the average download and upload delays taken across the bufferbloat detection window and use those to detect bufferbloat in each direction.

I respectfully disagree, I want one change at a time with sufficient testing time to be able to evaluate get a feel whether a change helps or not... Also we take multiple reflectors as we do not really trust all of them (that is besides rate limiting and deprioritization specific reflectors can be affected by path congestion further away from our access link, and we really should not try to work around e.g. congestion in the nearest google data center...). The only act of X out of Y reflectors are above threshold (with ideally X == Y) really serves to only act if we have decent evidence for access path congestion. If we take the average and we stick to my example of google#s data center, and they have 1000ms delta delay (as the issue just occurred) then in all likelihood we will have an above threshold average. IMHO this can easily lead to situations where the averaging counter-acts our consensus mechanism.

detect bufferbloat if the download or upload average delay is greater than the download or upload compensated average delay threshold

As I tried ot exp0lain I would caution against that approach as it will counter act the reflector diversity we aim for to not try to counter act bufferbloat on too far a (single) link...

From testing it seems to work at least as well as the previous mechanism of evaluating whether X out of the last Y samples were delayed. In fact, it looks to me like this change alone already offers a reasonable improvement:

And that is why anecdotal testing is error prone, it is really hard to test the rare cases when the average is completely dominated by a single reflector...

lynxthecat/cake-autorate mater:

I like that, calling the default branch 'mater' has a nice latin ring to it and also reminds me of the main computer in alien...

lynxthecat commented 1 year ago

But @moeller0 we will rotate out any such bad reflectors already with our reflector health checking code.

Couple of extra runs:

lynxthecat/cake-autorate master:

https://www.waveform.com/tools/bufferbloat?test-id=939316be-fa36-4e14-850e-40966c97cb0d

lynxthecat/cake-autorate improve-controller:

https://www.waveform.com/tools/bufferbloat?test-id=e0184d0e-ccc5-4dcd-942a-bc194622e219

From my crude testing involving waveform tests and running pings with saturating load this change seems like an improvement.

moeller0 commented 1 year ago

On 28 July 2023 13:34:32 CEST, lynxthecat @.***> wrote:

But @moeller0 we will rotate out any bad reflectors already with our rotator health checking code.

This operates on a different timescale... as far as I am concerned the consensus over a diverse set of reflectors is one of the core ideas of autorate that proved its merit, I would be very careful in changing that and certainly not based on cursory testing alone. Such testing catches really bad ideas quickly but passing such a test does not tell much about the performance in edge cases, but it is these edge cases that affect how well this thing be'haves under unfortunate loads.

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

moeller0 commented 1 year ago

 

   

Gesendet: Freitag, 28. Juli 2023 um 15:50 Uhr Von: "lynxthecat" @.> An: "lynxthecat/cake-autorate" @.> Cc: "moeller0" @.>, "Mention" @.> Betreff: Re: [lynxthecat/cake-autorate] Improve controller (PR #248)

 

What tests would you recommend I try?

For subtle changes like replacing the minum over a set of reflectors with average over a set of reflectors I think you would need to run the new code for at leasat a fortnight. But here for me the question is easy to answer, using the minumum over a set makes theoretucal sense, taking the average not. Question: since we are actually discussing different changes why is this additoinal change so important fir you right now?

 

 

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

lynxthecat commented 1 year ago

OK @moeller0 you have convinced me.

So how about this now (please check the commit(s) / files changed in this pull request)?

I think this is pretty much exactly your idea.

Now:

Here are some example values from a speed test:

https://gist.github.com/lynxthecat/b92810b6cc1687d820c2cc5e380e8c4e

moeller0 commented 1 year ago

I think this is pretty much exactly your idea.

Hardly, this is just a dust-off of a proposal you made some time ago, I take responsibility for helping define the initial algorithm, but this is IMHO a cooperative idea.

lynxthecat commented 1 year ago

What do you think of these defaults:

# owd delta threshold in ms is the extent of OWD increase to classify as a delay
# these are automatically adjusted based on maximum on the wire packet size
# (adjustment significant at sub 12Mbit/s rates, else negligible)
dl_owd_delta_thr_ms=30.0 # (milliseconds)
ul_owd_delta_thr_ms=30.0 # (milliseconds)

# average owd delta threshold in ms at which maximum adjust_down_bufferbloat is applied
dl_avg_owd_delta_thr_ms=60.0 # (milliseconds)
ul_avg_owd_delta_thr_ms=60.0 # (milliseconds)
moeller0 commented 1 year ago

These likely need to be adapted for each link. It seems OK if you distribute values that work reasonably on your link and instruct users at seriously considering changing those... That said, I have a feel for how to set the xx_owd_delta_thr_ms values by looking at CDFs with and without load, but I have nointuition yet for the xx_avg_owd_delta_thr_ms values... my gut feeling is 60ms might be a bit high but realistically one needs to look at the actual max adjustment factor as well... This just needs some testing...

lynxthecat commented 1 year ago

So far so good @moeller0. This change in terms of bufferbloat reduction scaling seems to be working very well. I'm tempted to call it a day with these changes and pull them in to master. Then after a week or two more we could issue a v3.1.0 release with these new changes.

moeller0 commented 1 year ago

So far so good @moeller0. This change in terms of bufferbloat reduction scaling seems to be working very well.

So your intuition a while ago seems fully validated.

I'm tempted to call it a day with these changes and pull them in to master. Then after a week or two more we could issue a v3.1.0 release with these new changes.

As I said, you should expose them to enough testing to be sure you covered not only the simple test cases (like speed-tests), but after that the best way forward seems to broaden the testing base and if that means pushing to "mater" (still like this better than the alternatives) then so be it.

lynxthecat commented 1 year ago

@bairhys, or anyone else reading this, this pull request is very likely to get merged into 'master' unless there emerge specific issues with it. Would you mind switching over to it:

setup.sh lynxthecat/cake-autorate improve-controller 

And report back with any findings?

On my 4G connection I think it gives better performance in terms of managing the latency/bandwidth compromise.

If you previously specified a custom dl_delay_thr_ms or ul_delay_thr_ms those have changed to dl_owd_delta_thr_ms and ul_owd_delta_thr_ms, and in that case you may also want to specify your own values for dl_avg_owd_delta_thr_ms and ul_avg_owd_delta_thr_ms that are around twice whatever you set for the owd_delta_thr_ms values. Albeit the default values for both the owd delta thresholds and average owd delta thresholds (30ms and 60ms) will likely work fine on your connection too.

Also the DATA log lines have been tweaked for consistency and to further include the average owds and associated compensated thresholds (check out the new headers).

bairhys commented 1 year ago

Hi @lynxthecat, I am running this branch now. I made a new branch for my exporter to read the updated DATA and SUMMARY log lines. The rate of development of this project is really impressive!

lynxthecat commented 1 year ago

Thanks @bairhys! It helps that I use cake-autorate 24/7 for my own 4G connection and so naturally I want the best possible performance out of it myself. That has driven much of the change, but I remain grateful to @moeller0 because he has always taken an active interest and helped to guide things in the right direction.

Since the project seems to be in a pretty mature state now, I anticipate that the rate of change will slow down. We have something that works pretty well now for multiple different connection types and specifics. And we have kept it very customisable.

Any comment on the performance of this branch? In any case, I'd love to see some of your fancy plots with the new elements. Could you possibly post such plots here?

@moeller0 any chance you could be convinced to modify the Octave plotter to read in and plot the new elements - see: https://github.com/lynxthecat/cake-autorate/commit/22bf566cdaa7f20f52dd510f48ff4b641400d82b?

bairhys commented 1 year ago

I did some back to back testing comparing v3.0.1 to this branch improve-controller and later no sqm.

Testing setup was

Results

All results summarized here with a score given to all tests.

I defined one as overall score download speed upload speed / (download lat upload lat) Then another as a throughput score download speed upload speed And last as latency score download lat upload lat

V3.0.1
dl up dl lat ul lat Rating 1 (dl*ul/dl_lat/ul_lat) higher better Rating 2 (dl*ul) higher better Rating 3 (dl_lat*ul_lat) lower better
66.7 5.64 17 14 1.5806218487395 376.188 238
52.3 5.69 14 15 1.41708095238095 297.587 210
65.8 8.25 18 14 2.15416666666667 542.85 252
66.5 8.23 16 12 2.85049479166667 547.295 192
TOTAL 8.00236425945378 1763.92 892
improve-controller
73 5.7 19 13 1.68461538461538 416.1 247
70.1 7.84 18 13 2.34864957264957 549.584 234
68.6 8.25 16 14 2.5265625 565.95 224
66.7 8.56 17 14 2.39895798319328 570.952 238
TOTAL 8.95878544045823 2102.586 943
No-sqm
dl up dl lat ul lat
79.1 8.31 62 234 0.0453074855252275 657.321 14508

The extra latency between the improve-controller and v3.0.1 tests are similar even though v3.0.1 is a bit better.

improve-controller had slightly better throughput initially then they started to converge.

Overall the difference in additional latency is minimal so based on throughput improve-controller looks to have a slight edge but need more tests to confirm. I could see this test becoming automated to give better and less time consuming long term test results.

raw data

test 1 v3.0.1 image

test 2 improve-controller image

test 3 v3.0.1 image

test 4 improve-controller image

test 5 v3.0.1 image

test 6 improve-controller image

test 7 v3.0.1 image

test 8 improve-controller image

test 9 no-sqm image

#!/bin/bash

# *** STANDARD CONFIGURATION OPTIONS ***

output_processing_stats=1   # enable (1) or disable (0) output monitoring lines showing processing stats
output_load_stats=1         # enable (1) or disable (0) output monitoring lines showing achieved loads
output_reflector_stats=1    # enable (1) or disable (0) output monitoring lines showing reflector stats
output_cake_changes=1       # enable (1) or disable (0) output monitoring lines showing cake bandwidth changes
debug=1             # enable (1) or disable (0) out of debug lines

### For multihomed setups, it is the responsibility of the user to ensure that the probes
### sent by this instance of cake-autorate actually travel through these interfaces.
### See ping_extra_args and ping_prefix_string

dl_if=ifb4wan  # download interface
ul_if=wan     # upload interface

# Set either of the below to 0 to adjust one direction only
# or alternatively set both to 0 to simply use cake-autorate to monitor a connection
adjust_dl_shaper_rate=1 # enable (1) or disable (0) actually changing the dl shaper rate
adjust_ul_shaper_rate=1 # enable (1) or disable (0) actually changing the ul shaper rate

min_dl_shaper_rate_kbps=30000  # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=50000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=100000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=4000  # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=6000 # steady state bandwidth for upload (KBit/s)
max_ul_shaper_rate_kbps=15000  # maximum bandwidth for upload (Kbit/s)

# *** OVERRIDES ***

randomize_reflectors=0

pinger_binary=tsping
bairhys commented 1 year ago

Here are some plots, the new DATA columns are near the bottom

test

lynxthecat commented 1 year ago

Looks good to me. You could try playing around with the defaults such as by lowering the new average owd delta thresholds if you wanted to reduce latency at the expense of bandwidth (and the old behaviour can be obtained by just setting those values to 0). But the new regime and defaults look to me like they are working well on your connection.