lynxthecat / cake-autorate

Eliminate the excess latency and jitter terrorizing your 4G, 5G, LTE, Starlink or other variable rate connection!
https://www.bufferbloat.net
GNU General Public License v2.0

Consider alternatives for update_shaper_rate #225

Closed: lynxthecat closed this issue 8 months ago

lynxthecat commented 1 year ago

@moeller0 and @patrakov our latest shaper rate adjustment code is:

https://github.com/lynxthecat/cake-autorate/blob/7350fc06db65f75bc236a644cacefb1efd7f4bc6/cake-autorate.sh#L359

update_shaper_rate()
{
    local direction="${1}" # 'dl' or 'ul'

    case "${load_condition["${direction}"]}" in

        # upload Starlink satellite switching compensation, so drop down to minimum rate for upload through switching period
        ul*sss)
            shaper_rate_kbps["${direction}"]="${min_shaper_rate_kbps[${direction}]}"
            ;;
        # download Starlink satellite switching compensation, so drop down to base rate for download through switching period
        dl*sss)
            shaper_rate_kbps["${direction}"]=$(( shaper_rate_kbps["${direction}"] > base_shaper_rate_kbps["${direction}"] ? base_shaper_rate_kbps["${direction}"] : shaper_rate_kbps["${direction}"] ))
            ;;
        # bufferbloat detected, so decrease the rate providing not inside bufferbloat refractory period
        *bb*)
            if (( t_start_us > (t_last_bufferbloat_us["${direction}"]+bufferbloat_refractory_period_us) ))
            then
                adjusted_achieved_rate_kbps=$(( (achieved_rate_kbps["${direction}"]*achieved_rate_adjust_down_bufferbloat)/1000 )) 
                adjusted_shaper_rate_kbps=$(( (shaper_rate_kbps["${direction}"]*shaper_rate_adjust_down_bufferbloat)/1000 )) 
                shaper_rate_kbps["${direction}"]=$(( adjusted_achieved_rate_kbps > min_shaper_rate_kbps["${direction}"] && adjusted_achieved_rate_kbps < adjusted_shaper_rate_kbps ? adjusted_achieved_rate_kbps : adjusted_shaper_rate_kbps ))
                t_last_bufferbloat_us["${direction}"]="${EPOCHREALTIME/./}"
            fi
            ;;
        # high load, so increase rate providing not inside bufferbloat refractory period
        *high*)
            if (( t_start_us > (t_last_bufferbloat_us["${direction}"]+bufferbloat_refractory_period_us) ))
            then
                shaper_rate_kbps["${direction}"]=$(( (shaper_rate_kbps["${direction}"]*shaper_rate_adjust_up_load_high)/1000 ))
            fi
            ;;
        # low or idle load, so determine whether to decay down towards base rate, decay up towards base rate, or set as base rate
        *low*|*idle*)
            if (( t_start_us > (t_last_decay_us["${direction}"]+decay_refractory_period_us) ))
            then

                if ((shaper_rate_kbps["${direction}"] > base_shaper_rate_kbps["${direction}"]))
                then
                    decayed_shaper_rate_kbps=$(( (shaper_rate_kbps["${direction}"]*shaper_rate_adjust_down_load_low)/1000 ))
                    shaper_rate_kbps["${direction}"]=$(( decayed_shaper_rate_kbps > base_shaper_rate_kbps["${direction}"] ? decayed_shaper_rate_kbps : base_shaper_rate_kbps["${direction}"]))
                elif ((shaper_rate_kbps["${direction}"] < base_shaper_rate_kbps["${direction}"]))
                then
                    decayed_shaper_rate_kbps=$(( (shaper_rate_kbps["${direction}"]*shaper_rate_adjust_up_load_low)/1000 ))
                    shaper_rate_kbps["${direction}"]=$(( decayed_shaper_rate_kbps < base_shaper_rate_kbps["${direction}"] ? decayed_shaper_rate_kbps : base_shaper_rate_kbps["${direction}"]))
                fi

                t_last_decay_us["${direction}"]="${EPOCHREALTIME/./}"
            fi
            ;;
        *)
            log_msg "ERROR" "unknown load condition: ${load_condition[${direction}]} in update_shaper_rate"
            kill $$ 2>/dev/null
            ;;
    esac
    # make sure to only return rates between min_shaper_rate_kbps and max_shaper_rate_kbps
    ((shaper_rate_kbps["${direction}"] < min_shaper_rate_kbps["${direction}"])) && shaper_rate_kbps["${direction}"]="${min_shaper_rate_kbps[${direction}]}"
    ((shaper_rate_kbps["${direction}"] > max_shaper_rate_kbps["${direction}"])) && shaper_rate_kbps["${direction}"]="${max_shaper_rate_kbps[${direction}]}"
}

Let's continue the discussion about shaper rate control alternatives from #115 here.

lynxthecat commented 1 year ago

@patrakov copying my post from #115 here:

Would you be able to summarize the rate control logic you tested? I had a very quick look, but it seems a summary would be helpful.

blindly trusts the achieved rate pre-shaper as something that the link definitely can support, even during bufferbloat periods, for the purpose of never setting the shaper below let's say 90% of that

I can't wrap my head around what that means in terms of how the shaper rate is set. Would you be able to copy/paste the code from the update_shaper_rate (formerly get_next_shaper_rate) function?

lynxthecat commented 1 year ago

Also @moeller0 wrote:

Just a reminder: earlier, I proposed a change (now officially rejected) that, for the download direction, blindly trusts the achieved rate pre-shaper as something that the link definitely can support, even during bufferbloat periods, for the purpose of never setting the shaper below let's say 90% of that. @lynxthecat can you compare?

What the current code does is take the minimum of "current shaper rate × factor1" and "last achieved rate × factor2". Your proposal is to jettison the first term. I think this is wrong, because the achieved rate necessarily measures the past: if, say, we have a step-like reduction from 100 to 10 "speed units", the last achieved rate will (to simplify) be 100, which after the step is clearly incorrect... My mental model is that we base our decision mainly on the current shaper rate, and take the achieved rate into account only if that would cause a steeper rate reduction... exactly because the achieved rate is by necessity looking into the past.
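
To make moeller0's point concrete, a toy calculation with made-up numbers (factors scaled by 1000 as in the shell arithmetic above):

# toy numbers: capacity steps from 100 down to 10 "speed units";
# the achieved rate was measured before the step, so it still reads 100
shaper_rate=100; achieved_rate=100; factor=900   # 0.9 scaled by 1000

adjusted_shaper=$(( (shaper_rate*factor)/1000 ))       # 90
adjusted_achieved=$(( (achieved_rate*factor)/1000 ))   # 90

# min policy: next shaper rate = min(90, 90) = 90, and on subsequent
# bufferbloat detections it keeps ratcheting down (81, 72, ...) because
# the first term follows the shaper rate, not the stale measurement.
# floor-at-90%-of-achieved policy: the shaper cannot drop below 90
# until the achieved-rate measurement itself catches up with the step.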

patrakov commented 1 year ago

I am not a user of cake-autorate anymore, but here is the patch that I used in the past:

diff --git a/cake-autorate.sh b/cake-autorate.sh
index d1f0e5f..366cbf8 100755
--- a/cake-autorate.sh
+++ b/cake-autorate.sh
@@ -309,8 +309,7 @@ get_next_shaper_rate()
        *bb*)
            if (( t_next_rate_us > (t_last_bufferbloat_us+bufferbloat_refractory_period_us) )); then
                adjusted_achieved_rate_kbps=$(( (achieved_rate_kbps*achieved_rate_adjust_down_bufferbloat)/1000 )) 
-               adjusted_shaper_rate_kbps=$(( (shaper_rate_kbps*shaper_rate_adjust_down_bufferbloat)/1000 )) 
-               shaper_rate_kbps=$(( adjusted_achieved_rate_kbps > min_shaper_rate_kbps && adjusted_achieved_rate_kbps < adjusted_shaper_rate_kbps ? adjusted_achieved_rate_kbps : adjusted_shaper_rate_kbps ))
+               shaper_rate_kbps=$(( adjusted_achieved_rate_kbps >= min_shaper_rate_kbps ? adjusted_achieved_rate_kbps : shaper_rate_kbps ))
                t_last_bufferbloat_us=${EPOCHREALTIME/./}
            fi
            ;;
@@ -1405,30 +1404,10 @@ fi

 # Initialize rx_bytes_path and tx_bytes_path if not set
 if [[ -z "${rx_bytes_path:-}" ]]; then
-   case "${dl_if}" in
-       veth*)
-           rx_bytes_path="/sys/class/net/${dl_if}/statistics/tx_bytes"
-           ;;
-       ifb*)
-           rx_bytes_path="/sys/class/net/${dl_if}/statistics/tx_bytes"
-           ;;
-       *)
-           rx_bytes_path="/sys/class/net/${dl_if}/statistics/tx_bytes"
-           ;;
-   esac
+   rx_bytes_path="/sys/class/net/${ul_if}/statistics/rx_bytes"
 fi
 if [[ -z "${tx_bytes_path:-}" ]]; then
-   case "${ul_if}" in
-       veth*)
-           tx_bytes_path="/sys/class/net/${ul_if}/statistics/rx_bytes"
-           ;;
-       ifb*)
-           tx_bytes_path="/sys/class/net/${ul_if}/statistics/rx_bytes"
-           ;;
-       *)
-           tx_bytes_path="/sys/class/net/${ul_if}/statistics/tx_bytes"
-           ;;
-   esac
+   tx_bytes_path="/sys/class/net/${ul_if}/statistics/tx_bytes"
 fi

 if ((debug)) ; then

Of course, the change to get_next_shaper_rate() is wrong as-is - the new logic should have been applied to the download direction only.

lynxthecat commented 1 year ago

I finally tried this, but without having adjusted the other settings, the performance was not satisfactory: too much latency let through for too little bandwidth gained. I did try tweaking achieved_rate_adjust_down_bufferbloat by setting it to 0.8, but this didn't help enough.

I remain open to the possibility that a better approach exists, and am keen to explore alternatives.

patrakov commented 1 year ago

Could you please be a bit more specific on how much latency and throughput you get with and without the change?

patrakov commented 1 year ago

Also I think that the average throughput is not the right metric to use for evaluation of my patch. Latency - sure, it's (to a reasonable extent) the whole point. For throughput, we should be looking for the percentage of time where the link is wrongly shaped below the acceptable limit for the video chat. In other words, how fast it recovers from short events when it REALLY cannot support the video chat.
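
For what it's worth, that percentage could be computed after the fact from a rate log; a rough sketch, assuming a hypothetical CSV with the download shaper rate in kbps in column 2 (the actual cake-autorate log format differs, and the 2500 kbps threshold is purely illustrative):

# fraction of samples where the shaper was set below what a video call needs
awk -F, -v thr=2500 '
    { total++; if ($2 < thr) starved++ }
    END { if (total) printf "shaped below threshold %.1f%% of samples\n", 100*starved/total }
' shaper_rate_log.csv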

lynxthecat commented 1 year ago

Sure, here is a waveform test with the original code:

https://www.waveform.com/tools/bufferbloat?test-id=e1c6f7be-d2c9-47b5-8768-8118f458849c

And one with the adjustment - just ignore upload since I didn't special case that:

https://www.waveform.com/tools/bufferbloat?test-id=99adce61-bf49-447c-bd13-6a47bc0bb421

@patrakov I am keen to explore different options for the rate controller and default settings to see if performance can be improved.

patrakov commented 1 year ago

Thanks for sharing. I guess, for a completely fair comparison, I would also need a waveform bufferbloat test without cake-autorate at all.

But I think now that the change is not applicable to your link. Note that in my setup, I had different settings related to ping rates and rate adjustment factors. With a higher ping rate, cake-autorate on your link recovers by itself much faster than on mine, but, due to the ISP limiting ICMP rates, I cannot afford that.

Regarding the quick recovery, here is what I mean.

Without the change:

Welcome to the whole-company meeting, and, as our internal meeting policy dictates, please turn on your cameras. Does anyone have any urgent items not in the agenda? OK, no such things. The first topic to discuss would be quack quack quack quack quack quack quack quack quack quack quack quack

I.e., perfect until the first bufferbloat event after which it never recovers.

With the change:

Welcome to the whole-company meeting, and, as our internal meeting policy dictates, please turn on your cameras. Does anyone have any urgent items not in the agenda? OK, no such things. The first topic to discuss would be quack quack quack quack quack by Igor. Igor, please keep it short, no more than three minutes total.

Without cake-autorate at all:

Welcome to the whole-company meeting, and, as our internal meeting policy dictates, please turn on your cameras. Does anyone have any urgent items not in the agenda? OK, no such things. The first topic to discuss would be quack quack report... for... the... [multi-second latency!!!] past week by Igor. Igor, please keep it short, no more than three minutes total.

lynxthecat commented 1 year ago

I wonder if the new ewma code I have been experimenting with would help in your case? Now, we retain an ewma of the achieved rate and use that in the controller:

[image: graph of the achieved rate and its EWMA]

This is using a pretty conservative alpha (0.2), but a more aggressive alpha might help smooth over temporary drops in achieved rate on your link that are not reflective of connection capacity, given burstiness?
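
For reference, the kind of update meant here is a standard exponentially weighted moving average; a minimal sketch in shell integer arithmetic with alpha scaled by 1000 (names are illustrative, not the actual branch code):

ewma_alpha=200            # alpha = 0.2 scaled by 1000
achieved_rate_ewma_kbps=0 # running estimate

update_achieved_rate_ewma()
{
    local new_rate_kbps="${1}"
    # seed with the first sample, then: ewma = alpha*new + (1-alpha)*ewma
    if (( achieved_rate_ewma_kbps == 0 ))
    then
        achieved_rate_ewma_kbps="${new_rate_kbps}"
    else
        achieved_rate_ewma_kbps=$(( (ewma_alpha*new_rate_kbps + (1000-ewma_alpha)*achieved_rate_ewma_kbps)/1000 ))
    fi
}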

patrakov commented 1 year ago

Sounds convincing. I will try later this week on some less-important meeting.

lynxthecat commented 1 year ago

At the moment it's in this branch: https://github.com/lynxthecat/cake-autorate/tree/achieved-rate-ewma.

cake-autorate really is stable now - so hopefully it's now only about focusing on the control aspects.

moeller0 commented 1 year ago

The achieved rate ewma is not going to help here. IIRC, @patrakov's issue seems to be persistence at the minimum rate, and that happens if the delta delay is too high, either sustained or with enough latency spikes to keep the controller in the reduce-rate condition. The estimated load likely has little to no bearing on the controller's operation in that regime.

lynxthecat commented 1 year ago

But it's also possible that the bufferbloat-triggered drops of the shaper rate based on the achieved rate were sometimes too drastic, owing to a bursty achieved rate and dips that coincide with the bufferbloat events. If so, the ewma would smooth over such dips and safeguard against this.

moeller0 commented 1 year ago

Well, this starts to smell like taking the achieved rate into account for next rate selection simply is not a policy that you want for your network then... adding low-pass filters will compromise temporal fidelity. This is fine for the baseline rates, as we explicitly want slow adaptation and want to separate longer-term, lower-frequency changes from the higher-frequency delta... but here I just see yet another toggle to bias the controller for throughput over responsiveness, and we already have plenty of those....

lynxthecat commented 1 year ago

Here is some more data.

cake-autorate master:

https://www.waveform.com/tools/bufferbloat?test-id=8e61edd4-c21e-4039-b539-f86cb97116b3

cake-autorate master having jettisoned achieved rate:

https://www.waveform.com/tools/bufferbloat?test-id=f2434e81-82d4-48e5-9ad8-c5dc89690b6e

cake-autorate achieved-rate-ewma:

https://www.waveform.com/tools/bufferbloat?test-id=e093f8cb-dd5d-4244-8e78-4bb6e59502e9

So, @moeller0, admittedly with this limited data, the first looks best?

But still, looking at:

[image: graph of the achieved rate and its EWMA]

don't you think using the ewma looks appropriate?

The problem I see is that we want to reduce the window size to make the capacity estimate more instantaneous at the point bufferbloat is detected, and yet given bursty data we ought to use a longer window size. It strikes me (and I thought it struck you too) that using an EWMA would to some extent alleviate this tension: we keep the measurement interval small and smooth over the burstiness. From the graph above, it looks healthy to me.

I don't have a good intuitive feel for this, so I very much value your thoughts.

moeller0 commented 1 year ago

So with the last controller I looked at, we use the achieved rate: A) to calculate the load percentage (to decide WHEN to increase the shaper rate), and B) on bufferbloat detection, to catch large reductions in capacity that fall below our `current_shaper_rate * reduction_factor` default for the next shaper rate.

The achieved rate EWMA is essentially the lower-frequency component of the achieved rate time series, which tends to reflect some weighted average over the recent past and hence is more conservative than the full-frequency signal (which deviates from the average in both directions).
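
In symbols (the standard EWMA recurrence, not code from the branch):

y_n = \alpha x_n + (1-\alpha)\, y_{n-1}
    = \alpha \sum_{k=0}^{n-1} (1-\alpha)^k x_{n-k} + (1-\alpha)^n y_0

i.e. a single-pole low-pass filter: exponentially decaying weights over past samples, with smaller \alpha giving heavier smoothing and a longer effective memory.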

For A), this means that we are likely to initiate rate-increase steps a bit later with the EWMA, but since we already do so with a healthy margin, that difference will likely be mostly in the noise. For B), taking the EWMA will simply result in a smaller rate-reduction step. If the capacity drop really was as large as the achieved rate implies, then the smaller reduction step from the EWMA likely means we will need more rate-reduction steps, and that means the latency spike we currently experience (otherwise the controller would not drop the rate) will last longer. If the capacity drop was not that large and the achieved rate was too low for random reasons, the smaller step from the EWMA might not cause a longer latency spike but will conserve more throughput (by virtue of a smaller rate-reduction step).

IMHO this can be summarized as: such an EWMA (depending on the weight of history versus the current value) will bias the controller towards higher throughput and lower responsiveness*. This can be a fine policy (after all, autorate exposes quite a number of toggles to allow tailoring its behaviour for different networks and different desired responsiveness/throughput trade-offs), but it is IMHO not a genuinely better way to operate the controller. It certainly is a bit of a 180 in regards to taking the achieved rate on download as a veridical signal. However, we currently don't do that; we just use it as an additional heuristic to allow shaving off a few reduction steps (and accompanying refractory periods) on truly horrific capacity drops... (and we silently accept that if the rate drop happens not because the capacity dropped, but because for whatever reason all flows were finished and no more data was arriving, we end up making a too-large rate-reduction step; however, in that case we have not much traffic left, so that step will not matter much)

*) Responsiveness is the inverse of latency, if latency is a duration, then responsiveness is a frequency, sort of.

lynxthecat commented 1 year ago

It seems you are thinking the EWMA is, on balance, not worth keeping?

I have this feeling that our rate controller might benefit from being a little less jumpy. I'm not sure how best to achieve that, though?

Relevant parameters might be:

# the load is categorized as low if < high_load_thr and high if > high_load_thr relative to the current shaper rate
high_load_thr=0.75   # fraction of currently set bandwidth for detecting high load

# rate adjustment parameters
# bufferbloat adjustment works with the lower of the adjusted achieved rate and adjusted shaper rate
# to exploit that transfer rates during bufferbloat provide an indication of line capacity
# otherwise shaper rate is adjusted up on load high, and down on load idle or low
achieved_rate_adjust_down_bufferbloat=0.9 # how rapidly to reduce achieved rate upon detection of bufferbloat
shaper_rate_adjust_down_bufferbloat=0.9   # how rapidly to reduce shaper rate upon detection of bufferbloat
shaper_rate_adjust_up_load_high=1.01      # how rapidly to increase shaper rate upon high load detected
shaper_rate_adjust_down_load_low=0.99     # how rapidly to return down to base shaper rate upon idle or low load detected
shaper_rate_adjust_up_load_low=1.01       # how rapidly to return up to base shaper rate upon idle or low load detected
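
Note that a decimal factor like 1.01 only works because it presumably gets scaled to an integer before use: shell arithmetic is integer-only, so 1.01 becomes 1010 and the multiply is followed by /1000, as in update_shaper_rate() above. A sketch of the idea (not the actual conversion code):

# scale a decimal config value by 1000 for use in $(( ... ))
shaper_rate_adjust_up_load_high=$(awk -v x=1.01 'BEGIN { printf "%.0f", x*1000 }')   # 1010

shaper_rate_kbps=20000
shaper_rate_kbps=$(( (shaper_rate_kbps*shaper_rate_adjust_up_load_high)/1000 ))      # 20200
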
patrakov commented 1 year ago

I think that the problem was (when I was still a user) not really that the rate controller was too aggressive but that it did not promptly undo mistakes a posteriori. Having both the current achieved rates and their smoothed versions (to retain the past) should provide the necessary data for such a feature.

lynxthecat commented 1 year ago

Downward corrections seem fast enough. But certainly, if the downward correction was too heavy, then I agree it takes time to recover. I've wondered about something completely different, like rate step adjustments, in which it tries a new rate step, then waits to see how that works before trying a new step or going back to the previous step. But I'm not sure whether this would work. Any thoughts?

Having both the current achieved rates and their smoothed versions (to retain the past) should provide the necessary data for such a feature.

Ah, interesting. Can you expand on that?

lynxthecat commented 1 year ago

Certainly I've wondered if oscillation can be avoided:

[image: graph showing the shaper rate oscillating]

Maybe a timeout on rate increase after a rate reduction? But that will sacrifice bandwidth.

It seems very challenging to think of something robust that will work in all scenarios.

patrakov commented 1 year ago

Well, purely speculative thoughts only...

When the bufferbloat episode ends due to the controller throttling the shaper, a new phase of the rate increase starts. How steep this rate increase is can depend on the achieved rate in the past (i.e. "let's hope it was just a temporary sag" vs "it looks like the link has slightly but irreversibly degraded"), the magnitude of the previous down-step, and so on. But this is firmly in the land of heuristics that @moeller0 has previously objected to, and the usefulness of that is definitely specific to each link. I guess the only scientific way to tackle this is through machine learning, which is not my area of expertise.

Please also take into account that I still have the "base_rate is meaningless" viewpoint and therefore try to use some long-term statistics on the achieved rate as a replacement.

moeller0 commented 1 year ago

It seems you are thinking the EWMA is, on balance, not worth keeping?

I would not say that, but I certainly would default to just using the most recent value... my point is really that this is a policy issue, and policy needs to be set by each network admin configuring autorate; where reasonable, autorate should supply the toggles and default to sane initial values. But while you are exploring this, try to test the EWMA for shaper increases and for shaper decreases independently, to see which side is giving you pain.
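
One way to wire that up for testing would be a pair of toggles selecting the raw or smoothed rate per code path; a sketch with hypothetical names (nothing like this exists in the config yet):

use_achieved_rate_ewma_on_bufferbloat=1   # smoothed rate in the *bb* reduction step
use_achieved_rate_ewma_for_load=0         # smoothed rate in the load calculation

# in the *bb* branch of update_shaper_rate():
if (( use_achieved_rate_ewma_on_bufferbloat ))
then
    rate_for_reduction_kbps="${achieved_rate_ewma_kbps[${direction}]}"
else
    rate_for_reduction_kbps="${achieved_rate_kbps[${direction}]}"
fi
adjusted_achieved_rate_kbps=$(( (rate_for_reduction_kbps*achieved_rate_adjust_down_bufferbloat)/1000 ))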

I have this feeling that our rate controller might benefit from being a little less jumpy. I'm not best sure how to achieve that though?

Generally making the step sizes smaller should help a bit (but for large changes this will require many steps).

Certainly I've wondered if oscillation can be avoided:

Keep in mind that with our 'organic' loads oscillations can also come directly from the senders... and in your case from the scheduling in the base station.

lynxthecat commented 1 year ago

It seems like I should try testing EWMA more. Any recommendations for testing?

lynxthecat commented 1 year ago

@moeller0 and @patrakov how about this: https://github.com/lynxthecat/cake-autorate/commit/6a8c740b8ec53d14b5db0b807ab53744ee31887e?

I'm not sure why I didn't think about introducing an increment refractory period before.
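
Presumably, in the *high* branch, that means gating increases on the time since the last increase as well, mirroring the existing bufferbloat refractory check; a sketch (names in the actual commit may differ):

*high*)
    # increase only if outside the bufferbloat refractory period AND
    # outside a new refractory period since the last increase
    if (( t_start_us > (t_last_bufferbloat_us["${direction}"]+bufferbloat_refractory_period_us) )) &&
       (( t_start_us > (t_last_increase_us["${direction}"]+increase_refractory_period_us) ))
    then
        shaper_rate_kbps["${direction}"]=$(( (shaper_rate_kbps["${direction}"]*shaper_rate_adjust_up_load_high)/1000 ))
        t_last_increase_us["${direction}"]="${EPOCHREALTIME/./}"
    fi
    ;;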

@patrakov you can easily test both the use of ewma on achieved rates and use of increment refractory period using:

./setup.sh lynxthecat/cake-autorate controller-changes

And you can compare with master using:

./setup.sh lynxthecat/cake-autorate master

Any config files should be preserved between such changes.

lynxthecat commented 1 year ago

To give some more data points, with the achieved rate ewma and the increment refractory periods, I see with cake-autorate:

https://www.waveform.com/tools/bufferbloat?test-id=efa16aa0-289f-4d06-b4b1-96850dde6a1c

and with just setting 20Mbit/s both directions manually in cake:

https://www.waveform.com/tools/bufferbloat?test-id=bfabfa1b-58a9-42a5-ac08-c31c4b0cd599

and with setting 25Mbit/s both directions manually in cake:

https://www.waveform.com/tools/bufferbloat?test-id=211d43ff-15b7-43fc-a8c5-36968aea907f

and with just setting 40Mbit/s both directions manually in cake:

https://www.waveform.com/tools/bufferbloat?test-id=87686b73-bacd-4bea-91d7-e4ce50eb9f87

and with setting 60Mbit/s download and 20Mbit/s upload manually in cake:

https://www.waveform.com/tools/bufferbloat?test-id=86bd7f05-7b32-47f9-8d26-b1c1f6c9c049

From the manual tests above I would say that setting cake at around 25Mbit/s in both directions would be around optimal.

cake-autorate does a good job on upload, in that waveform reports an upload bandwidth of 24.6Mbit/s, but not such a good job on download: waveform reports a download bandwidth of only 14.2Mbit/s, and even then a fair amount of latency has crept through.

I realise that seeking to track the actual capacity will inevitably result in some latency increase, but I have the feeling that the controller could perform better in terms of managing the bandwidth/latency trade-off.

Perhaps working with "X samples out of the last Y have absolute delta greater than threshold" is just not cutting it, and we need to try to work with more aggregated statistics or something? I think we need to accept that for variable rate connections like 4G, 5G and Starlink we need to make some trade-offs. If the average RTT is 50ms and waveform reports a 95th percentile RTT of 100ms or even 200ms, I think that's OK. But 400ms is not.

waveform works by calculating percentiles under load. I wonder if cake-autorate should try to keep track of latency percentiles and adjust the cake rate such that the latency percentiles end up in an acceptable range. I mean, supposing the cake-autorate rates are set such that the 95th percentile RTT is not greater than X - wouldn't that be a good/better metric to strive for?

Regardless, it strikes me that experimenting with even radically different approaches may be worthwhile. cake-autorate is stable now. Different controller approaches can be rapidly prototyped and tested by simply tweaking the update_shaper_rate() function:

https://github.com/lynxthecat/cake-autorate/blob/master/cake-autorate.sh#L359

Does anyone have any thoughts or ideas?

moeller0 commented 1 year ago

I realise that seeking to track the actual capacity will inevitably result in some latency increase, but I have the feeling that the controller could perform better in terms of managing the bandwidth/latency trade-off.

Let me repeat: this is a policy question most of all; every admin needs to decide where in the responsiveness/throughput continuum they want to operate.

I think we need to accept that for variable rate connections like 4G, 5G and Starlink we need to make some trade-offs. If the average RTT is 50ms and waveform reports a 95th percentile RTT of 100ms or even 200ms, I think that's OK. But 400ms is not.

The waveform test operates inside a browser, which is a terrible environment for high-precision measurements, so I would not trust this test all that much. E.g., when using Safari I see consistently more and higher outliers than when using Firefox (on the same computer). That is not to reject the test, but one really needs to be careful when interpreting the results.

But most of all, the amount of latency one is willing to accept is, wait for it, mostly a policy question ;)...

I wonder if cake-autorate should try to keep track of latency percentiles and adjust the cake rate such that the latency percentiles end up in an acceptable range. I mean, supposing the cake-autorate rates are set such that the 95th percentile RTT is not greater than X - wouldn't that be a good/better metric to strive for?

For meaningful percentiles we would need quite a number of samples, and on every longer-term rate variation we would likely need to start from scratch; this seems like a great method to make sense of a set of recorded delay samples post hoc, but much less attractive to drive a real-time controller off of.

lynxthecat commented 1 year ago

Let me repeat: this is a policy question most of all; every admin needs to decide where in the responsiveness/throughput continuum they want to operate.

But most of all, the amount of latency one is willing to accept is, wait for it, mostly a policy question ;)...

I get that managing the trade-off between latency and bandwidth is a policy decision. But the reason for this issue is that I think the approach taken in the controller can be improved, and I am keen to see if anyone has ideas for other things to try. We have a nice framework in place now for tweaking the controller and/or trying different approaches.

For meaningful percentiles we would need quite a number of samples, and on every longer-term rate variation we would likely need to start from scratch; this seems like a great method to make sense of a set of recorded delay samples post hoc, but much less attractive to drive a real-time controller off of.

What about moving/rolling percentiles?

https://mjambon.com/2016-07-23-moving-percentile/
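
The estimator in that article nudges a running estimate on every sample, so it needs no sample buffer; a sketch for tracking an approximate 95th percentile of RTT deltas in integer microseconds (step size and names are illustrative):

percentile=950         # target percentile scaled by 1000
rtt_delta_p95_us=0     # running estimate
step_us=500            # adaptation step: larger tracks faster but noisier

update_moving_percentile()
{
    local sample_us="${1}"
    # at equilibrium only 5% of samples land above the estimate, so
    # up-steps are weighted by 950/1000 and down-steps by 50/1000
    if (( sample_us > rtt_delta_p95_us ))
    then
        rtt_delta_p95_us=$(( rtt_delta_p95_us + (step_us*percentile)/1000 ))
    else
        rtt_delta_p95_us=$(( rtt_delta_p95_us - (step_us*(1000-percentile))/1000 ))
    fi
}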

moeller0 commented 1 year ago

But the reason for this issue is that I think the approach taken in the controller can be improved, and I am keen to see if anyone has ideas for other things to try. We have a nice framework in place now for tweaking the controller and/or trying different approaches.

I do not see this... the point is, if the achieved rate craters and the latency shoots up, we really only have two options: treat this as serious and react harshly, to conserve as much responsiveness as we can, or practice laissez-faire and hope this was just a glitch. But if we go the second route and eventually realize the 'glitch' does not pass and hence might be more permanent and cannot be ignored, we still have to react harshly, but now with way more data in flight and in the queues... That is, we can rig our ship for responsiveness or for throughput, but there is a limit to how much responsiveness we can preserve. IMHO that means that to allow policies that prioritize responsiveness, we need the controller to be ready to drop the shaper rate quite nervously...

What about moving/rolling percentiles?

I do not think that percentiles are the way to go, as you essentially end up with a varying latency threshold, but for any use-case you essentially only have a fixed delay budget, and it gets hard to predict how to stay within that limit if the local access latency varies a lot. Don't get me wrong, looking at percentiles is a decent approach to selecting a fixed threshold, but I do not think doing this automatically is a great idea. However, I think lua-autorate explored that (collecting a CDF over the duration of an hour or so)... so testing this might be as easy as getting that to work.

lynxthecat commented 1 year ago

I really appreciate all the input you've given since the beginning of this project. As you can tell, a character quirk of mine is that I can't let things go, and I imagine I'll be working on this for as long as I have a variable rate connection.

One thing I am trying to get a handle on is just how much setting a reasonable static cake rate seems to help. That makes me wonder whether something like this would work: try 20Mbit/s; if that fails, try 15Mbit/s; if that works, try 17.5Mbit/s. Sort of a try-measure-react approach, rather than the aggressive, continually changing approach we have now.
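
Purely as an illustration of that try-measure-react idea, a bisection-style sketch (set_shaper_rate and bufferbloat_detected are assumed helpers; nothing like this exists in the code):

rate_low_kbps=15000    # highest rate seen to be bufferbloat-free
rate_high_kbps=20000   # lowest rate seen to bloat

probe_rate_step()
{
    # try the midpoint, hold it for a while, then shrink the window
    # towards whichever bound the measurement supports
    local trial_kbps=$(( (rate_low_kbps + rate_high_kbps)/2 ))
    set_shaper_rate "${trial_kbps}"
    sleep 5
    if bufferbloat_detected
    then
        rate_high_kbps="${trial_kbps}"
    else
        rate_low_kbps="${trial_kbps}"
    fi
}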

This seems to be really rather a challenging problem, so it's very good that we have something that works as well as it does already. I mean, I use it 24/7 and everything works mostly OK. But there is always room for improvement. And @patrakov has never been entirely satisfied, though admittedly he has a tough connection to work with and seems to be a hard man to please.

Perhaps first I'll try the tweaks you outlined here: https://github.com/lynxthecat/cake-autorate/commit/6a8c740b8ec53d14b5db0b807ab53744ee31887e#commitcomment-122825294, but I'm still waiting for your further input there to help me have a stab at it.

moeller0 commented 1 year ago

I really appreciate all the input you've given since the beginning of this project. As you can tell, a character quirk of mine is that I can't let things go, and I imagine I'll be working on this for as long as I have a variable rate connection.

This is fine and good fun.

One thing I am trying to get a handle on is just how much setting a reasonable static cake rate seems to help. That makes me wonder whether something like this would work: try 20Mbit/s; if that fails, try 15Mbit/s; if that works, try 17.5Mbit/s. Sort of a try-measure-react approach, rather than the aggressive, continually changing approach we have now.

That was something we discussed very early on, where one idea I floated was to only change the rate every X seconds. You (IMHO rightfully) commented that this can easily lead to X-second-long high-bufferbloat epochs...

This seems to be really rather a challenging problem, so it's very good that we have something that works as well as it does already. I mean, I use it 24/7 and everything works mostly OK. But there is always room for improvement. And @patrakov has never been entirely satisfied, though admittedly he has a tough connection to work with and seems to be a hard man to please.

He seems to be in a quite peculiar situation with the deck essentially stacked against him (an unreliable link with noticeable drop-outs, video conferencing tools that do not scale their rate down sufficiently low or sufficiently snappily, and an ISP that rate-limits ICMP). While this situation is unpleasant to be in, it made me think of: https://www.youtube.com/watch?v=ue7wM0QC5LE

Perhaps first I'll try the tweaks you outlined here: 6a8c740#commitcomment-122825294, but I'm still waiting for your further input there to help me have a stab at it.

Ah, I am still on vacation and away from my normal computer, so I somehow missed that (just posted a response there)...