Why? If the achieved throughput is really low, because the bottleneck rate is well below the shaper rate, that is the situation where reducing the shaper rate precipitously makes most sense, no? After all I presume we will still make sure that shaper_rate >= minimum_rate?
This is the issue I have with low load/throughput values: we do not know how to interpret them. The latency-conserving approach IMHO is to believe the achieved rate, so I would not special-case the low load condition, but I have probably not fully understood what you are actually doing with that commit.
But can the rate change that rapidly between such rapid ticks? Maybe it can? So you are thinking of a step decrease in capacity? I can see that being an issue. But I think at least on my line the capacity varies more gently. But that may be subjective; I'm kind of inferring it from the many runs of my code and from looking at the output lines.
I'm persisting with this hack, perhaps for a bad reason: I don't like pushing down download bandwidth upon upload-related bufferbloat. You see, now if upload causes bufferbloat the download rate won't get pushed down to minimum, because the download load is below 25%. Otherwise, on upload bufferbloat the download is pushed all the way to minimum because the achieved rate was zero. That's the issue I'm trying to address.
Are there not also situations when the achieved rate could be lower for some other reason than the true capacity being much lower than the present shaper rate?
I also wondered if it might be useful to distinguish between low, medium and high load anyway.
But I may be mistaken here.
Of course the true fix is to implement OWDs but that also brings in its own problems.
So you are thinking of a step decrease in capacity? I can see that being an issue. But I think at least on my line the capacity varies more gently.
Well, on wireless links conditions can (and do) change rapidly and by pretty large amounts, so betting on gentle changes is just that, a bet. Depending on your "cell" this bet might pay off more often than not, but a stable controller should not employ such a heuristic unconditionally, IMHO.
I'm persisting with this hack, perhaps for a bad reason: I don't like pushing down download bandwidth upon upload-related bufferbloat.
I understand and even share your sentiment, but I think this is the only stable thing to do here; playing games with differentially interpreting achieved throughput values in relation to the set shaper rate is not robust or reliable.
You see, now if upload causes bufferbloat the download rate won't get pushed down to minimum, because the download load is below 25%. Otherwise, on upload bufferbloat the download is pushed all the way to minimum because the achieved rate was zero. That's the issue I'm trying to address.
Yes, I see, but given the problem of figuring out why a load is low, I think that is a heuristic too far for my taste (but I am not implementing this, so it is certainly your decision to make).
Are there not also situations when the achieved rate could be lower for some other reason than the true capacity being much lower than the present shaper rate?
Of course: if there are simply few packets in flight in the respective direction. So the idea of looking at the immediate throughput to deduce which direction is overloaded, and hence the likely culprit for latency-under-load increases, makes intuitive sense. I also assume that this will be correct in many cases. But in case this assumption is wrong, we will end up throttling the other direction to the minimum before we might start to throttle the correct direction.
So if you go down such a route you might keep state about which direction was throttled last and whether that reduced the latency under load; if not, try reducing both directions the next round. But that seems awfully complicated for my taste... especially since:
Of course the true fix is to implement OWDs but that also brings in its own problems.
;)
Personally I think accepting lower throughput than possible is the best option as long as RTTs are in play....
As usual your reasoning is solid and many a time it has saved me from Heath Robinson territory, which I am clearly in danger of straying into here.
Out of curiosity, what if upon detection of bufferbloat based on RTT I fired off one of your 'ntpclient' packets and then, based on the response, used that to ascertain the directionality of the delay and reacted accordingly? Would reacting to that be too slow? If so, I could use my code in this 'testing' branch to knock down bandwidth based on achieved rate on high load, else on reduced shaper rate on low/medium load, but also save the achieved rate state, fire off the 'ntpclient' packet, and if my pre-emptive guess about the reduction was wrong based on the output of 'ntpclient', then knock down the previously deemed 'unloaded' direction, reducing its shaper rate based on the saved achieved rate. Or something along those lines.
This is clearly also a lot of added complication, but ntpclient is great for showing delay direction, though I assume it is less good in terms of granularity. So a clever mix of high-granularity regular RTTs and ntpclient seems like an idea?
Then again, it has to be said that the existing approach in my 'main' branch tends to work pretty well for me on just RTTs. I mean, to be honest, the upload issue is really only an issue during speed tests. During ordinary usage I am not normally downloading and uploading at the same time; mostly it is all about download. Heavy uploading is presumably only something like uploads to OneDrive.
I think the reason this overall approach works well is that, by going back to base rate all the time in either direction, the experience will be lag-free unless one or both sides are heavily loaded. It is this 'turbo' idea. And mostly just one side will be heavily loaded, and at that mostly download. Having to accept upload going back to minimum on download-related bufferbloat isn't a huge biggie. We try to play things conservatively and allow excursions from the safe harbour for e.g. heavy downloads or heavy uploads.
@TalalMash and @patrakov please chip in if you have any ideas or thoughts.
Sorry to keep pestering you @moeller0 but I have a follow-on question.
In this testing branch I classify according to:
(( $dl_load > $high_load_thr )) && dl_load_condition="high_load" || { (( $dl_load > $medium_load_thr )) && dl_load_condition="medium_load"; } || dl_load_condition="low_load"
(( $ul_load > $high_load_thr )) && ul_load_condition="high_load" || { (( $ul_load > $medium_load_thr )) && ul_load_condition="medium_load"; } || ul_load_condition="low_load"
I have medium_load_thr set to 25 and high_load_thr set to 75, so load is classified as low, medium or high based on whether it crosses medium_load_thr or high_load_thr: low_load is 0-25, medium_load is 25-75, and high_load is 75-100.
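Roughly, these per-direction classes then get combined with the delay-based detection into the $load_condition values used below - something along these lines (a simplified sketch rather than the literal script; the detection flag name is just a placeholder):
# simplified sketch: prefix the load class with "bb_" for any tick on which the
# delay-based detection has flagged bufferbloat (flag name is a placeholder)
if (( bufferbloat_detected )); then
	dl_load_condition="bb_${dl_load_condition}"
	ul_load_condition="bb_${ul_load_condition}"
fi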
Then in the shape control I issue:
case $load_condition in
# bufferbloat and medium or high load detected so decrease rate and take into account pseudo capacity estimate
# providing not inside bufferbloat refractory period
bb_high_load|bb_medium_load)
if (( $t_next_rate > ($t_last_bufferbloat+$bufferbloat_refractory_period) )); then
adjusted_achieved_rate=$(( ($achieved_rate*$achieved_rate_adjust_bufferbloat)/1000 ))
adjusted_shaper_rate=$(( ($shaper_rate*$shaper_rate_adjust_bufferbloat)/1000 ))
shaper_rate=$(( $adjusted_achieved_rate < $adjusted_shaper_rate ? $adjusted_achieved_rate : $adjusted_shaper_rate ))
t_last_bufferbloat=${EPOCHREALTIME/./}
fi
;;
# bufferbloat and low load detected so decrease rate without using the achieved rate as a pseudo capacity estimate
# providing not inside bufferbloat refractory period
bb_low_load)
if (( $t_next_rate > ($t_last_bufferbloat+$bufferbloat_refractory_period) )); then
shaper_rate=$(( ($shaper_rate*$shaper_rate_adjust_bufferbloat)/1000 ))
t_last_bufferbloat=${EPOCHREALTIME/./}
fi
;;
# high load, so increase rate providing not inside bufferbloat refractory period
high_load)
if (( $t_next_rate > ($t_last_bufferbloat+$bufferbloat_refractory_period) )); then
shaper_rate=$(( ($shaper_rate*$shaper_rate_adjust_load_high)/1000 ))
fi
;;
# medium or low load, so determine whether to decay down towards base rate, decay up towards base rate, or set as base rate
medium_load|low_load)
if (($t_next_rate > ($t_last_decay+$decay_refractory_period) )); then
if (($shaper_rate > $base_shaper_rate)); then
decayed_shaper_rate=$(( ($shaper_rate*$shaper_rate_adjust_load_low)/1000 ))
shaper_rate=$(( $decayed_shaper_rate > $base_shaper_rate ? $decayed_shaper_rate : $base_shaper_rate))
elif (($shaper_rate < $base_shaper_rate)); then
decayed_shaper_rate=$(( ((2000-$shaper_rate_adjust_load_low)*$shaper_rate)/1000 ))
shaper_rate=$(( $decayed_shaper_rate < $base_shaper_rate ? $decayed_shaper_rate : $base_shaper_rate))
fi
# steady state has been reached
t_last_decay=${EPOCHREALTIME/./}
fi
;;
esac
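To make the bb_high_load|bb_medium_load arithmetic concrete, here is a worked example with illustrative numbers (the rates and the 900/950 per-mille factors are just example values, not necessarily the configured defaults):
# example values only (kbit/s and per-mille adjustment factors)
achieved_rate=20000
shaper_rate=50000
achieved_rate_adjust_bufferbloat=900   # i.e. scale the achieved rate by 0.9
shaper_rate_adjust_bufferbloat=950     # i.e. scale the shaper rate by 0.95

adjusted_achieved_rate=$(( (achieved_rate*achieved_rate_adjust_bufferbloat)/1000 ))   # 18000
adjusted_shaper_rate=$(( (shaper_rate*shaper_rate_adjust_bufferbloat)/1000 ))         # 47500
shaper_rate=$(( adjusted_achieved_rate < adjusted_shaper_rate ? adjusted_achieved_rate : adjusted_shaper_rate ))
echo $shaper_rate   # 18000 - note that a near-idle direction (achieved_rate close to 0) would collapse towards the minimum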
Now I know that distinguishing between bb_high/medium/low is dubious.
But what about just the distinction between medium_load and high_load?
At the moment on load < 75 I decay. Should I instead make it such that:
bufferbloat -> deal with it
high_load -> increase rate
medium_load -> hold the same rate
low_load -> decay rate
Is there any benefit to be had there? Or not really?
Or do you see any other way to improve the rate control inside the case statement?
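To make the question concrete, the shape of the change I'm asking about would be roughly this (an illustrative sketch only; the existing branch bodies are elided):
case $load_condition in
	bufferbloat) : ;;	# reduce shaper rate as now
	high_load)   : ;;	# increase shaper rate as now
	medium_load) : ;;	# hold: leave shaper_rate untouched this tick
	low_load)    : ;;	# decay towards base_shaper_rate as now
esac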
In my main branch I use:
# in case of supra-threshold OWD spikes decrease the rate providing not inside bufferbloat refractory period
bufferbloat)
if (( $t_next_rate > ($t_last_bufferbloat+$bufferbloat_refractory_period) )); then
adjusted_achieved_rate=$(( ($achieved_rate*$achieved_rate_adjust_bufferbloat)/1000 ))
adjusted_shaper_rate=$(( ($shaper_rate*$shaper_rate_adjust_bufferbloat)/1000 ))
shaper_rate=$(( $adjusted_achieved_rate < $adjusted_shaper_rate ? $adjusted_achieved_rate : $adjusted_shaper_rate ))
t_last_bufferbloat=${EPOCHREALTIME/./}
fi
;;
# ... otherwise determine whether to increase or decrease the rate depending on the load
# high load, so increase rate providing not inside bufferbloat refractory period
high_load)
if (( $t_next_rate > ($t_last_bufferbloat+$bufferbloat_refractory_period) )); then
shaper_rate=$(( ($shaper_rate*$shaper_rate_adjust_load_high)/1000 ))
fi
;;
# low load, so determine whether to decay down towards base rate, decay up towards base rate, or set as base rate
low_load)
if (($t_next_rate > ($t_last_decay+$decay_refractory_period) )); then
if (($shaper_rate > $base_shaper_rate)); then
decayed_shaper_rate=$(( ($shaper_rate*$shaper_rate_adjust_load_low)/1000 ))
shaper_rate=$(( $decayed_shaper_rate > $base_shaper_rate ? $decayed_shaper_rate : $base_shaper_rate))
elif (($shaper_rate < $base_shaper_rate)); then
decayed_shaper_rate=$(( ((2000-$shaper_rate_adjust_load_low)*$shaper_rate)/1000 ))
shaper_rate=$(( $decayed_shaper_rate < $base_shaper_rate ? $decayed_shaper_rate : $base_shaper_rate))
fi
# steady state has been reached
t_last_decay=${EPOCHREALTIME/./}
fi
;;
esac
Your thoughts are super welcome!
Side note regarding the use of ntpclient packets: I am afraid that NTP clients inherently assume symmetrical latency, and thus will drift if only one direction is loaded/bufferbloated. I would have agreed with your argument if a GPS clock were available and actually used on both sides (or any other reliable clock whose synchronization does not depend on the link whose latency we want to measure independently for the two directions).
See also what physicist think of this in Special Relativity: https://www.youtube.com/watch?v=pTn6Ewhb27k (especially the piece that starts at 11:50).
@patrakov hmm, I think the clock difference doesn't matter because we just need to look at the deltas. We were already able to use ICMP type 13 packets to ascertain one-way delays by looking at the deltas (we just keep the clock offsets as part of the baseline).
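Roughly, the arithmetic looks like this (just a sketch with placeholder variable names, not the actual script):
# placeholder names (all in ms): t_local_tx/t_local_rx are taken locally when the
# probe leaves and the reply arrives; t_reflector_rx/t_reflector_tx come from the reply
ul_offset=$(( t_reflector_rx - t_local_tx ))	# uplink OWD + clock offset
dl_offset=$(( t_local_rx - t_reflector_tx ))	# downlink OWD - clock offset
# only the deltas against slowly tracked baselines are used, so the constant
# clock offset cancels out of both directions
ul_owd_delta=$(( ul_offset - ul_offset_baseline ))
dl_owd_delta=$(( dl_offset - dl_offset_baseline ))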
@patrakov check this out here:
https://github.com/lynxthecat/CAKE-autorate/blob/shell-owd/sqm-autorate.sh
This uses hping3 - it works! Any chance you could be convinced to make hping3 an official package for OpenWrt? I lack the know-how to do this.
The only thing is that reflectors are picky about ICMP type 13, and if the clocks change on either side that causes a jump. For the most part it really does work, though. I can show you a graph which shows how I was able to have simultaneous download and upload on LTE handled completely independently; see here:
https://forum.openwrt.org/t/cake-w-adaptive-bandwidth/108848/1110?u=lynx
Unfortunately I cannot test the new script, because it would take ages for me to download the SDK required to build the hping3 package. A ready-made mips-24kc package (for TP-Link Archer C7 v2) would help.
Ah, that's my point though - I can build hping3, but it's a pain and I sadly lack the skills to make it an official OpenWrt package. So I've parked this. I am also still a little dubious about using these OWDs because of the reflector pickiness and the way the clocks can change, giving jumps. But I am not entirely sure; I'm still open-minded and would love to test more with hping3 if it could be made an official package. I made a thread about it on OpenWrt but there is clearly just not enough appetite.
BTW that's an old script - I abandoned the OWD approach given lack of any single official OpenWrt package that seems to give reliable ICMP type 13 responses.
'ntpclient' works well in one shot mode, hence my queries above.
As usual your reasoning is solid and many a time it has saved me from Heath Robinson territory, which I am clearly in danger of straying into here.
Me too; it takes some effort not to go overboard with "obvious" heuristics... but simplicity has its own value here, as it makes predicting the intended behavior much simpler, and that allows more stringent testing of the actual implementation ;)
Out of curiosity, what if upon detection of bufferbloat based on RTT I fired off one of your 'ntpclient' packets and then, based on the response, used that to ascertain the directionality of the delay and reacted accordingly? Would reacting to that be too slow? If so, I could use my code in this 'testing' branch to knock down bandwidth based on achieved rate on high load, else on reduced shaper rate on low/medium load, but also save the achieved rate state, fire off the 'ntpclient' packet, and if my pre-emptive guess about the reduction was wrong based on the output of 'ntpclient', then knock down the previously deemed 'unloaded' direction, reducing its shaper rate based on the saved achieved rate. Or something along those lines.
I think the biggest issue is that you still need to keep probing the NTP server to maintain a robust baseline estimate, and if you only query one NTP server that one really needs to be quite reliable... the whole issue about "voting" across a number of ICMP reflectors applies to NTP servers as well. In both cases operating one's own dedicated server should allow reducing the queried set to 1, but for other people's infrastructure I think querying a larger set seems prudent, no? But once you have such a reliable reflector, there is little use in only querying NTP occasionally (at least that is my hope; I have not tested whether that will be reliable, but my hunch is that for a single autorate instance a reflector rate between 5 and 100Hz should suffice, and I would guess a single NTP server should serve that few requests easily; after all, that is part of NTP's reason for existing).
This is clearly also a lot of added complication, but ntpclient is great for showing delay direction, though I assume it is less good in terms of granularity. So a clever mix of high-granularity regular RTTs and ntpclient seems like an idea?
What do you mean by granularity? In my tests calling a single instance takes less than 10ms, which should allow rates up to 100Hz easily (note I am not advocating that; my hunch is that ~10Hz should be good enough).
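For a rough check of that per-call overhead, something like the following can be timed (the server name is just a placeholder, and the exact options depend on the ntpclient build):
# time a single one-shot NTP query; pool.ntp.org stands in for whatever reflector is used
time ntpclient -c 1 -h pool.ntp.org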
Then again, it has to be said that the existing approach in my 'main' branch tends to work pretty well for me on just RTTs.
+1; according to the 80/20 rule of completion versus 20/80 for time spent, I guess your existing code is "plenty good enough" for quite a lot of use-cases.
I mean, to be honest, the upload issue is really only an issue during speed tests. During ordinary usage I am not normally downloading and uploading at the same time; mostly it is all about download. Heavy uploading is presumably only something like uploads to OneDrive.
This is why the rate reduction of the lower-load direction typically is not all that noticeable in throughput, but not doing it will delay the appropriate response if that lower-load direction truly is responsible for the experienced congestion.
I think the reason this overall approach works well is that, by going back to base rate all the time in either direction, the experience will be lag-free unless one or both sides are heavily loaded. It is this 'turbo' idea. And mostly just one side will be heavily loaded, and at that mostly download. Having to accept upload going back to minimum on download-related bufferbloat isn't a huge biggie. We try to play things conservatively and allow excursions from the safe harbour for e.g. heavy downloads or heavy uploads.
I like that rationale!
@TalalMash and @patrakov please chip in if you have any ideas or thoughts.
Yes, please do.
bufferbloat -> deal with it
high_load -> increase rate
medium_load -> hold the same rate
low_load -> decay rate
Is there any benefit to be had there? Or not really?
Not sure, probably worth trying out though... the idea of "stay the course" under some conditions has some attraction. Not sure though whether load (achieved throughput) is a sufficiently robust measure to base this on. Then again, that is what the regress to base-rate is for, no? As that is where the user sets the policy? I guess these two issues are slightly different....
Side note regarding the use of ntpclient packets: I am afraid that NTP clients inherently assume symmetrical latency, and thus will drift if only one direction is loaded/bufferbloated.
Well, the idea is to use the raw values an NTP client would base its estimates on; the ntpclient output contains the four required timestamps. And as @lynxthecat already commented, the code currently maintains drifting baselines....
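For reference, with the usual NTP naming (T1 client transmit, T2 server receive, T3 server transmit, T4 client receive) the per-direction numbers would be formed along these lines (a sketch; variable names are placeholders):
# T1..T4 per the NTP exchange: client tx, server rx, server tx, client rx
ul_offset=$(( T2 - T1 ))	# uplink OWD + clock offset
dl_offset=$(( T4 - T3 ))	# downlink OWD - clock offset
# as with ICMP type 13, tracking a per-direction baseline makes the unknown
# clock offset irrelevant; only changes relative to the baseline matter
ul_owd_delta=$(( ul_offset - ul_baseline ))
dl_owd_delta=$(( dl_offset - dl_baseline ))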
I would have agreed with your argument if a GPS clock were available and actually used on both sides (or any other reliable clock whose synchronization does not depend on the link whose latency we want to measure independently for the two directions).
That is actually what I do at home, but realistically the idea here is not to use this as a super-high-precision time source, but rather as a convenient way to get OWDs easily. As far as I understand, NTP tries to counter transient congestion-based delay increases and mostly does a good job, allowing absolute time synchronization to within a few (dozen) milliseconds over the internet, so it is plenty good enough for us, given the already established baseline tracking.
See also what physicist think of this in Special Relativity: https://www.youtube.com/watch?v=pTn6Ewhb27k (especially the piece that starts at 11:50).
I am quite confident that we can do just fine with autorate without considering relativistic effects ;) The amount of error we can tolerate for our use case is quite large; as long as local and remote clocks drift only relatively slowly, we are fine. Sure, we will not be able to measure the speed of light that way, but that is a stretch goal at best ;)
BTW that's an old script - I abandoned the OWD approach given lack of any single official OpenWrt package that seems to give reliable ICMP type 13 responses.
'ntpclient' works well in one shot mode, hence my queries above.
But it should be easy to harness your new design to just create a test function for nping, no? As I keep repeating, things get easier if all reflectors pretend to deliver OWDs, as the main loop will be identical for OWDs and RTTs (by virtue of never special-casing RTTs, it will just get the same value for both OWDs and be none the wiser).
I thought nping was shown to do a bad job? Or am I mistaken?
Do you think ntpclient calls should be done in isolation rather than mixed in with RTTs? What about my idea of, upon bufferbloat, calling ntpclient, seeing what the result is, and then using that info to adjust the relevant direction? Would that mean reacting too slowly to the bufferbloat? The ntpclient calls could run in the background every 15s, with special one-shot calls in addition to the pings upon bufferbloat detection, just to get the directionality.
I have something of an aversion to the one-shot binary calling because of the overhead. I did think we could just instantiate 15x parallel instances (I think min interval is 15s?) of ntpclient to get 1Hz results coming in. Maybe that's madness though.
I thought nping was shown to do a bad job? Or am I mistaken?
According to our testing so far nping is quite slow, but your new code structure should make it simple to include a function using nping, allowing us to actually test how this performs in real life....
Do you think ntpclient calls should be done in isolation rather than mixed in with RTTs?
That would be the simplest, but it probably requires a dedicated NTP server at the other end; normal NTP servers will consider our polling rates at best rude, or even hostile, and will probably cut us off. To reduce the per-server rate to something acceptable we would need something on the order of 100 servers to poll, which will be quite a flock of processes to handle (in single-shot mode most of these will be sleeping, but still, each reflector loop needs its own process so it can run in the background).
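The back-of-the-envelope arithmetic behind that, assuming a 150ms effective probe interval and a 15s tolerated per-server interval (both just example numbers):
# example numbers only: effective probe interval vs. tolerated per-server interval
probe_interval_ms=150
per_server_interval_ms=15000
servers_needed=$(( per_server_interval_ms / probe_interval_ms ))
echo $servers_needed	# 100 servers, i.e. roughly 100 background reflector processes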
What about my idea of, upon bufferbloat, calling ntpclient, seeing what the result is, and then using that info to adjust the relevant direction?
This will not really be a temporal issue: assuming it takes 50ms to get the RTT and then, say, 40ms for NTP (the NTP overhead is less than 10ms, but it takes RTT + NTP overhead for the results to become available), the response is only delayed for a short while. It is a bit of an optimization to opportunistically reduce the NTP query frequency, but I am not 100% confident it avoids situations where we end up making an NTP query on every second RTT probe, which brings back the desire for either a dedicated NTP server or a large query set, to avoid being rejected by the server when we need the information most. But I have not tried this, so all my assumptions might be off here.
Would that mean reacting too slowly to the bufferbloat? The ntpclient calls could run in the background every 15s, with special one-shot calls in addition to the pings upon bufferbloat detection, just to get the directionality.
That is sort of nice, but as I said above, I do not see this guaranteeing never probing the server at too high a rate for too long a duration...
I have something of an aversion to the one-shot binary calling because of the overhead. I did think we could just instantiate 15x parallel instances (I think min interval is 15s?) of ntpclient to get 1Hz results coming in. Maybe that's madness though.
But if you want to probe effectively at around 150ms with a per-server interval of 15 seconds (which some servers will probably already consider offensive/obnoxious), you need 100 parallel ntpclient instances, at which point having all these binaries open at the same time will be costly as well.... Moreover, I cannot seem to get ntpclient in debug mode (required to see the timestamps) to actually give me more than a single measurement at all... heck, not even in non-debug mode right now....
@TalalMash and @patrakov please can you test and compare this adjusted code vs the one in main? @TalalMash this is close to the adjustment you tested. Now load is distinguished between low, medium and high, and the bufferbloat handling is adjusted to take that into account.
@moeller0 what do you think of this? Please see this commit:
https://github.com/lynxthecat/CAKE-autorate/commit/69cc0b2816c1e3d8499802a249839613a8163116