lynxthecat / cake-autorate

Eliminate the excess latency and jitter terrorizing your 4G, 5G, LTE, Starlink or other variable rate connection!
https://www.bufferbloat.net
GNU General Public License v2.0

Track connection capacity #115

Closed: richb-hanover closed this issue 10 months ago

richb-hanover commented 1 year ago

A wacky thought: I have a 75/75 mbps fiber connection. But I have reason to believe that there's congestion in my ISP's head end, and that during busy times my download (especially) is limited. But I don't want to run speed tests by hand, or do a lot of rigamarole to track the speeds.

If only there were some automatic process that tracks the instantaneous link speeds... Hey! Isn't that what cake-autorate does?

I know the fn_parse_autorate_log.m code will take a (manually triggered) log file and produce a detailed plot. But that can't/shouldn't run continuously, and I'd still have to do work by hand.

My challenge: how could we preserve some representation of link speeds every minute or so? (I would probably want some kind of log file rotation to avoid filling up the "disk".)

Thanks!

lynxthecat commented 1 year ago

I think we already have this capability in the 'testing' branch.

Logs look like this now:

DATA; 2023-01-31-20:04:46; 1675195486.112139; 1675195486.111562; 10457; 282; 52; 1; 1675195486.092880; 9.9.9.9; 1597; 18950; 42350; 8320; 23399; 30600; 18950; 42350; 8320; 23399; 30600; 0; 0; dl_low; ul_idle; 20000; 20000
DATA; 2023-01-31-20:04:46; 1675195486.123505; 1675195486.122957; 10457; 282; 52; 1; 1675195486.110220; 9.9.9.10; 1597; 19477; 26000; 6413; 6523; 30600; 19477; 26000; 6413; 6523; 30600; 0; 0; dl_low; ul_idle; 20000; 20000
DATA; 2023-01-31-20:04:46; 1675195486.179373; 1675195486.178844; 10457; 282; 52; 1; 1675195486.161580; 1.1.1.1; 1598; 17843; 26600; 7183; 8757; 30600; 17843; 26600; 7183; 8757; 30600; 0; 0; dl_low; ul_idle; 20000; 20000
DATA; 2023-01-31-20:04:46; 1675195486.221477; 1675195486.220962; 10457; 282; 52; 1; 1675195486.204280; 9.9.9.11; 1598; 18767; 22900; 6945; 4132; 30600; 18767; 22900; 6945; 4132; 30600; 0; 0; dl_low; ul_idle; 20000; 20000
LOAD; 2023-01-31-20:04:46; 1675195486.251636; 1675195486.251229; 9046; 266; 20000; 20000

And they are auto-rotated as per the config:

# *** Take care with these settings to ensure you won't run into OOM issues on your router ***
# on every write, the cumulative time and bytes associated with the log lines are checked,
# and if either exceeds the configured values below, the log file is rotated
log_to_file=1              # enable (1) or disable (0) output logging to file (/tmp/cake-autorate.log)
log_file_max_time_mins=10  # maximum time between log file rotations
log_file_max_size_KB=2000  # maximum KB (i.e. bytes/1024) worth of log lines between log file rotations

With these logs you can track both latency and bandwidth used. You can also change the log file location, so you could log to a USB drive or a cloud mount.
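For example, a quick shell one-liner (field positions per the LOAD_HEADER, listed later in this thread) pulls the datetime and the achieved download/upload rates out of every LOAD record:

# print LOG_DATETIME, DL_ACHIEVED_RATE_KBPS and UL_ACHIEVED_RATE_KBPS
awk -F'; ' '$1 == "LOAD" { print $2, $5, $6 }' /tmp/cake-autorate.log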

richb-hanover commented 1 year ago

One confounding factor in monitoring link speeds is that a lot of the time there is no/low traffic, so cake-autorate doesn't have much information about the actual link speed at those times. But when traffic picks up, cake-autorate will discover the current max link speed.

My intuition for solving this problem is to record the up/down rate just prior to the load condition that indicates bb (say, dl_high_bb) as an indication of the link speed at that instant of time. Might that work? Or am I off-base? Thanks.

For my future reference, here are the column headings for that DATA; line above...

DATA_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; DL_ACHIEVED_RATE_KBPS; UL_ACHIEVED_RATE_KBPS; DL_LOAD_PERCENT; UL_LOAD_PERCENT; RTT_TIMESTAMP; REFLECTOR; SEQUENCE; DL_OWD_BASELINE; DL_OWD_US; DL_OWD_DELTA_EWMA_US; DL_OWD_DELTA_US; DL_ADJ_DELAY_THR; UL_OWD_BASELINE; UL_OWD_US; UL_OWD_DELTA_EWMA_US; UL_OWD_DELTA_US; UL_ADJ_DELAY_THR; SUM_DL_DELAYS; SUM_UL_DELAYS; DL_LOAD_CONDITION; UL_LOAD_CONDITION; CAKE_DL_RATE_KBPS; CAKE_UL_RATE_KBPS

DATA; 2023-01-31-20:04:46; 1675195486.112139; 1675195486.111562; 10457; 282; 52; 1; 1675195486.092880; 9.9.9.9; 1597; 18950; 42350; 8320; 23399; 30600; 18950; 42350; 8320; 23399; 30600; 0; 0; dl_low; ul_idle; 20000; 20000
lynxthecat commented 1 year ago

Yes, it would work, and it is a possibility. By default we conservatively move the shaper rate back down to the base rate over time, to deal with links like LTE where I believe the connection takes time to pick up from nothing. But this behaviour can be disabled, so that the shaper rate is maintained at the last known good rate for as long as there are no bufferbloat events. @patrakov prefers this approach. It can be set up with appropriate alpha values:

shaper_rate_adjust_down_load_low=1     # how rapidly to return down to base shaper rate upon idle or low load detected 
shaper_rate_adjust_up_load_low=1       # how rapidly to return up to base shaper rate upon idle or low load detected
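(With a factor of 1, the shaper rate is multiplied by 1 on each low-load adjustment, i.e. it is simply held at its last value instead of decaying toward the base rate.)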
richb-hanover commented 1 year ago

Thanks for this feedback. I'm actually too busy with other projects to spend any time on this, but now that I know it's not an outrageous (or completely flawed) approach, I'll let it percolate in my subconscious.

moeller0 commented 1 year ago

The only time we know something certain about the link speed is when our traffic load exceeds the current link capacity and our controller pushes down the shaper rate. So we could log the achieved rates from such events into a separate cyclic log buffer (or approximate this by cycling through two buffers, as we do with the logs). This would work well for download, but would slightly overestimate the achievable speed in the upload direction. Maybe we could just write out a special file containing that data and read it from an exec script for collectd, so the rates are graphed in luci-app-statistics?
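Something like the following could serve as that exec script (a minimal sketch: the /tmp/cake-autorate-capacity file and its "epoch dl_kbps ul_kbps" format are hypothetical, as is the plugin name; the PUTVAL protocol, the COLLECTD_* environment variables and the bitrate type are standard collectd):

#!/bin/sh
# hypothetical collectd exec-plugin script: periodically re-read the last
# capacity estimate written by cake-autorate and emit it in PUTVAL form
HOST="${COLLECTD_HOSTNAME:-$(uname -n)}"
INTERVAL="${COLLECTD_INTERVAL:-30}"
while sleep "${INTERVAL}"
do
    [ -r /tmp/cake-autorate-capacity ] || continue
    # assumed file format: "epoch_s dl_capacity_kbps ul_capacity_kbps"
    read -r _ts dl_kbps ul_kbps < /tmp/cake-autorate-capacity || continue
    # the bitrate type is in bits/s, hence the *1000 on the kbps values
    echo "PUTVAL \"${HOST}/cake_autorate/bitrate-dl_capacity_est\" interval=${INTERVAL} N:$((dl_kbps * 1000))"
    echo "PUTVAL \"${HOST}/cake_autorate/bitrate-ul_capacity_est\" interval=${INTERVAL} N:$((ul_kbps * 1000))"
done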

lynxthecat commented 1 year ago

That's a really interesting idea. So @moeller0 that would reflect the up-to-date estimate of the link capacity and allow some graphing for those inclined? Should we output a new data type like 'CAPEST' for that too?

moeller0 commented 1 year ago

Well, this would not really be up-to-date, in that we would only log the most recent "reliable" capacity estimates, and "most recent" could well be a long time ago (think e.g. in the morning the last reliable estimate might be from the evening before). I think that would probably not matter all that much, since we would log the time of measurement as well and so could report how recent that data is...

richb-hanover commented 1 year ago

Now that #154 is complete, I am comfortable in the knowledge that this code's performance is maxed out and won't get any faster. (Great job!)

I suspect there remains a strong use case for considerably lower-rate pinging that wouldn't tax the CPU at all. I'm thinking of cable modems and (maybe) my fiber ISP that suffers from congestion at certain hours of the day.

But to prove that, I'd want to provide numbers (and that's the point of this issue). And I'm still up to my elbows in another project - so I'll have to live with doubt for a little while longer. Thanks again.

lynxthecat commented 1 year ago

@moeller0 I seek your wisdom yet again now that I have experimented with @richb-hanover's idea to track connection capacity in cake-autorate.

I tried implementing connection capacity estimation as follows:

# bufferbloat detected, so decrease the rate, provided we are not inside the bufferbloat refractory period
*bb*)
    if (( t_start_us > (t_last_bufferbloat_us["${direction}"] + bufferbloat_refractory_period_us) ))
    then
        # the adjust factors are integers scaled by 1000 (e.g. 0.9 stored as 900), hence the /1000
        adjusted_achieved_rate_kbps=$(( (achieved_rate_kbps["${direction}"]*achieved_rate_adjust_down_bufferbloat)/1000 ))
        adjusted_shaper_rate_kbps=$(( (shaper_rate_kbps["${direction}"]*shaper_rate_adjust_down_bufferbloat)/1000 ))
        # use the achieved-rate-based candidate when it is above the floor and implies a steeper cut
        if (( adjusted_achieved_rate_kbps > min_shaper_rate_kbps["${direction}"] && adjusted_achieved_rate_kbps < adjusted_shaper_rate_kbps ))
        then
            shaper_rate_kbps["${direction}"]="${adjusted_achieved_rate_kbps}"
            ### estimated connection capacity in direction: '${direction}' is '${achieved_rate_kbps[${direction}]}' ###
        else
            shaper_rate_kbps["${direction}"]="${adjusted_shaper_rate_kbps}"
        fi
        t_last_bufferbloat_us["${direction}"]="${EPOCHREALTIME/./}"
    fi
    ;;

And it apparently gave meaningful values. Every time I ran a speed test that saturated a direction, I would get a line or two with estimated connection capacity in the relevant direction.

The values seemed fairly sensible, though an important caveat is that the connection capacity estimate is not a reflection of the total bandwidth you'd see in a speed test without cake, when accepting massive bufferbloat; it's something smaller. It's more like the present capacity of the connection available for handling relatively bufferbloat-free transfer.

This brings two questions to mind.

Firstly, if you think tracking these estimates is worthwhile, in what way should we output these values?

As a reminder, here are our present headers:

"DATA_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; DL_ACHIEVED_RATE_KBPS; UL_ACHIEVED_RATE_KBPS; DL_LOAD_PERCENT; UL_LOAD_PERCENT; RTT_TIMESTAMP; REFLECTOR; SEQUENCE; DL_OWD_BASELINE; DL_OWD_US; DL_OWD_DELTA_EWMA_US; DL_OWD_DELTA_US; DL_ADJ_DELAY_THR; UL_OWD_BASELINE; UL_OWD_US; UL_OWD_DELTA_EWMA_US; UL_OWD_DELTA_US; UL_ADJ_DELAY_THR; SUM_DL_DELAYS; SUM_UL_DELAYS; DL_LOAD_CONDITION; UL_LOAD_CONDITION; CAKE_DL_RATE_KBPS; CAKE_UL_RATE_KBPS"

"LOAD_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; DL_ACHIEVED_RATE_KBPS; UL_ACHIEVED_RATE_KBPS; CAKE_DL_RATE_KBPS; CAKE_UL_RATE_KBPS"

"REFLECTOR_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; REFLECTOR; MIN_SUM_OWD_BASELINES_US; SUM_OWD_BASELINES_US; SUM_OWD_BASELINES_DELTA_US; SUM_OWD_BASELINES_DELTA_THR_US; MIN_DL_DELTA_EWMA_US; DL_DELTA_EWMA_US; DL_DELTA_EWMA_DELTA_US; DL_DELTA_EWMA_DELTA_THR; MIN_UL_DELTA_EWMA_US; UL_DELTA_EWMA_US; UL_DELTA_EWMA_DELTA_US; UL_DELTA_EWMA_DELTA_THR"  

I am thinking we should add these into the DATA_HEADER to facilitate plotting the tracked connection capacity. But an alternative could be to create a new header and output values against it.
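For illustration only, a dedicated record type might look something like this (header and values invented here; nothing is implemented yet):

"CAPEST_HEADER; LOG_DATETIME; LOG_TIMESTAMP; DIRECTION; EST_CAPACITY_KBPS"

CAPEST; 2023-01-31-20:04:46; 1675195486.112139; dl; 17500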

Secondly, ought we to feed the estimated connection capacity into the main control somehow?

For example, we could implement a configurable timeout defaulting to 60 seconds in which the shaper bandwidths are not allowed to increase beyond the estimated connection capacity.

Or we could have two alpha rates, chosen depending on how close we are to the estimated connection capacity.
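A minimal sketch of the first of those two ideas (every variable name here is invented; none of this exists in the script yet):

# hypothetical: for a configurable period after the last capacity estimate,
# refuse to raise the shaper rate above that estimate
capacity_est_timeout_us=60000000 # proposed 60 second default

t_now_us="${EPOCHREALTIME/./}"
if (( t_now_us < (t_last_capacity_est_us["${direction}"] + capacity_est_timeout_us) )) &&
   (( shaper_rate_kbps["${direction}"] > est_capacity_kbps["${direction}"] ))
then
    # clamp the increase at the estimated connection capacity
    shaper_rate_kbps["${direction}"]="${est_capacity_kbps[${direction}]}"
fi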

lynxthecat commented 1 year ago

@patrakov maybe you have some thoughts on the above?

patrakov commented 1 year ago

Just a reminder: earlier, I proposed a change (now officially rejected) that, for the download direction, blindly trusts the achieved rate pre-shaper as something that the link definitely can support, even during bufferbloat periods, for the purpose of never setting the shaper below let's say 90% of that. @lynxthecat can you compare?

https://forum.openwrt.org/t/cake-w-adaptive-bandwidth/135379/2640?u=patrakov

https://u.pcloud.link/publink/show?code=kZHbpCVZqWqKFQEzHrkPOiiHSxwUmjtaTWX7

moeller0 commented 1 year ago

> Firstly, if you think tracking these estimates is worthwhile, in what way should we output these values?

Our logfile duration is likely too short to be useful here, so I would push that job off to collectd, and propose to store 4 numbers: shaper rate and load, for both directions, averaged over a reasonable amount of time. With collectd's default sampling being 30 seconds IIRC, maybe return the average rates over 30-second intervals. That means we would need to calculate these values every 30 seconds... Doing this for the load is easy: just take a load sample every 30 seconds and calculate and store the differences between consecutive samples (devil in the details: what to do if a period ends with a stall?). For the set rate that becomes trickier, but I guess we could simply aggregate the number of bytes maximally transferable (according to the shaper setting) in a given interval, by aggregating over all "segments" in our statistics interval... Or we could simply report max and min over the same period, as that requires only two additional variables per direction.

The real challenge, however, is that our sampling of the real achievable rate is rather spotty: we really only know that the achieved rate was at "capacity" when our controller is about to reduce the shaper rate due to overload... (and that only really works for download; for upload the achieved rate might overestimate the true bottleneck rate and simply fill the bottleneck's egress queue). So depending on a link's usage we might have 0 useful samples in a full day...

So no matter how we slice and dice it, using the "organic" offered load as part of our controller is IMHO a decent approach for that controller, but it is rather underwhelming as an estimator of instantaneous link capacity (which I think is what @richb-hanover is after here). Sure, on a reasonably busy link we might be able to get something good enough, but I really am unsure about this.
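For the load half of this, the 30-second sampling could be as simple as differencing the interface byte counters; a sketch only, assuming a wan interface and the usual sysfs statistics, not cake-autorate code:

# average achieved download rate over consecutive 30 s windows
iface="wan"
interval_s=30
prev_bytes=$(cat "/sys/class/net/${iface}/statistics/rx_bytes")
while sleep "${interval_s}"
do
    cur_bytes=$(cat "/sys/class/net/${iface}/statistics/rx_bytes")
    avg_kbps=$(( (cur_bytes - prev_bytes) * 8 / 1000 / interval_s ))
    echo "avg dl rate over last ${interval_s}s: ${avg_kbps} kbps"
    prev_bytes="${cur_bytes}"
done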

lynxthecat commented 1 year ago

So @moeller0, if I read correctly, it seems to me you think outputting the achieved rate on bufferbloat events, or endeavouring to leverage the same for shaper rate control, is not an avenue worth pursuing?

@patrakov I need time to carefully think and work through your points. I will get back on this.

patrakov commented 1 year ago

@lynxthecat sure, take your time and perform the necessary experiments. In any case, I would be interested in seeing whether it helps on your link.

lynxthecat commented 1 year ago

Would you be able to summarize the rate control logic you tested? I had a very quick look, but it seems a summary would be helpful.

> blindly trusts the achieved rate pre-shaper as something that the link definitely can support, even during bufferbloat periods, for the purpose of never setting the shaper below let's say 90% of that

I can't wrap my head around what that means in terms of how the shaper rate is set. Would you be able to copy/paste the code from the update_shaper_rate (formerly get_next_shaper_rate) function?

rany2 commented 1 year ago

> Our logfile duration is likely too short to be useful here, so I would push that job off to collectd, and propose to store 4 numbers: shaper rate and load, for both directions, averaged over a reasonable amount of time. With collectd's default sampling being 30 seconds IIRC, maybe return the average rates over 30-second intervals.

I'm in favor of this. I really don't like how cluttered the logfile already is, and adding more record types would not help whatsoever. collectd is designed for this, and on the plus side many plugins exist that could turn this into nice plots, and even LuCI integration.

lynxthecat commented 1 year ago

@rany2 have you considered @bairhys's: https://github.com/bairhys/prometheus-cake-autorate-exporter

[screenshot: prometheus-cake-autorate-exporter dashboard]

rany2 commented 1 year ago

@lynxthecat I have never heard of it, but surely it would be nicer to have some kind of graphs available directly from the system, instead of relying on some central collection server?

moeller0 commented 1 year ago

> So @moeller0, if I read correctly, it seems to me you think outputting the achieved rate on bufferbloat events, or endeavouring to leverage the same for shaper rate control, is not an avenue worth pursuing?

It really depends on what kind of coverage we want/need. One can argue that without load a link's capacity is irrelevant anyway. My concern is more that what I would like to see is a nice graph showing the capacity over time (for all time bins), but the only way we will get something like that is by making sure the link is sufficiently busy often enough... say, by periodically running a speed/capacity test ;) (at which point we can simply take that speedtest's results as the measure of achievable rate...).

Maybe I am overly concerned and the proposed plot will be useful enough to merit generating it. As I said, I would push the duty of turning this into a graph up to luci-app-statistics...

lynxthecat commented 1 year ago

@moeller0 understood, but I am still left wondering where in our data output format we should place estimated capacity (a new dedicated HEADER, or tagged into DATA, and if so where in DATA?), and also whether it might usefully be fed into the control of the shaper rates.

@rany2 how easy would it be to feed the cake data into collectd, or otherwise to make integration with LuCI possible? Obviously it'd be amazingly cool to go to something like http://192.168.1.1/cgi-bin/luci/admin/status/realtime and see cake-autorate plots.

moeller0 commented 1 year ago

> Just a reminder: earlier, I proposed a change (now officially rejected) that, for the download direction, blindly trusts the achieved rate pre-shaper as something that the link definitely can support, even during bufferbloat periods, for the purpose of never setting the shaper below let's say 90% of that. @lynxthecat can you compare?

What the current code does is take the minimum of "current shaper rate * factor1" and "last achieved rate * factor2". Your proposal is to jettison the first term. I think this is wrong, because the achieved rate necessarily measures the past: if, say, we have a step-like reduction from 100 to 10 "speed units", the last achieved rate will (to simplify) be 100, which after the step is clearly incorrect... My mental model is that we base our decision mainly on the current shaper rate, but take the achieved rate into account only when that would cause a steeper rate reduction, precisely because the achieved rate is by necessity looking into the past.
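In shorthand, the existing selection amounts to something like this (names illustrative; the factors are the integer-scaled *_adjust_down_bufferbloat config values):

candidate_from_shaper_kbps=$(( (shaper_rate_kbps * shaper_rate_adjust_down_bufferbloat) / 1000 ))
candidate_from_achieved_kbps=$(( (achieved_rate_kbps * achieved_rate_adjust_down_bufferbloat) / 1000 ))
# take whichever candidate gives the steeper reduction
new_shaper_rate_kbps=$(( candidate_from_shaper_kbps < candidate_from_achieved_kbps ? candidate_from_shaper_kbps : candidate_from_achieved_kbps ))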

lynxthecat commented 1 year ago

Wait, @moeller0 and @patrakov, let's make another issue for that.

moeller0 commented 1 year ago

> @moeller0 ok, but I am still left wondering where in our data output format we should place estimated capacity (a new dedicated HEADER, or tagged into DATA, and if so where in DATA?), and also whether it might usefully be fed into the control of the shaper rates.

I would simply write these four values to a file; as long as we only do this every 30-60 seconds, it will be in the noise... Or alternatively store them directly in the RRD database... but I guess we need a statistics script anyway to generate the plots from the RRD database....
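A sketch of the direct-to-RRD variant (file path and database layout invented for illustration; at a 30 s step, 2880 rows hold one day of averages):

# one-off creation: four gauges at a 30 s step
rrdtool create /tmp/cake-autorate.rrd --step 30 \
    DS:dl_rate:GAUGE:90:0:U DS:ul_rate:GAUGE:90:0:U \
    DS:dl_load:GAUGE:90:0:U DS:ul_load:GAUGE:90:0:U \
    RRA:AVERAGE:0.5:1:2880

# every 30-60 s: push the four aggregated values
rrdtool update /tmp/cake-autorate.rrd "N:${dl_rate}:${ul_rate}:${dl_load}:${ul_load}"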

lynxthecat commented 1 year ago

@moeller0 so we have clear alternatives:

- output the capacity estimates within the existing log records (in DATA, or as a new dedicated record type); or
- write them periodically to a separate file (or RRD) for collectd to pick up.

And also the question about whether we might feed this information into the shaper rate control, e.g. 60 second timeout before allowing shaper rate to increase beyond this value.

moeller0 commented 1 year ago

I do not think such records make much sense inside the current log files; the information is already deducible from the DATA records. But the main issue is that these logs do not cover the amount of time that @richb-hanover is after (which I assume to be at least a full day with reasonable resolution, maybe even a week or a month...).

> And also the question about whether we might feed this information into the shaper rate control, e.g. 60 second timeout before allowing shaper rate to increase beyond this value.

We already (I think) scale down faster than up, so fixing the upper ceiling for longer periods will help on some links but waste throughput on others... It will also make us prone to persisting at the minimum rate for longer... Again, I can imagine links where this is helpful, but also links that are better served by a nimbler control loop... So e.g. a DOCSIS segment that gets a bit "tight" around primetime will probably work well with the proposed ceiling, but on an LTE/5G link it might not help all that much...

lynxthecat commented 1 year ago

> I do not think such records make much sense inside the current log files; the information is already deducible from the DATA records. But the main issue is that these logs do not cover the amount of time that @richb-hanover is after (which I assume to be at least a full day with reasonable resolution, maybe even a week or a month...)

I see what you mean now. So, for example, a fancy plotting stats routine could look at the existing data to identify the achieved rate at the time bufferbloat is detected. But it's a bit complicated, because our logic:

# bufferbloat detected, so decrease the rate, provided we are not inside the bufferbloat refractory period
*bb*)
    if (( t_start_us > (t_last_bufferbloat_us["${direction}"] + bufferbloat_refractory_period_us) ))
    then
        # the adjust factors are integers scaled by 1000 (e.g. 0.9 stored as 900), hence the /1000
        adjusted_achieved_rate_kbps=$(( (achieved_rate_kbps["${direction}"]*achieved_rate_adjust_down_bufferbloat)/1000 ))
        adjusted_shaper_rate_kbps=$(( (shaper_rate_kbps["${direction}"]*shaper_rate_adjust_down_bufferbloat)/1000 ))
        # use the achieved-rate-based candidate when it is above the floor and implies a steeper cut
        if (( adjusted_achieved_rate_kbps > min_shaper_rate_kbps["${direction}"] && adjusted_achieved_rate_kbps < adjusted_shaper_rate_kbps ))
        then
            shaper_rate_kbps["${direction}"]="${adjusted_achieved_rate_kbps}"
            ### estimated connection capacity in direction: '${direction}' is '${achieved_rate_kbps[${direction}]}' ###
        else
            shaper_rate_kbps["${direction}"]="${adjusted_shaper_rate_kbps}"
        fi
        t_last_bufferbloat_us["${direction}"]="${EPOCHREALTIME/./}"
    fi
    ;;

does more than that. I mean, those estimations only occur in certain circumstances. It would surely be a bit of a faff to try to recreate that from the existing data records?

That's why I'm wondering whether we should include such estimations in the DATA records or in dedicated records.
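For what it's worth, a rough post-hoc extraction along those lines might look like this (a sketch only; field positions are per the DATA_HEADER above, and this crude transition test ignores the refractory-period logic):

# print LOG_DATETIME ($2) and DL_ACHIEVED_RATE_KBPS ($5) each time the
# DL load condition ($24) first enters a *bb* state
awk -F'; ' '$1 == "DATA" && $24 ~ /bb/ && prev !~ /bb/ { print $2, $5 }
            $1 == "DATA" { prev = $24 }' /tmp/cake-autorate.log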

> We already (I think) scale down faster than up, so fixing the upper ceiling for longer periods will help on some links but waste throughput on others... It will also make us prone to persisting at the minimum rate for longer... Again, I can imagine links where this is helpful, but also links that are better served by a nimbler control loop... So e.g. a DOCSIS segment that gets a bit "tight" around primetime will probably work well with the proposed ceiling, but on an LTE/5G link it might not help all that much...

OK, understood. I'll mentally park this for now.