lynxthecat / cake-autorate

Eliminate the excess latency and jitter terrorizing your 4G, 5G, LTE, Starlink or other variable rate connection!
https://www.bufferbloat.net
GNU General Public License v2.0

High CPU usage for CAKE-autorate.sh #17

Closed richb-hanover closed 2 years ago

richb-hanover commented 2 years ago

I'm starting a new issue from #16 that began life as a trouble report, but morphed into observations about performance.

At @lynxthecat's suggestion, I adjusted defaults.sh to slow the ping rate using:

ping_reflector_interval=0.2 # (seconds)
main_loop_tick_duration=1000 # (milliseconds)

It did improve the utilization: Here's a video of htop while idle, then under load, and then recovering from the load. https://youtu.be/pUZ7MpN85K4

PS I can't really tell whether this code addresses latency. I have a fixed-speed fiber link through my ISP at 25/25 Mbps, so my link speed doesn't vary much, and I just cannot say whether it makes a difference.

moeller0 commented 2 years ago

Mmmh, if you need flock your writers are apparently in a race to write, and flock usage can then result in out-of-order writing, so read data needs to be sorted by timestamp by the reader....

lynxthecat commented 2 years ago

Actually, from testing I am not seeing corruption with respect to the data transfer through the fifo.

@richb-hanover I wonder if you could post the data you have in your Excel for me to look at?

richb-hanover commented 2 years ago

I should have said: the raw data came from the lines I posted in the previous message.

richb-hanover commented 2 years ago

I'm not sure what you mean by "flock the writes". That said, I wonder whether it's possible to write the start time as data in the ping packet, and then retrieve it from the response packet...

PS The updated ...Skipping... line is:

((($t_start - ${timestamp//[\[\].]})>500000)) && echo "WARNING: encountered response from [" $reflector "] $t_start ${timestamp//[\[\].]} is > 500ms old. Skipping." && continue
lynxthecat commented 2 years ago

@richb-hanover is this just after a period of inactivity? Because the pingers are put to sleep after 60s of inactivity and when they are woken up the skipping flushes out some old data on resume, if that makes sense. Or do you see it in general?

If in general it could well just be that it is normal/expected behaviour because the processing rate is not keeping up with the incoming data and so it has to skip past old data values.

Explanation: ping results enter at a rate of (number of reflectors)/(ping interval), so with 0.1s and 4 reflectors that is 40Hz. If the processing rate cannot keep up with this, the script starts to lag behind in processing the results as they come in (in the main while read loop). So the timestamps of the results it sees start to look too old, i.e. > 500ms before the present time, and when the main while loop encounters these out-of-date lines it just skips past them, like slipping a clutch, to catch up with the new data that is coming in.

So when you reduce the number of reflectors and/or increase the ping interval, the amount of processing required is significantly lower, presumably to a point where the Archer C7 can keep up with the data rate, and this results in fewer or zero skips.

Does that make sense?

richb-hanover commented 2 years ago

@richb-hanover is this just after a period of inactivity? Because the pingers are put to sleep after 60s of inactivity and when they are woken up the skipping flushes out some old data on resume, if that makes sense. Or do you see it in general?

That's a good description of the algorithm. I don't think it quite matches my observations. Here's my test process:

Here is the latest test output:

WARNING: encountered response from [ 8.8.4.4 ] 1647697141585781 1647697141081998 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.1.1.1 ] 1647697141758522 1647697141228268 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] 1647697141853458 1647697141333720 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.4.4 ] 1647697142013683 1647697141483457 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647697142123925 1647697141598919 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.4.4 ] 1647697142234548 1647697141685495 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] 1647697142246598 1647697141736935 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.1.1.1 ] 1647697142345599 1647697141830188 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] 1647697142468277 1647697141936564 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] 1647697142644847 1647697142137290 is > 500ms old. Skipping.
lynxthecat commented 2 years ago

I believe this is because when you run betterspeedtest.sh your CPU utilisation shoots up and there are not sufficient spare cycles to keep up with processing the incoming ping results (entering at 40 lines per second: 4 reflectors * 10 ping results per second), necessitating skipping. So I think you would see the same if you were to engage any other arbitrary process that causes the CPU usage to spike.

So I think it should be more or less entirely solved by just dropping the rate to something more sane for the C7 like 500ms and 2 reflectors? That way the main processing loop should have no trouble handling the rate of data coming in (say 4 lines per second), i.e. even with the CPU spike caused by betterspeedtest.sh (and the resulting data flows), there should be sufficient CPU cycles to handle the data that comes in to the main processing loop.

Does this make sense?

moeller0 commented 2 years ago
  • (immediately - within 10 seconds) switch to a third window and run betterspeedtest.sh - the ...Skipping... messages come out almost immediately (Hmmm... actually, about 1 or 2 seconds) after the download phase starts

I agree with @lynxthecat: running netperf on the router, especially an old one like the C7, is asking for CPU overload. It can be seen as a stress test, but IMHO it is harsh to expect hardware that old to deal with both too high an autorate sample rate (40Hz) and the additional load of sourcing/sinking speedtest packets...

richb-hanover commented 2 years ago

OK. That sounds right. To confirm, I will run the speed test on my laptop via wifi (so the data passes through the router, instead of running the test on the router) and report back.

PS You asked this a while back but I never responded: I also see high CPU use using the Lua implementation from perhaps a week ago. Here's the report from that repo... https://github.com/sqm-autorate/sqm-autorate/issues/147

moeller0 commented 2 years ago

I will run the speed test on my laptop via wifi (so the data passes through the router, instead of running the test on the router) and report back.

If you could hook up the laptop via ethernet first that would help, as wifi processing puts additional load on the router. Yes, eventually this needs to work over wifi, but for debugging/understanding it might make sense to start with wired, no?

lynxthecat commented 2 years ago

I think from my testing I may not need flock. It seems to be working just fine without, so I guess the writes are all atomic. (POSIX does guarantee that writes of up to PIPE_BUF bytes, 4096 on Linux, to a fifo are atomic, so short single-line records should never interleave.)

lynxthecat commented 2 years ago

After having implemented a few performance enhancements, here is what I see on my RT3200 now at 40Hz:

image

But that may not be a good measure. Using 'time' it looks like usage is around 25% - 30%.

Marctraider commented 2 years ago

@lynxthecat Testing the new script now, with very low intervals. I see the config script got a nice overhaul, and you seem to have automated the number of reflectors depending on how many you fill into the reflector IP field. (I think?)

Is it also possible to have an ICMP payload option to reduce it (for whoever is willing?) to conserve some extra bandwidth?

Just a simple 0 or 1 variable that will set the absolute lowest possible icmp payload size maybe?

root@redundant:~/CAKE-autorate# ping -s 16 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 16(44) bytes of data.
24 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=4.91 ms
24 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=4.88 ms

That's the lowest payload I can go.

I can modify the script myself, including the variables, but it's just an idea.

With a ping rate of 100 per second we would normally be at 384000 bytes per minute, or ~23MB per hour; for my VPS that is basically doubled, as its connection is mirrored on two separate lines. Of course in practice it won't be as much, because the script will go to sleep a lot of the time.

As for testing, top shows:

1137004 root      20   0    7020   3768   3200 S   5.3   0.1   0:05.30 CAKE-autorate.s
1137034 root      20   0    7280   3348   2484 S   4.7   0.1   0:04.67 CAKE-autorate.s

Looks like when I change reflector_ping_interval=0.02 # (seconds) to reflector_ping_interval=0.01 # (seconds), the ping process jumps from ~1% to something like 50% CPU usage. That looks like a lot of extra CPU usage for a relatively small change?

Marctraider commented 2 years ago

Suffice it to say, with reflector_ping_interval=0.015 # (seconds) and the ICMP payload at 16 bytes (+8 bytes of ICMP header), the script works extremely well.

image

CPU load is more than acceptable; this was with 50 simultaneous TCP streams, hitting close to the cap before bufferbloat.

Also, removing the fast line while doing this barely makes latency budge; maybe a 20-25ms spike once in a while.

idle:

image

lynxthecat commented 2 years ago

@Marctraider so far so good, I am pretty happy with all of this. And especially that removing the fast line during a transfer, so that the usable capacity drops whilst packets are in transit, results in next to no latency budge. That seems like a super stress test. I think this means the script is behaving correctly.

Things seem to be going in the right direction with this script.

There are still a couple of outstanding CPU gains for me to work on. For example I discovered recently that the $(printf) call to convert RTTs from float to integer in microseconds is costly. I am going to replace that with bash pattern matching, and just use one match with multiple capture groups rather than two separate matches for sequence and RTT as I do presently. This should reduce the CPU usage per reflector a little. But since it's per reflector, that will add up.

Curious about CPU load jumping up so much from: reflector_ping_interval=0.02 to 0.01. These are seriously low numbers. Bear in mind that the main loop has to process every ping result line. That is 100Hz per reflector at 0.01. I wonder what is killing the CPU so much in jumping from 0.02 to 0.01. I mean, sure, the frequency doubles from 50Hz per reflector to 100Hz per reflector. But why the CPU jump from 1% to 50%? Perhaps the 'tc qdisc change' calls? Could you try commenting those out with 0.01 to see if that is the culprit. Or perhaps it is just the common fifo writing / reading that grinds to a halt at a certain rate? Or perhaps it could be 'update loads' since that requires reading in the rx_bytes and tx_bytes files to work out the load - perhaps at a certain reading rate the system struggles.

I wonder to what extent lowering from 0.02 to 0.01 will improve performance for you.

Are you just using one reflector (your VPS)? So 0.02 and 0.01 then mean 50Hz and 100Hz in terms of the rate at which the main processing loop needs to process incoming ping results.

Could you please try comparing performance with the 'fping' flavour here:

https://github.com/lynxthecat/CAKE-autorate/tree/fping

I have not worked on that code as much - it was more just to test fping. I'd be curious if this allows you to reach 0.01 without the big CPU jump. Generally I like the idea of maintaining separate ping streams because in the general use case it facilitates better control over individual reflector paths. So for example if one reflector goes bad I could stop that ping stream and keep the others going. Whereas with fping (which allows pinging in round robin and just spits out output lines from all the reflectors) it is then necessary to kill fping in its entirety and re-call with modified reflectors.

@moeller0 any thoughts on the above from networking perspective since this is straying beyond my area of expertise?

Regarding:

Is it also possible to have an ICMP payload option to reduce it (for whoever is willing?) to conserve some extra bandwidth?

Yes no problem I will just add this as a user-configuration option.

@moeller0 what would you say is the optimum for the general case and for @Marctraider?

richb-hanover commented 2 years ago

These are interesting observations, and a couple of follow-on thoughts came to mind (quoted piecewise in the reply below).

Thanks for listening!

moeller0 commented 2 years ago

$(printf)

Are you sure it is printf or the sub-shell invocation? If the latter, you could try printf -v my_printfed_var to assign the value to a variable without the subshell?
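
For illustration, a minimal timing sketch of the difference (the loop and variable names are made up for the comparison):

rtt="12.4"

time for i in {1..10000}; do
    rtt_int=$(printf '%.0f' "$rtt")   # command substitution forks a subshell per call
done

time for i in {1..10000}; do
    printf -v rtt_int '%.0f' "$rtt"   # builtin-only assignment, no fork
done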

That is 100Hz per reflector at 0.01.

Many reflectors are rate limited; querying individual reflectors at rates like that is asking for trouble.... Don't do that...

Perhaps the 'tc qdisc change' calls?

Aren't these guarded by the refractory period after a change?

Is it also possible to have an ICMP payload option to reduce it (for whoever is willing?) to conserve some extra bandwidth?

For this keep in mind ethernet's minimal packet size: below a certain value, reducing the packet size will not reduce the bandwidth... also the ICMP size needs to be large enough to fit the timespec of the sender (IIRC that is larger on 64bit systems than on 32bit systems)...
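
Back-of-the-envelope, for IPv4 over Ethernet: the minimum frame is 64 bytes (14 byte header + payload + 4 byte FCS), so anything below 46 bytes of payload gets padded on the wire. With a 20 byte IP header and an 8 byte ICMP header, that leaves 18 bytes of ICMP data below which shrinking buys nothing; and a 64-bit timespec is 16 bytes, which is why ping -s 16 (as above) is about the practical floor.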

@moeller0 what would you say is the optimum for the general case and for @Marctraider?

I would assume that the default size is what most code out there is likely to be tested with and tuned for, so I would tend to stick to that as the default... I would, however, not ping at 100Hz which also will reduce the load from the ping traffic....

  • re: printf(). I wondered about that, and am not surprised to hear the printf() calls are expensive. Might it be helpful to wrap those calls in an if statement so they're not evaluated at all if debugging/printing isn't turned on?

+1, that sounds sane independent of the printf cost....

  • Am I correct that you're talking about 20 msec vs 10 msec polling and its effect on CPU usage? A frequent observation in queuing theory (as in networking) is that there's a knee in the performance curve where delay increases abruptly as you increase traffic. In this case, I suspect the fixed overhead of some operation begins to dominate, and the processor spends "all its time" on those fixed, unavoidable overheads.

It makes oodles of sense to test how tightly the experience with this script depends on the RTT sampling rate... My assumption is that probably 10-20Hz should be sufficient and 100Hz is not going to be sufficiently better to justify the noticeable increase in cost.... but I have not tested that so ...

  • (I like the fact that the algorithm only starts pinging when there's interesting traffic, so that will cut it down.)

The cost of that however is that the baseline estimates get stale.... on a volume-limited link still better than constant pinging, but on a truly unlimited link not so much (I would prefer to simply thin the pings out to 1/sec, then 1/minute). Estimating the load is not that hard: multiply the ping size by its rate and report that to the user.
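
For example, with the default 56 byte payload: 56 + 8 (ICMP) + 20 (IPv4) = 84 bytes per probe, so at 20Hz that is 84 * 20 = 1680 bytes/s, roughly 13 kbit/s per reflector before link-layer overhead.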

  • re: User options. I dislike them. They seem easy to implement, but... They're extra code that needs to be tested in combination with every other option. They still can break. They need to be documented. People need to understand that documentation and use the option correctly. But most importantly, a lot of the time the algorithm can "do the right thing" without asking the human, so it's worth experimenting with a couple of approaches and simply picking the best one. (In the case of the ICMP payload, is there ever a case in production where less bandwidth would be a bad choice?)

Yes: if the reflector either puts the packet into a slow path or drops unusual packet sizes completely, or when the size is too small to fit the timespec, as in that case the ping will not return an RTT...

lynxthecat commented 2 years ago

Are you sure it is printf or the sub-shell invocation? If the latter, you could try printf -v my_printfed_var to assign the value to a variable without the subshell?

I could also try that, but you see I thought I could and should reduce CPU usage anyway by combining the separate matches:

[[ $seq_rtt =~ time=+([0-9.]*)[[:space:]]+ms+ ]]; rtt=${BASH_REMATCH[1]}

# If output line of ping does not contain any RTT then skip onto the next one
[ -z "$rtt" ] && continue

[[ $seq_rtt =~ icmp_seq=([0-9]*) ]]; seq=${BASH_REMATCH[1]}

by using multiple capture groups - three in total: one for the seq and two for the RTT, i.e. the RTT is everything before the '.' plus, if there is a match on anything after the '.', that added to the RTT. This is because ping outputs RTTs like 12.4 or 99.9, but then 100 or 124 for larger values.

I need the matching anyway to extract seq and rtt in a reliable way, since the output of ping shifts depending on whether an IP is given or an address like google.com. So then I can just use the output of the matching plus bash arithmetic to build up the RTT in microseconds. Matching also gives better resilience against bad output lines.

That is 100Hz per reflector at 0.01.

Many reflectors are rate limited; querying individual reflectors at rates like that is asking for trouble.... Don't do that...

I wouldn't dream of this for a public reflector, but @Marctraider is using this on his own reflector.

Perhaps the 'tc qdisc change' calls?

Aren't these guarded by the refractory period after a change?

Perhaps I need to have another think about how refractory periods are handled within the case statement on load_condition:

https://github.com/lynxthecat/CAKE-autorate/blob/main/CAKE-autorate.sh#L48

At the moment I have:

# in case of supra-threshold OWD spikes decrease the rate providing not inside bufferbloat refractory period
bufferbloat)
    (( $t_next_rate > ($t_last_bufferbloat+$bufferbloat_refractory_period) )) && 
    cur_rate=$(( ($rate*$rate_adjust_bufferbloat)/1000 )) && 
    t_last_bufferbloat=${EPOCHREALTIME/./}
;;
# ... otherwise determine whether to increase or decrease the rate in dependence on load
# high load, so increase rate providing not inside bufferbloat refractory period 
high_load)  
    (( $t_next_rate > ($t_last_bufferbloat+$bufferbloat_refractory_period) )) && 
    cur_rate=$(( ($cur_rate*$rate_adjust_load_high)/1000 ))
;;
# low load, so determine whether to decay down towards base rate, decay up towards base rate, or set as base rate
low_load)
    if (($t_next_rate > ($t_last_decay+$decay_refractory_period) )); then

So an increase in load without bufferbloat is not subject to any refractory period. Should it be? My logic for not having one is that, as you said earlier @moeller0, any increase is already gated on the load being > the high load threshold (e.g. 75%). But for sure, at the moment this means that on a sustained load increase 'tc qdisc change' could be called every tick.

Also at the moment a bufferbloat incident will block subsequent bufferbloat reductions, but not decay reductions. Perhaps it should? Certainly a decay refractory period should not block a bufferbloat reduction because that should be immediate (unless a bufferbloat reduction has already recently occurred and we are still within the bufferbloat refractory period).

Not sure how best to manage these refractory periods.

It makes oodles of sense to test how tightly the experience with this script depends on the RTT sampling rate... My assumption is that probably 10-20Hz should be sufficient and 100Hz is not going to be sufficiently better to justify the noticeable increase in cost.... but I have not tested that so ...

@Marctraider what is the optimum ping interval you have identified in your testing using your own VPS?

lynxthecat commented 2 years ago

@moeller0 this gives 1% per reflector rather than 2% per reflector in top on my RT3200:

# ping reflector, maintain baseline and output deltas to a common fifo
monitor_reflector_path() 
{
    local reflector=$1
    local rtt_baseline=$2

    while read -r  timestamp _ _ _ reflector seq_rtt
    do
        # If no match then skip onto the next one
        [[ $seq_rtt =~ icmp_seq=([0-9]+).*time=([0-9]+)\.?([0-9]+)?[[:space:]]ms ]] || continue

        seq=${BASH_REMATCH[1]}

        # BASH_REMATCH[2] holds the whole ms, BASH_REMATCH[3] the fractional ms digits (may be empty)
        rtt=${BASH_REMATCH[3]}000

        # pad the fraction to three digits and add it to the whole ms converted to microseconds
        rtt=$((${BASH_REMATCH[2]}000+${rtt:0:3}))

        reflector=${reflector//:/}

        rtt_delta=$(( $rtt-$rtt_baseline ))

        alpha=$alpha_baseline_decrease
        (( $rtt_delta >=0 )) && alpha=$alpha_baseline_increase

        # integer EWMA: alpha is in thousandths, so e.g. alpha=100 weights the new RTT at 10%
        rtt_baseline=$(( ( (1000-$alpha)*$rtt_baseline+$alpha*$rtt )/1000 ))

        printf '%s %s %s %s %s %s\n' "$timestamp" "$reflector" "$seq" "$rtt_baseline" "$rtt" "$rtt_delta" > /tmp/CAKE-autorate/ping_fifo

    done < <(ping -D -i $reflector_ping_interval $reflector & echo $! >/tmp/CAKE-autorate/${reflector}_ping_pid)
}

What do you think? I think this is perhaps as good as it gets for the per-reflector process.

Marctraider commented 2 years ago

@lynxthecat So far 0.015s seems optimal, I think, given the constraints that the 'bug' introduces. I don't think I want to go any lower, wasting more energy than necessary. 10% total for the script and child processes should be the max acceptable under load.

I already tried disabling debugging again to see if 0.01 would alleviate the sudden massive ping CPU usage, but I'm pretty sure that didn't fix it.

Of course my own reflector can be ICMP-spammed to the limits of its 1/1Gbit full duplex, 99.99% uptime VPS with 30TB of bandwidth a month :P

So tweaking-wise, with the current usage of the total package I wouldn't even want to go lower than 0.01, but I'm still curious why this happens.

In the fortunate case that the script might become even more efficient, I wouldn't mind going to 0.01 or even 0.005.

I can try the fping method.

I only changed the bufferbloat infractions from 2 to 1; the buffer of 4 stays the same, the RTT threshold is at 15, and of course only 1 reflector in the field. (Is it right to assume the script auto-adjusts the reflector count based on how many entries there are?)

Achieved bandwidth is excellent and as expected.

Another good question might be: since I control my own endpoint, would it benefit me to apply QoS on the server side instead? Or both? (As in, egress server side is ingress on the client side, and vice versa; client-side egress is server-side ingress, so both sides basically only control their egress.)

What about ECN (BBR) on both sides?

moeller0 commented 2 years ago

What do you think?

This is ugly as hell.... I guess if this is really only half as costly as doing printf -v rtt ... it might be worth it, but I feel dirty reading this ;)

Also, I really think the rtt should be split into ul_delay and dl_delay (as well as baseline and delta), and the record should be ended with a repetition of the timestamp as an indicator for potential partial overwrites....

@lynxthecat So far 0.015s seems optimal, I think, given the constraints that the 'bug' introduces. I don't think I want to go any lower, wasting more energy than necessary. 10% total for the script and child processes should be the max acceptable under load.

The question is: is 15ms noticeably better than 30ms?

In the fortunate case that the script might become even more efficient, I wouldn't mind going to 0.01 or even 0.005.

But why? There are going to be diminishing returns on pushing this ever lower; the question should be how large this can be made without control-loop tightness suffering. So the optimization goal here should be as large as possible ;)

Another good question might be: since I control my own endpoint, would it benefit me to apply QoS on the server side instead?

Yes, egress shaping tends to be simpler, so shaping on the VPS for your download direction seems sane, but you may need to reduce the ICMP probe rate a bit...
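
A minimal sketch of that on the VPS (interface name and rate are placeholders):

# shape VPS egress, which is the client's download direction
tc qdisc replace dev eth0 root cake bandwidth 150Mbit

Shaping there is pure egress, so no ifb/ingress redirection is needed on the VPS side.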

What about ECN (BBR) on both sides?

That is orthogonal to getting the shaper set correctly. ECN can help, but not in fighting bufferbloat per se; only in that congestion signaling reaches the endpoints slightly faster via ECN instead of dropping, as after a dropped packet two more packets need to arrive at the receiver before they trigger dupACKs, while a single CE mark will achieve similar results.

Marctraider commented 2 years ago

Hey, you pinged the wrong person, but that's fine. I'm sure at some point there will be diminishing returns, but I haven't gone through testing this yet. It would be easier if I could create my own graphs based on tests, but I don't really have an infrastructure going for that at the moment. Also, the lines can vary in quality and load depending on the time of day.

Frankly, the question for me would be: what's the optimal lowest interval versus the bandwidth it costs (with a lower ICMP payload)? My line is reasonably wide when it comes to bandwidth and bandwidth limitations are of no concern, so the only other aspect is CPU load itself.

As for @lynxthecat, I'll assume he'll want to figure out what's causing an increase of CPU from single digits to 50+% on the ping process whenever the interval drops from 0.02 (or even 0.015) to 0.01. Because it doesn't make much sense.

moeller0 commented 2 years ago

As for @lynxthecat, I'll assume he'll want to figure out what's causing an increase of CPU from single digits to 50+% on the ping process whenever the interval drops from 0.02 (or even 0.015) to 0.01. Because it doesn't make much sense.

That could be anything... wild guess: the ping binary might switch between sleep and busy polling depending on the requested interval (as short sleeps are notoriously imprecise)...

lynxthecat commented 2 years ago

It looks like you may be right @moeller0:

root@OpenWrt:~/CAKE-autorate# time /usr/bin/ping -c 1000 -i 0.02 192.168.1.2 > /dev/null
real    0m 20.62s
user    0m 0.06s
sys     0m 0.21s
root@OpenWrt:~/CAKE-autorate# time /usr/bin/ping -c 1000 -i 0.01 192.168.1.2 > /dev/null
real    0m 10.14s
user    0m 1.95s
sys     0m 4.75s

Here is the same with fping:

root@OpenWrt:~# time fping -c 1000 -p 20 192.168.1.2 > /dev/null

192.168.1.2 : xmt/rcv/%loss = 1000/1000/0%, min/avg/max = 0.960/2.08/8.84
real    0m 19.98s
user    0m 0.31s
sys     0m 0.00s
root@OpenWrt:~# time fping -c 1000 -p 10 192.168.1.2 > /dev/null

192.168.1.2 : xmt/rcv/%loss = 1000/1000/0%, min/avg/max = 1.000/2.05/6.99
real    0m 10.11s
user    0m 0.11s
sys     0m 0.18s
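
Reading those as (user+sys)/real CPU share: iputils-ping goes from (0.06+0.21)/20.62, about 1.3%, at -i 0.02 up to (1.95+4.75)/10.14, about 66%, at -i 0.01, while fping stays below 3% at both rates.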
moeller0 commented 2 years ago

Ah, there is also the thing that -i < 0.2 requires super user privileges IIRC, so there might be extra checks involved?

Marctraider commented 2 years ago

It looks like you may be right @moeller0:

root@OpenWrt:~/CAKE-autorate# time /usr/bin/ping -c 1000 -i 0.02 192.168.1.2 > /dev/null
real    0m 20.62s
user    0m 0.06s
sys     0m 0.21s
root@OpenWrt:~/CAKE-autorate# time /usr/bin/ping -c 1000 -i 0.01 192.168.1.2 > /dev/null
real    0m 10.14s
user    0m 1.95s
sys     0m 4.75s

Here is the same with fping:

root@OpenWrt:~# time fping -c 1000 -p 20 192.168.1.2 > /dev/null

192.168.1.2 : xmt/rcv/%loss = 1000/1000/0%, min/avg/max = 0.960/2.08/8.84
real    0m 19.98s
user    0m 0.31s
sys     0m 0.00s
root@OpenWrt:~# time fping -c 1000 -p 10 192.168.1.2 > /dev/null

192.168.1.2 : xmt/rcv/%loss = 1000/1000/0%, min/avg/max = 1.000/2.05/6.99
real    0m 10.11s
user    0m 0.11s
sys     0m 0.18s

I will try the same on my system.

lynxthecat commented 2 years ago

Note to self - I need to implement this proposal:

What I would like to see is for all CPUs/HT-siblings 100-%idle... loadavg is not the right measure here... I would use the cpuN lines from cat /proc/stat

root@turris:~# cat /proc/stat 
cpu  5909064 26936 1697407 210303430 1809247 0 1440728 0 0 0
cpu0 3066461 13859 842812 105338697 943297 0 388277 0 0 0
cpu1 2842603 13077 854595 104964733 865950 0 1052451 0 0 0
intr 1361849891 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 718231763 0 0 111179 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20486552 15267694 147090191 0 0 0 0 0 2811330 0 14517374 0 2 2 0 0 0 0 0 0 0 40 0 0 4 0 3 33 0 0 0 0 0 0 0 0 0 0 0 1 0 113784447 140749112
ctxt 1038967495
btime 1647964976
processes 2324144
procs_running 1
procs_blocked 0
softirq 1549519881 0 187760591 164873342 579513036 5016228 0 297775348 214348509 0 100232827

for that purpose, and I would ignore the governor/power scheduler effects... (or I would try to also add information about the CPU frequencies, but that gets approximate pretty quickly, so I think ignoring that is just fine)...

https://forum.openwrt.org/t/bufferbloat-continuous-measurement-script/124217/12?u=lynx
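
A minimal sketch of that idea (field layout per proc(5); illustrative only, not repo code):

#!/bin/bash
# per-CPU utilisation as 100 - %idle, from two /proc/stat samples
mapfile -t before < <(grep '^cpu[0-9]' /proc/stat)
sleep 1
mapfile -t after < <(grep '^cpu[0-9]' /proc/stat)

for i in "${!before[@]}"; do
    read -r cpu u1 n1 s1 i1 w1 q1 sq1 st1 _ <<< "${before[i]}"
    read -r _   u2 n2 s2 i2 w2 q2 sq2 st2 _ <<< "${after[i]}"
    tot1=$((u1+n1+s1+i1+w1+q1+sq1+st1)); idle1=$((i1+w1))
    tot2=$((u2+n2+s2+i2+w2+q2+sq2+st2)); idle2=$((i2+w2))
    dt=$((tot2-tot1)); di=$((idle2-idle1))
    (( dt > 0 )) && echo "$cpu: $(( 100*(dt-di)/dt ))% busy"
done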

Marctraider commented 2 years ago

Ah, I might have found an issue:

reflector_ping_interval=0.005 gives 10-15% CPU usage, while 0.01 / 0.010 gives 50%.

So odd values like 0.015 and 0.005 seem to work as expected? Weird!

Update: Nope, it's not that. After restarting the script several times while changing this value, it appears to happen at random (50% for the ping process), but not always? WEIRD. Maybe a gentler interval like 0.02 just makes it not happen, with it becoming increasingly likely to happen as you push the interval further down?

Just to clear things up: having other ping processes from other scripts running in the background should be of no concern, right?

moeller0 commented 2 years ago

Just to clear things up: having other ping processes from other scripts running in the background should be of no concern, right?

Well, you only have so many CPUs and CPU cycles available; the more processes compete for the CPUs, the more waiting will happen. If ping resorts to busy waiting, I could make myself believe that might cause such issues ;)

Marctraider commented 2 years ago

There's no issue with available CPU capacity in my case though. This definitely leans toward some bug in the script still. Just curious why it happens randomly on restart of the service.

lynxthecat commented 2 years ago

It looks like you may be right @moeller0:

root@OpenWrt:~/CAKE-autorate# time /usr/bin/ping -c 1000 -i 0.02 192.168.1.2 > /dev/null
real    0m 20.62s
user    0m 0.06s
sys     0m 0.21s
root@OpenWrt:~/CAKE-autorate# time /usr/bin/ping -c 1000 -i 0.01 192.168.1.2 > /dev/null
real    0m 10.14s
user    0m 1.95s
sys     0m 4.75s

Here is the same with fping:

root@OpenWrt:~# time fping -c 1000 -p 20 192.168.1.2 > /dev/null

192.168.1.2 : xmt/rcv/%loss = 1000/1000/0%, min/avg/max = 0.960/2.08/8.84
real    0m 19.98s
user    0m 0.31s
sys     0m 0.00s
root@OpenWrt:~# time fping -c 1000 -p 10 192.168.1.2 > /dev/null

192.168.1.2 : xmt/rcv/%loss = 1000/1000/0%, min/avg/max = 1.000/2.05/6.99
real    0m 10.11s
user    0m 0.11s
sys     0m 0.18s

I will try the same on my system.

Did you try this? I thought we identified the cause as above?

See, check this out:

root@OpenWrt:~# time /usr/bin/ping -c 1000 -i 0.01 192.168.1.2 > /dev/null
real    0m 11.11s
user    0m 2.23s
sys     0m 5.15s
root@OpenWrt:~# time /usr/bin/ping -c 1000 -i 0.015 192.168.1.2 > /dev/null
real    0m 15.79s
user    0m 0.10s
sys     0m 0.21s
root@OpenWrt:~# time /usr/bin/ping -c 1000 -i 0.005 192.168.1.2 > /dev/null
real    0m 5.03s
user    0m 0.96s
sys     0m 2.80s

The bash implementation uses iputils-ping, and thus if iputils-ping eats up more cycles at certain rates, so too will my script, multiplied by the number of reflectors. I think 'fping' does not suffer from this phenomenon in the same way.

Marctraider commented 2 years ago

Mhh, I will try the fping branch!

lynxthecat commented 2 years ago

If it helps then you can easily make the substitution in the main branch - just swap out the ping call for the corresponding fping call:

https://github.com/lynxthecat/CAKE-autorate/blob/main/CAKE-autorate.sh#L153

The fping approach in the fping branch was just an experiment; it lacks control over individual reflectors and, in my testing at normal rates, wasn't significantly less computationally expensive. So I didn't make the transition to that approach.

I could still be convinced to swap out iputils-ping for fping in main though, if fping performs better. That is, one instance of fping per reflector rather than one instance of iputils-ping per reflector. Retaining control over individual reflector streams seems important.
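
Something like this, perhaps (untested; fping's output fields differ from iputils-ping, so the parsing regex would need adjusting to match):

done < <(fping -D -l -p 20 $reflector & echo $! >/tmp/CAKE-autorate/${reflector}_ping_pid)

Here -l loops indefinitely, -p 20 sets a 20ms period (i.e. 0.02s) and -D prints epoch timestamps much like iputils-ping's -D.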

Actually the main branch has major changes in it not present in the fping branch that I think you might benefit from. For example, I have now set it up so that the rx and tx loads are monitored asynchronously from the processing of ping results. This should help you, because I don't think in your case of extremely high ping rates we should be working with loads calculated between ping responses. Does this make sense?

I'd really like to know how you get on with the latest in the main branch given all the recent changes.

lynxthecat commented 2 years ago

Hey @Marctraider are you still using this? I wonder how you've been getting on if so.

I use it 24/7, and for a long while now I have just been tweaking very minor aspects of the code.

For my connection and setup the code works well enough for everything I need.

Marctraider commented 2 years ago

@lynxthecat Hey there! Sorry I've been AWOL due to IRL stuff!

The plan was to run the script 24/7, but upon encountering some weird behavior that I have not yet managed to pin down, I've delayed the integration of the script into my network infrastructure.

The problem was as follows: upon activating the script (which seemingly functions well), I got erratic ping spikes from my other scripts (the basic interface-logging ping instances), which started jumping around for no good reason while the connection was mostly idle.

I'm not sure why. Let me test again and get back to you (in case I might be able to solve it).

One thing that can be excluded, at the very least, is insufficient CPU capacity or bandwidth.

It's also unclear for now whether the connection really starts being affected negatively, or whether it is isolated to the script or ping sessions only. I suppose simply pinging from outside the router (from a random client) will easily shed some light on that question.

lynxthecat commented 2 years ago

Many thanks. Your feedback is appreciated because it helps to robustify this script (which seems to have received a fair amount of attention).

Marctraider commented 2 years ago

Currently one of my links appears deteriorated; I want to test in a different setting, so I'll check back tomorrow. Ugh!

Marctraider commented 2 years ago

@lynxthecat

The script works fine now, and when idle I see none of the weird behavior I was seeing a while back. CPU usage is very nice as well.

Just some remaining questions/features:

Otherwise, no issues found!

lynxthecat commented 2 years ago

Thanks a lot for testing @Marctraider. I really appreciate it. The code seems pretty settled now.

If you just set one reflector in the array it will simply hit this line:

https://github.com/lynxthecat/CAKE-autorate/blob/main/CAKE-autorate.sh#L282

And keep using that reflector. I don't think you would need to adjust anything else.

I will add in the idea about the ICMP payload size. Can you let me know what you use and why you changed it?

@moeller0 I know I have asked this before but I forgot where we left it regarding ICMP payload size? I will set the default to whatever ping uses by default. Is there any sense in which we should calculate this automatically?

Marctraider commented 2 years ago

Pretty settled indeed!

The reason I can think of is bandwidth preservation and simple efficiency. (56 bytes seems to be the default; I believe it can easily be dropped down to 20, which is less than half the bandwidth wasted.)

I assume it would also be very easy to implement; I suppose it only requires a variable in the ping parameters and a configuration variable in config.sh?

Doesn't seem to increase code complexity and could just be an optional override parameter.
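
Something along these lines, presumably (icmp_payload_bytes is a made-up name; -s is the standard iputils-ping flag):

# config.sh
icmp_payload_bytes=56 # iputils-ping default data size

# ping call in CAKE-autorate.sh
ping -D -s $icmp_payload_bytes -i $reflector_ping_interval $reflector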

lynxthecat commented 2 years ago

Seems to make sense to me (in terms of simplicity of implementation).

Does this translate to a reduction in bandwidth use, though? I mean, aren't the bytes encapsulated into a larger transmission unit (padded if necessary) and then sent? Perhaps there is a post-compression or WireGuard-encapsulation effect? I'm hazy on this.

https://en.wikipedia.org/wiki/Ethernet_frame#Payload

https://networkengineering.stackexchange.com/questions/34189/minimum-ethernet-frame-is-64-bytes-why-the-payload-must-be-padded-to-at-least-4
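
If I have the framing right, the back-of-the-envelope is: the default ping is 56 + 8 (ICMP) + 20 (IPv4) = 84 bytes, well above Ethernet's 46 byte minimum payload, while -s 16 gives 44 bytes, which gets padded to 46 on the wire; so the saving is real but bounded. Inside WireGuard, the inner packet is padded to a multiple of 16 and wrapped in roughly 60 bytes of outer headers (IPv4 + UDP + WireGuard), giving approximately 96 + 60 = 156 vs 48 + 60 = 108 bytes per probe, so the saving shrinks but does not disappear.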

Marctraider commented 2 years ago

Hrm, frankly I had not thought of that.

I will do some wireshark/tcpdump tests later on this.

Marctraider commented 2 years ago

Well, I've been having some big networking issues; namely, my dedicated server's network card seems to have started crapping out on me (they finally fixed it yesterday, but I'd been trying to debug my poor connection for days now).

I've been reworking my whole infrastructure, and once that's done I will do some retesting with the script, as it was not behaving properly. Ugh!

Marctraider commented 2 years ago

I have another issue which requires resolving, before I can make the script behave well.

Currently, for unexplained reasons, my tunnel (wg0) gets 'choked' by approx. 30 Mbit regardless of what bandwidth limit I set on my cake qdisc, even before the script is involved. My total bandwidth with iPerf3 is 180 Mbit, but once I apply SQM to the interface (even with a cap of 250 Mbit or higher) the total bandwidth gets capped to around 150 Mbit/s.

I also suspect the tunnel is slightly deteriorated.

Why this happens is unknown at the present moment; investigating...

lynxthecat commented 2 years ago

Any ideas @moeller0? @Marctraider are we sure this is not CPU saturation, given the way CAKE seems to rely on a single core? Sometimes 'irqbalance' helps with CPU-related issues like this.

Marctraider commented 2 years ago

Mhh, I'll see if swapping around IRQs manually will do the job.

Edit: Yeah, my bad, it looks like a CPU issue indeed. But not to worry, I will then run cake on my dedicated server endpoint, which has a 3.3GHz i5-2600K 4-core, versus a 2GHz Celeron 2-core. That seems to fix it!

I suppose it's better to pace the incoming bandwidth from my endpoint anyhow, rather than at the very last link!

lynxthecat commented 2 years ago

Thanks for the update. Any further developments on whether lowering packet size makes a tangible difference?

Marctraider commented 2 years ago

No, I think I'll leave that request for now. I'm not experienced enough with inner- and outer-layer packets to get a definitive answer on this one.

Bandwidth usage shouldn't really be much of a problem these days, and the sleep function should solve exactly that.

Thanks though!

lynxthecat commented 2 years ago

I think CPU usage has been brought down to acceptably low levels, and significant further improvement would require switching away from bash. But anyone reading this: feel free to reopen if you can see any further area for improvement.