lynxthecat / cake-autorate

Eliminate the excess latency and jitter terrorizing your 4G, 5G, LTE, Starlink or other variable rate connection!
https://www.bufferbloat.net
GNU General Public License v2.0

High CPU usage for CAKE-autorate.sh #17

Closed richb-hanover closed 2 years ago

richb-hanover commented 2 years ago

I'm starting a new issue from #16 that began life as a trouble report, but morphed into observations about performance.

At @lynxthecat's suggestion, I adjusted defaults.sh to slow the ping rate using:

ping_reflector_interval=0.2 # (seconds)
main_loop_tick_duration=1000 # (milliseconds)

It did improve the utilization: Here's a video of htop while idle, then under load, and then recovering from the load. https://youtu.be/pUZ7MpN85K4

PS I can't really tell whether this code addresses latency. I have a fixed-speed fiber link through my ISP at 25/25 Mbps, so my link speed doesn't vary much, and I just cannot say whether it makes a difference.

lynxthecat commented 2 years ago

@dim-geo any thoughts about possible performance optimisations in this bash implementation (my most sophisticated yet)? It gives the best performance for me out of all the branches on my LTE connection.

I'm asking you because you pushed those awk enhancements to the original shell script.

The main processing aspects of this approach are in the main loop inside 'CAKE-autorate.sh' and the tail loop inside 'monitor_reflector_path.sh'.

richb-hanover commented 2 years ago

I ask the same question as I asked about the Lua implementation: Where's the Storage?

That is, I observe that CPU utilization and load average jump up after a burst of traffic, but that they decay gradually back to a baseline over a minute or more.

Why is that? I would have expected that the utilization/load average would drop back to baseline instantly, or at worst, within a second after the traffic ceases.

What keeps the load average high for a period of time? Where does the "work" come from? Are there ping packets queued up in the network stack (or - Horrors - in buffers?) that take time to "work off"? Is there some other processing that gets deferred?

Note: This may be a different question from "optimizing the code". (Hypothetical example) If the code queues three packets per second, but only processes one packet per second, even infinitely fast code won't solve that problem.

Thanks.

lynxthecat commented 2 years ago

BTW, confirming overall efficacy: other than myself, at least one other user (Marctraider) sees good results with this bash implementation, and his use case is pretty challenging - see here:

https://github.com/lynxthecat/sqm-autorate/issues/13#issuecomment-1060008177

The video you sent still shows very high CPU utilisation, does it not? I mean, it seems to show CPU @ 100%. Is that related to the bash script processes? I can't see the percentage occupied by the bash script processes.

The bash implementation should not equate to more processing under higher load. In terms of its mechanics: for each reflector it sets up a ping process with interval 'ping_reflector_interval' that outputs to a file, and that file is read in with a 'tail -f'. Each incoming ping result is processed, which involves maintaining an array of deltas, touching a delay_file if a sufficient number of samples in the delta array are greater than a threshold, and updating a baseline; then that ping file is reset. And there is a main loop that runs every 'main_loop_tick_duration' and checks the load and how many reflectors are delayed to decide how to change the CAKE bandwidth.
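
As a rough sketch of that flow for a single reflector (my illustration, not the actual script - the delta/baseline and CAKE logic are elided into comments; the real loops are linked just below):

#!/bin/bash
# Sketch of the described flow: one pinger writing to a file, a monitor reading
# it back with 'tail -f' and resetting it, and a rate-setting main loop.
reflector=1.1.1.1
ping_file=/tmp/ping_${reflector}

ping -i 0.1 $reflector > $ping_file&

tail -f $ping_file | while read ping_line
do
        # maintain the delta array and baseline; touch a delay file if enough
        # samples in the delta array exceed the threshold (elided here)
        > $ping_file   # reset the file so it cannot grow
done&

while true   # main loop, once per main_loop_tick_duration
do
        # check load and number of delayed reflectors; adjust CAKE bandwidth (elided)
        sleep 0.2
done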

Tail loop is here:

https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/monitor_reflector_path.sh#L71

Main loop is here:

https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/CAKE-autorate.sh#L158

So there isn't actually all that much processing going on. I suspect the 'tail -f' aspect could be optimized somehow, but that would seem to require advanced knowledge of how bash works with pipe buffers etc. The use of 'tail -f' and resetting the file is a hack to avoid the processing of ping results lagging behind the incoming ping results, or the file growing infinitely.

It does stop the ping processes if the baseline has been coasting for 60s, and it then restarts them when the load picks up again. There should be no pings, and reduced CPU utilisation, in this 'sleep state'.

As a thought experiment, imagine:

ping -i 0.1 1.1.1.1 | while read ping_line; do echo $ping_line; sleep 1; done

The processing on the right is slower than the processing on the left, so the output will lag behind. I wanted to avoid this, hence I set up ping to output to a file, with the file read in with 'tail -f' and reset after each read, to ensure the processing does not lag behind and the file does not grow. This is a hack, and perhaps it can be optimized. For example, perhaps it should reset the file every X runs, but I am not sure. On my own router I am not seeing any skipped pings with the reset implemented as is.

richb-hanover commented 2 years ago

The processing on the right is slower than the processing on the left, so the output will lag behind. I wanted to avoid this, hence I set up ping to output to a file, with the file read in with 'tail -f' and reset after each read, to ensure the processing does not lag behind and the file does not grow. This is a hack, and perhaps it can be optimized.

I'm glad to see you have thought about this. That's the kind of thing I was suggesting... Could it be happening anywhere else?

Marctraider commented 2 years ago

Not sure if it matters, but the router in question is an Intel NUC with an Intel(R) Celeron(R) J4005 CPU @ 2.6GHz (turbo). It must be noted, though, that it runs at 2.6GHz all the time, with most power saving disabled. Not sure if that skews results.

Idle: [screenshot]

Max Load: [screenshot]

I see the CPU usage is indeed quite high under load for the script, but personally I'm not too worried, as I still have some breathing room and the router has no other function.

Still, it could definitely see optimization.

lynxthecat commented 2 years ago

@Marctraider as you say there is some breathing room, but as the author of this script, that high load is totally unacceptable to me. Where is this load coming from? @richb-hanover I am perplexed because, for the reasons I gave above, I cannot see why the script's work would increase under high load (save in comparison with the sleep state, where there are no incoming ping results). The ping intervals and the main loop interval are fixed, so why would the script occupy more CPU cycles under high load?

@Marctraider can you confirm the low load also occurs when the ping processes are active? That is, if you just run a Google internet speed test, then stop it, then wait 10 seconds, do you still see high CPU usage?

By the way, I have 'irqbalance' enabled. That may make a big difference here by spreading things over the cores.

Please can you try uncommenting this line:

https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/functions.sh#L25

And let me know whether the sleeps are always positive and what values they take? They are always positive for me, and pretty sizeable, even with pings every 0.1s, 4 reflectors, and a main loop duration of 200ms.

Marctraider commented 2 years ago

I usually manually assign interrupts and process affinities (either through taskset or through the CPUAffinity variable in systemd files); I have never managed to get irqbalance to work - somehow it's not acting on anything.

I actually think the OP is right:

[screenshot]

There's so much CPU usage that it overflows to the second core (basically 1.5x load on my 2 cores), which is also problematic for me, as at full load the userspace app dealing with the redundant connection is also almost fully saturating a core. So I often go over my CPU budget in certain situations.

This high CPU usage stays for at least 30-60 seconds after a bandwidth test until it stabilizes back to a few %.

Also strange: I get good performance with iperf3 regardless, but not with OpenSpeedTest. This might be because there are fewer interrupts going on, as the local interface isn't used and the router is iperfing directly to my VPS endpoint; when I run an OpenSpeedTest it obviously goes through a local client, saturating more CPU on the router because of the local ethernet interface (eno1) and generating even more interrupt CPU cycles.

lynxthecat commented 2 years ago

Does this test script eat up CPU cycles on your system:

#!/bin/bash

ping -i 0.1 1.1.1.1 > /tmp/ping_output&

tail -f /tmp/ping_output | while read ping_line
do
        echo $ping_line
        > /tmp/ping_output
done

Also, could you try sourcing 'monitor_reflector_path.sh' using:

. ./monitor_reflector_path.sh

And then just manually launch:

ping_reflector 1.1.1.1&
monitor_reflector_path 1.1.1.1&

Does the latter eat up CPU cycles?

Marctraider commented 2 years ago

At first glance, basically nothing to ~1% average, but now that I've run it for 30 or more seconds I start to see the CPU usage of the script climbing up gradually (now at 7%). If this keeps up, I can only conclude it will eventually max out and start to slow down because there are no more CPU cycles left.

(Result from your initial request; I will test the rest after dinner)

lynxthecat commented 2 years ago

Ah great. We have a possible reason for the issue now. So just to confirm: the simple ping/tail test script results in increased CPU usage over time? This is very revealing.

Maybe it is a truncation issue? What happens if you replace the bare '>' reset with:

echo "" > /tmp/ping_output

@richb-hanover or anyone else - why would my test script above result in CPU usage creeping up over time?

richb-hanover commented 2 years ago

Sorry - I only answered some of your questions. And I realize there are new messages since I wrote this (and forgot to press Submit...). I'll look at the new questions later today.

The video you sent still shows very high CPU utilisation, does it not? I mean, it seems to show CPU @ 100%. Is that related to the bash script processes? I can't see the percentage occupied by the bash script processes.

root@LBWO-C7v2:~# top -b -n 1 | head -20
Mem: 54388K used, 69004K free, 1212K shrd, 4504K buff, 15484K cached
CPU:  50% usr  50% sys   0% nic   0% idle   0% io   0% irq   0% sirq
Load average: 1.74 0.65 0.23 1/72 10465
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
19511 19483 root     S     2060   2%  17% bash ./CAKE-autorate.sh
19438 19416 root     S     2060   2%  17% bash ./CAKE-autorate.sh
 1580     1 root     S     4376   4%   0% /usr/sbin/hostapd -s -g /var/run/hostapd/global
 1581     1 root     S     4280   3%   0% /usr/sbin/wpa_supplicant -n -s -g /var/run/wpa_supplicant/global
 2071     1 root     S     4096   3%   0% /usr/sbin/uhttpd -f -h /www -r LBWO-C7v2 -x /cgi-bin -u /ubus -t 60 -T 30 -k 20 -A 1 -n 3 -N 100 -R -p 0.0.0.0:80 -p [::]:80 -C /etc/uhttpd.crt -K /etc/uhttpd.key -s 0.0.0.0:443 -s [::]:443
 1274     1 root     S     2316   2%   0% /sbin/rpcd -s /var/run/ubus/ubus.sock -t 30
19344 26560 root     S     2080   2%   0% bash ./CAKE-autorate.sh
19357 19347 root     S     2060   2%   0% bash ./CAKE-autorate.sh
19382 19369 root     S     2060   2%   0% bash ./CAKE-autorate.sh
10465 19511 root     R     2060   2%   0% [bash]
19416 19344 root     S     2056   2%   0% bash ./CAKE-autorate.sh
19483 19344 root     S     2056   2%   0% bash ./CAKE-autorate.sh
19369 19344 root     S     2052   2%   0% bash ./CAKE-autorate.sh
19347 19344 root     S     2052   2%   0% bash ./CAKE-autorate.sh
 1643     1 root     S     1800   1%   0% /sbin/netifd
    1     0 root     S     1632   1%   0% /sbin/procd
ping_reflector_interval=0.1 # (seconds)
main_loop_tick_duration=200 # (milliseconds)

Marctraider commented 2 years ago

So just to confirm: the simple ping/tail test script results in increased CPU usage over time? This is very revealing.

@lynxthecat Confirmed.

What happens if you replace the bare '>' reset with:

You mean like so?:

#!/bin/bash

ping -i 0.1 10.0.0.1 > /tmp/ping_output&

tail -f /tmp/ping_output | while read ping_line
do
        echo $ping_line
        echo "" > /tmp/ping_output
done

In that case it starts with higher initial CPU usage, which also grows over time (started around 15%, now already at 40%).

Output:

tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
64 bytes from 10.0.0.1: icmp_seq=1857 ttl=64 time=4.79 ms
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated

@richb-hanover

I note that there are nine (9) CAKE-autorate.sh instances running. Is this expected?

The main script calls the other scripts (which in turn run other programs like ping); these are child processes of the parent process, so I think that is to be expected.

lynxthecat commented 2 years ago

Yes @Marctraider, exactly - there are a bunch of child processes created, so multiple entries are expected. I am shocked by even just:

#!/bin/bash

ping -i 0.1 10.0.0.1 > /tmp/ping_output&

tail -f /tmp/ping_output | while read ping_line
do
        echo $ping_line
        echo "" > /tmp/ping_output
done

Resulting in a CPU spike over time. If that does, then the script will also.

Can you confirm that removing the 'echo' line results in CPU usage remaining very low? If so, the question is how to deal with the reset / file-growth problem in a way that does not result in increased CPU usage over time.

Marctraider commented 2 years ago

Yes @Marctraider, exactly - there are a bunch of child processes created, so multiple entries are expected. I am shocked by even just:

#!/bin/bash

ping -i 0.1 10.0.0.1 > /tmp/ping_output&

tail -f /tmp/ping_output | while read ping_line
do
        echo $ping_line
        echo "" > /tmp/ping_output
done

Resulting in a CPU spike over time. If that does, then the script will also.

Can you confirm that removing the 'echo' line results in CPU usage remaining very low? If so, the question is how to deal with the reset / file-growth problem in a way that does not result in increased CPU usage over time.

echo "" > /tmp/ping_output

Results in no growing CPU usage, and very little CPU in general (~0.3% avg)

lynxthecat commented 2 years ago

@Marctraider OK, great. So the question for me now is how to pipe to a file or fifo in bash without the file size growing, whilst skipping pings as necessary.

I have been experimenting with this:

#!/bin/bash

pipe=/tmp/testpipe

mkfifo $pipe

ping -i 0.1 1.1.1.1 > $pipe&

while read ping_line <$pipe
do
        if [ -z $(read -t 0.01) ]; then
                echo $last_line
        fi
        last_line=$ping_line
done

Basically I need some mechanism to ensure that processing does not lag behind incoming pings, and that pings are skipped as necessary.

How is this achieved though?

The problem can be summarized like this. If you pipe ping with interval 0.1s to a while loop with sleep 1s, how do you skip pings as necessary to only ever echo the latest ping? To explain, consider this:

#!/bin/bash

ping -i 0.1 1.1.1.1 | while read ping_line
do
        echo $ping_line
        sleep 1
done

The problem with the above is that processing takes 1s, but pings arrive every 0.1s. How to skip to the last line, though?

This is the problem we need to resolve.
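
A sketch of one possible mechanism (my illustration, under the assumption that ping flushes each output line when writing to a pipe, as iputils ping does, and that bash supports fractional read timeouts): an inner 'read -t' loop drains any lines already queued in the pipe, so only the newest ping gets echoed.

#!/bin/bash
# Sketch: skip to the latest buffered ping before doing the slow processing.
ping -i 0.1 1.1.1.1 | while read ping_line
do
        # drain lines that queued up while we were busy; keep only the newest
        while read -t 0.01 newer_line
        do
                ping_line=$newer_line
        done
        echo $ping_line
        sleep 1   # simulated slow processing, as above
done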

Marctraider commented 2 years ago

Hrm.. I will try and think with you on this, but I'm not the best when it comes to scripting .__.

Ideally you'd want to fetch as many pings as possible without just randomly skipping when not necessary, or you'll lose valuable data points. Something with a counter to see how many are held in the buffer?

Not entirely related to this, but assuming CPU usage is fixed: is it possible to have ping intervals even lower than 0.1 (100ms) with regular ping? Or is that the limit?

lynxthecat commented 2 years ago

Even lower is possible. And since you have your own server, you can go as low as you like without worrying about getting into trouble.

Marctraider commented 2 years ago

Even lower is possible. And since you have your own server, you can go as low as you like without worrying about getting into trouble.

Yeah, certainly. I just recall some ping utilities where 0.1 was the lowest allowed interval, iirc. But I could be mistaken.

So you are saying that, basically, pings stack up faster than the script can handle them, thus starting to buffer and buffer, with more CPU usage as a result?

lynxthecat commented 2 years ago

No, it's just that I am simulating the problem.

#!/bin/bash

ping -i 0.1 1.1.1.1 | while read ping_line
do
        echo $ping_line
        sleep 1
done

The 'sleep 1' is just to simulate a delay in processing the incoming pings. We somehow need to skip to the latest ping available, if that makes sense? Such that we only echo the latest available ping each time. Naturally, so long as the processing is fast enough, we don't need to skip any pings, but we do need to manage the situation in case of CPU spikes.

I hacked around this by resetting the file within 'tail -f', but this appears to give rise to high CPU usage for some reason I do not understand.

lynxthecat commented 2 years ago

Something like this helps:

#!/bin/bash

file=/tmp/test

ping -i 0.1 1.1.1.1 | while read line; do echo $line > $file; done&
sleep 1

while true
do
        cat $file
        sleep 1
done

But I need to replace the while loop with something that executes on new data.

So I'm not sure this solves the problem entirely either.

Marctraider commented 2 years ago

No, it's just that I am simulating the problem.

#!/bin/bash

ping -i 0.1 1.1.1.1 | while read ping_line
do
        echo $ping_line
        sleep 1
done

The 'sleep 1' is just to simulate a delay in processing the incoming pings. We somehow need to skip to the latest ping available, if that makes sense? Such that we only echo the latest available ping each time. Naturally, so long as the processing is fast enough, we don't need to skip any pings, but we do need to manage the situation in case of CPU spikes.

I hacked around this by resetting the file within 'tail -f', but this appears to give rise to high CPU usage for some reason I do not understand.

Possible to clear the file with another command instead?

lynxthecat commented 2 years ago

That's what I tried with the 'echo "" >' reset, but you said the CPU issue remained. Maybe we can try something else?

richb-hanover commented 2 years ago

I'd like to look back to a comment from @Marctraider

Not entirely related to this, but assuming CPU usage is fixed: is it possible to have ping intervals even lower than 0.1 (100ms) with regular ping? Or is that the limit?

As we continue our research, I hope we explore the other direction: how slowly can we send those pings while still maintaining good latency and speed control? Finding a good lower limit to the ping rate would have these advantages:

  • It avoids the "tragedy of the commons",

Do we have any information about how rapidly/how much LTE or Cable Modem speeds vary in real life? Maybe we could collect it ourselves...

We could collect long-term data (over the course of hours? maybe even days?) that record how often the CAKE-autorate algorithm needs to adjust the parameters, and the effect those changes made on latency (or the effect on latency, had those changes not been made...)

Marctraider commented 2 years ago

Truthfully, I think it is almost directly related to how fast your ping interval is, since the script's ability to react to bufferbloat is completely dependent on the ICMP responses going back and forth. The smaller the interval between pings, the faster you can act on that first big bufferbloat 'spike'. Sure, you could accept up to a full second of pure bufferbloat (with an ICMP interval of a full second), but afaik that's not what QoS/SQM is about; the point is to contain it to a minimum.

(For a gamer or something, a full second of bufferbloat could mean a full second of delay, and subsequently warping/hitching for the duration; VoIP would be pretty heavily affected as well. For me personally, I would never use this or any script while gaming either, as I play at a somewhat competitive level. As soon as I run a game on my desktop, my router goes into low-latency mode and I set a very conservative fixed CAKE configuration which is guaranteed to hold my minimum speed.)

What would be the reason that 0.1 is even troublesome? I bet our CPUs (well, most of them) can handle hundreds of pings a second without much trouble. And the payload can be kept to a minimum as well, to keep bandwidth in check.

What you request is basically already in there: lots of options to tweak to minimize ping intervals yet have a decent, stable connection, by sharply decreasing and gently increasing bandwidth until RTT crosses some threshold.

I do like the idea of some auto-calibration; I think the Lua version has some mechanism like that.

lynxthecat commented 2 years ago

I think perhaps the solution may be something along these lines:

#!/bin/bash

pipe=/tmp/test
mkfifo $pipe

ping -i 0.1 1.1.1.1 > $pipe&
sleep 1

while read ping_line <$pipe
do
        echo $ping_line

        dd if=$pipe iflag=nonblock of=/dev/null
done

But that does not work. I can't work out how to 'reset' the fifo buffer.

moeller0 commented 2 years ago

As we continue our research, I hope we explore the other direction: how slowly can we send those pings while still maintaining good latency and speed control?

There are two issues at play here: 1) the ping rate, in terms of not overwhelming reflectors - I think we can solve that by interleaving the pinging of multiple reflectors, which will allow us to keep the individual reflector rate low; 2) the delay in getting new delay estimates, and hence the delay in detecting and responding to congestion.

As a data point, normal unix ping will allow intervals down to 0.2 seconds for non-root users while defaulting to 1-second intervals; that indicates to me that a per-reflector rate somewhere between 0.2-5Hz (5 to 0.2 second intervals) might be acceptable if we select from a sufficiently large set of reflectors. (Or better yet, operate our own reflectors, at which point higher rates are an option - and here I mean that each user of autorate will operate her/his own VPS reflector.)
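
To illustrate the interleaving idea in 1) (a sketch; the reflectors and intervals are illustrative): four reflectors each pinged once per 0.4s, with start times staggered by 0.1s, still yield one delay sample every 0.1s overall while keeping each reflector at a modest 2.5 pings/s.

#!/bin/bash
# Sketch: stagger pinger start times so samples from the reflector pool interleave.
for reflector in 1.1.1.1 1.0.0.1 8.8.8.8 8.8.4.4
do
        ping -D -i 0.4 $reflector > /tmp/ping_${reflector}&
        sleep 0.1   # offset each pinger by a quarter of its interval
done
wait   # leave the pingers running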

For 2) I initially thought once per second would be enough, but @lynxthecat's access link demonstrates that might be a bit coarse and result in relatively long delay spikes... Getting this down to 0.1 seconds seems feasible to me.

  • It avoids the "tragedy of the commons",

I do not think that tragedy actually exists... IMHO this is just a way to dress up "insufficiently strict rule enforcement" in terms that make it appear to be something other than regulatory failure, often pushed by parties that have a vested interest in seeing regulation fail - but I digress. Your point stands though: if the reflectors encounter too high a load, I expect them to rate-limit/de-prioritize ICMP packets (basically enforcing their rules, so not regulatory failure), and the autorate code needs to expect and react to that in a considerate way (changing reflectors and giving the unhappy reflector time to cool off).

Do we have any information about how rapidly/how much LTE or Cable Modem speeds vary in real life? Maybe we could collect it ourselves...

I think to do that one would need to continuously saturate the link for a considerable amount of time and measure the packets traversing the link... Just looking at how often the code triggers will not give the "veridical" rate changes of the link.

lynxthecat commented 2 years ago

I think we want a last-in, first-out buffer that is flushed on every read.

This one issue remains; otherwise everything works well with my bash implementation.

Specifically this form:

#!/bin/bash

ping -i 0.1 1.1.1.1 > /tmp/ping_output&

tail -f /tmp/ping_output | while read ping_line
do
        echo $ping_line
        > /tmp/ping_output
done

is what is presently implemented. It works, but results in increasing CPU usage with time for some reason I do not understand.

I tried:

#!/bin/bash

> /tmp/ping_output

stdbuf -oL ping -i 0.1 1.1.1.1 > /tmp/ping_output&

tail -f /tmp/ping_output | while read ping_line
do
        while read -t 0.01 last_line; do ping_line=$last_line; done
        echo $ping_line
done

But this also fails. I am rather stumped by this.

If a fifo can be relied upon to block ping whilst we are processing and not reading, I suppose that may work too.

moeller0 commented 2 years ago

Again, why not simply run ping in single-shot mode (-c 1) and restart it each time? On an admittedly not-too-weak dual-core ARM A9 router I see:

root@turris:~# time /usr/bin/ping -D -c 1 192.168.42.201
PING 192.168.42.201 (192.168.42.201) 56(84) bytes of data.
[1646741801.236544] 64 bytes from 192.168.42.201: icmp_req=1 ttl=64 time=0.321 ms

--- 192.168.42.201 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.321/0.321/0.321/0.000 ms
real    0m 0.00s
user    0m 0.00s
sys     0m 0.00s

so, if the rounding is correct, that's less than 5-10 milliseconds of "cost" for setting up, calling, and executing ping (to a local target, so we see the rest of ping's static cost better); so with a desired interval of 100ms there is no need to keep ping running...

EDIT: would be nice to get similar measurements for less beefy routers.
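
For reference, a sketch of the one-shot alternative being proposed (illustrative, not from the repo): fire a fresh 'ping -c 1' on every tick, so no long-running ping process or file/fifo plumbing is needed at all.

#!/bin/bash
# Sketch: one-shot pings on a fixed tick instead of a persistent ping process.
while true
do
        # -c 1: single probe; -W 1: give up on a reply after 1 second;
        # backgrounded so a slow reply cannot delay the next tick
        ping -D -c 1 -W 1 1.1.1.1 | grep time=&
        sleep 0.1
done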

lynxthecat commented 2 years ago

Perhaps I should revert to this. But I haven't quite given up yet.

Seems I've hit a self-imposed brick wall, because I stubbornly want the 'tail -f' / ping or fifo block/unblock approach to work. So near, yet so far.

lynxthecat commented 2 years ago

Set against my obsessive nature, since an intellectually satisfying and elegant solution escapes me, I have put the questions here:

https://forum.openwrt.org/t/for-bash-experts-pipe-buffering/122119/24

and here:

https://stackoverflow.com/questions/71396520/capture-latest-write-to-file-fifo-buffer-and-then-flush-the-file-fifo-buffer

Hopefully something will come up, failing which we can fall back to the one-shot ping approach.

lynxthecat commented 2 years ago

@Marctraider does this result in CPU increasing with time for you:

#!/bin/bash

pipe=/tmp/test
mkfifo $pipe

ping -D -i 0.1 1.1.1.1 > $pipe&

cat $pipe > /dev/null&

while true
do
        head -1 $pipe
done

This may be a possible solution.

lynxthecat commented 2 years ago

@richb-hanover @Marctraider I have pushed new commits to 'CAKE-autorate' based on the bash named pipe solution - does it resolve the CPU usage issue? Now ping results are output to a bash named pipe (fifo), which is continually flushed to /dev/null, but individual values can still be pulled out of the fifo using 'head -1' in a while loop. I hope this resolves the CPU issue associated with 'tail -f' + truncation.

I retested single-shot pings, but calling ping repeatedly in one shot mode also results in a CPU usage spike on my RT3200.

Marctraider commented 2 years ago

Looks like this fixes it - I see at most 2-5% usage in all situations, both fully loaded and idle.

Great! I will keep testing over the upcoming days for oddities, but it looks like you nailed it. 👍

richb-hanover commented 2 years ago

Alas... Even with this new version, I still see high usage. Here's a YouTube video showing a few percent when the router is idle (and not running CAKE-autorate.sh).

Starting it consumes ~100% of the CPU, and the load average keeps creeping up.

https://youtu.be/jqqe4PhySJs

Would it help to get an Archer C7 for your own testing? (I could think about ways to make that happen.)

Marctraider commented 2 years ago

[screenshot]

A few hours into the script running, having done all sorts of things in the meantime, including high-bandwidth streams (4K, 100+ Mbps).

I will run it overnight, do some more testing, and see if it still behaves after that.

moeller0 commented 2 years ago

Starting it consumes ~100% of the CPU, and the load average keeps creeping up.

Interesting - it appears mostly to be SIRQ (if I recognized the color of the bars correctly) and is not reflected in a large amount of CPU% assigned to either bash or the ping processes themselves. I wonder whether that is related to the FIFO, or to other things the script might do?

In case you repeat that test, could you also get the output of 'tc -s qdisc' while the script is running, as well as the output of 'df -h' (to see whether /tmp might be too large)? Finally, it would be interesting to see a speedtest with the script running... I assume that will be atrocious.

lynxthecat commented 2 years ago

Also, I have made a few more significant changes given the outcome of the discussion at:

https://forum.openwrt.org/t/for-bash-experts-pipe-buffering/122119/

Now the while loop for each reflector actually keeps up with every ping (not every other ping), so I fear this may actually result in a little more CPU usage, but the increase is associated with actually processing more of the ping results (it should catch all of them, which can be verified by uncommenting the 'echo ping line' in 'monitor_reflector_path').

@Marctraider it would be good to verify that the creeping CPU usage effect has still been ameliorated.

I am hoping there is some low-hanging fruit in terms of getting CPU usage down, but I am operating at my limit in terms of bash scripting - it already has quite a few fancy features such as background process management, fifo use, etc.

Do any of you see any such low-hanging fruit offering easy reductions in CPU usage? @dim-geo mysteriously helped out with my original 'ash' script and got CPU usage down, so perhaps he has some ideas for this new bash implementation in the 'CAKE-autorate' branch.


@richb-hanover thanks very much for the kind offer about the router. I am definitely keen to try to optimise this but maybe let's see first if we can get it down with a few easy fixes on the code.

And thank you for taking the time to make that video. It certainly looks bad, but I am confused about why the CPU usage shows 100% when the CAKE-autorate process does not show up with 80% or so. What sort of processor / kernel version does this router use, I wonder? Is there a way we can profile the CPU usage to see what is actually eating up the CPU cycles?

About the Archer C7 and SIRQ, is this thread relevant:

https://forum.openwrt.org/t/archer-c7-5ghz-performance-sirq-99/63104/19

See especially this:

https://forum.openwrt.org/t/archer-c7-5ghz-performance-sirq-99/63104/13?u=lynx

The sirqs are almost certainly being generated by network TX or RX. You can see by

cat /proc/softirqs

Do it before and after an iperf run, copy/paste the two outputs into an Excel sheet using the text import paste function, and check the difference to see where they're being generated. Depending on the direction of your iperf transfer, most of the softirqs will likely be in either NET_TX or NET_RX.

So could it literally just be the pings that are causing the CPU usage?

Could you test what happens when just launching 4 concurrent background pings with a 0.1s interval?
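
Something like this would isolate the raw cost of the pings themselves, with all script processing removed (a sketch; the reflectors are illustrative):

#!/bin/bash
# Sketch: 4 concurrent 0.1s pingers and nothing else, to measure ping cost alone.
for reflector in 1.1.1.1 1.0.0.1 8.8.8.8 8.8.4.4
do
        ping -i 0.1 $reflector > /dev/null&
done
wait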

Marctraider commented 2 years ago

So, 12+ hours later, the script still only consumes a few % CPU - no increase in load over time. (The version from yesterday.)

I haven't checked the new one yet.

The Archer C7 seems to have a pretty weak CPU and hardware in general, so I don't expect it to have as low a load as I do, but 100% seems excessive. Since I also don't see CPU usage per process in the video, maybe the whole CPU usage readout is buggy? Or it could indeed be somehow hardware/software/network related.

Marctraider commented 2 years ago

[screenshot]

lynxthecat commented 2 years ago

@Marctraider I am very grateful to you for confirming. How is the performance in terms of rate control for you? For my part, I am very happy with the way it is handling my CAKE rate (which spans from about 20 to about 70 Mbit/s). Are you using 4 samples, with 1 sample counting as reflector delayed, and a 10ms threshold, or have you tweaked these further?

The new version will likely perform slightly better because the ping frequency (actually processed) is higher, and I now interleave pings so they are better spread out. I expect CPU utilisation to increase a little, but hopefully it will remain low. And I certainly don't want to see the CPU creeping up over time like it did before with 'tail -f' + truncation. BTW, on your system is /tmp just ordinary RAM? I am still curious what caused that CPU usage creep.

Also, is your load constant or does it stop? I have been playing around with the ping sleep function. Now, when the base rate is maintained for X seconds, the pingers go to sleep and the minimum rate is set. You can see that the pingers have gone to sleep using 'ps | grep ping' - you will see 'T' rather than 'S' - or you can use tcpdump and inspect the interface.

I really think with some optimisation/robustification this is looking very promising as a generic solution for all.

Marctraider commented 2 years ago

You're very welcome.

I haven't done any extensive tweaking and min/maxing, but I'm lucky to have a practical scenario where I can literally drop my internet speed at will, so I can basically 'simulate' a sudden drop in available bandwidth without 'switching' connections - which is really nice and will definitely help with tweaking latency spikes.

So far I'm really happy with its performance and results. Anywhere between 5 and 10% CPU is still acceptable for me, so in my book I'm fine with some additional cycles being used if that means more optimization.

As for /tmp, it's the Debian default (so I assume just disk). 'df -h' does not indicate a mountpoint directly at /tmp, so I assume it just goes under / (maybe this was also a problem? I assume memory access is faster; on the other hand, I assume the OS just buffers/caches this and doesn't continuously read/write to disk either way).

root@redundant:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            1.9G     0  1.9G   0% /dev
tmpfs           375M  724K  374M   1% /run
/dev/mmcblk0p2   28G  9.4G   17G  37% /
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/mmcblk0p1  511M  3.5M  508M   1% /boot/efi
/dev/sda1       110G   24K  104G   1% /mnt/backup
tmpfs           375M     0  375M   0% /run/user/0

As for load, it looks like I see two processes of CAKE-autorate.sh at an average of 1% and 2% continuously (on a baseline 175Mbit qdisc) right after a bandwidth test. Shortly after that (assuming the pings stop) I just see one process at roughly 1-2%.

I see now that after a while the tc qdisc goes to the set minimum of 90Mbit - this is correct, yes?

I will replace the script with the new version tonight.

lynxthecat commented 2 years ago

It would be good to verify that /tmp/CAKE-autorate is on RAM for most users. If it sometimes ended up on a disk like an HDD, I imagine that might hurt performance a little.

Concerning the minimum: yes. It is designed to go to sleep providing the base load is maintained for ping_sleep_thr seconds - it stops the pinger processes and sets the bandwidth to the minimum (for safety, to deal with any step loads). Then, when the load hits the load threshold, it turns the pingers back on, does not change anything until the next loop, and then allows decaying back to base or, if the load is sustained, increasing at the normal set rate for load increases. The logic for this made my head spin last night, and the main loop inside CAKE-autorate is now a little complicated. It may be that it can be simplified to help speed up at least the main loop process.

I just feel better sleeping at night knowing my router isn't firing off pings and pointlessly maintaining baselines and processing delays, etc when the connection is not even in use.
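
As an aside, the stop/continue mechanics can be poked at by hand with something like this (my illustration, not the script's code; the variable name is mine) - it produces the 'T' (stopped) state visible in the ps output below:

#!/bin/bash
# Sketch: suspend and resume a pinger with job-control signals.
ping -D -i 0.1 1.1.1.1 > /dev/null&
pinger_pid=$!

kill -STOP $pinger_pid   # pinger sleeps: no ICMP sent, 'ps' shows state 'T'
sleep 5                  # ...link idle, bandwidth held at the minimum...
kill -CONT $pinger_pid   # load returned: pinger resumes, 'ps' shows 'S' again
sleep 2
kill $pinger_pid         # clean up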

This is what you should see (going from sleep to active):

root@OpenWrt:~/CAKE-autorate# ps |grep ping
20171 root      1248 S    grep ping
31825 root       764 T    /usr/bin/ping -D -i 0.1 1.1.1.1
31838 root       764 T    /usr/bin/ping -D -i 0.1 1.0.0.1
31856 root       764 T    /usr/bin/ping -D -i 0.1 8.8.8.8
31870 root       764 T    /usr/bin/ping -D -i 0.1 8.8.4.4
root@OpenWrt:~/CAKE-autorate# ps |grep ping
20536 root      1248 S    grep ping
31825 root       764 S    /usr/bin/ping -D -i 0.1 1.1.1.1
31838 root       764 S    /usr/bin/ping -D -i 0.1 1.0.0.1
31856 root       764 S    /usr/bin/ping -D -i 0.1 8.8.8.8
31870 root       764 S    /usr/bin/ping -D -i 0.1 8.8.4.4

The new version handles the fifo buffer slightly differently so I am hoping it is still OK. It also interleaves pings. And on my router I am seeing every ping line captured with 4 reflectors and 0.1s pings, so it looks good.

I have spent quite a bit of time dealing with the problem of router CPU spikes (e.g. owing to a VPN or something else) - ideally, ping results are skipped if the processing does not keep up, so that it still works with the latest values and not old values. There may be a need to apply a 'dd flush' on the fifo at certain points, but I am not sure yet.

lynxthecat commented 2 years ago

Can you see any optimizations for this loop:

https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/monitor_reflector_path.sh#L52

Since this is where a lot of the processing is done.

lynxthecat commented 2 years ago

By the way:

echo $(date +%s.%N) $ping_line

here:

https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/monitor_reflector_path.sh#L73

Can be used to verify that the pings are being processed in the right way by inspecting the ping lines and comparing timestamps against the system timestamps. The times should be fairly in sync.
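
A standalone sketch of that check (assuming iputils' -D epoch timestamps): prepend the processing time to each line; if processing keeps up, the prepended time and the bracketed reply timestamp should differ by only a few milliseconds.

#!/bin/bash
# Sketch: compare when each reply arrived (-D timestamp) with when we processed it.
ping -D -i 0.5 -c 10 1.1.1.1 | while read ping_line
do
        echo $(date +%s.%N) $ping_line
done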

Marctraider commented 2 years ago

root@redundant:~# ps
    PID TTY          TIME CMD
 695957 pts/1    00:00:00 bash
 716593 pts/1    00:00:00 ps
root@redundant:~#

No results for pings, but that is not right, as I can clearly see the pings going in 'top'; also, my own pings from another script are running. But I'm sure it works fine - I could indeed check with tcpdump, but there are also other pings to the same gateway ongoing at the moment, lol. (Unless this is only visible with the latest branch update.)

I can't see anything in the script to optimize at the moment, but again the usage is really, really low, and my script runs at low intervals at the moment.

richb-hanover commented 2 years ago

Field report for https://github.com/lynxthecat/CAKE-autorate/commit/6069321f5cbeaf577e42b42a6768c1335fefeb3c on my Archer C7v2...

Using reflector_ping_interval=0.1:

Using reflector_ping_interval=0.2:

Also: during the download phase of betterspeedtest.sh, the script continually writes lines like this, over and over...

WARNING: encountered response from [ 8.8.8.8 ] that is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.4.4 ] that is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] that is > 500ms old. Skipping.
WARNING: encountered response from [ 1.1.1.1 ] that is > 500ms old. Skipping.

This did not happen during the upload phase of the speedtest.

lynxthecat commented 2 years ago

Thank you for testing, @richb-hanover. The skipping occurs when the main processing loop cannot keep up with the rate of incoming data. It seems like, with the Archer C7, even just ordinary traffic at your line rate causes huge CPU spikes. Did this also occur with the ping interval set to 0.2? How about reducing the reflectors to just 2 and using 0.5?

Do you have a sense of how these different scripts you are testing are performing in terms of controlling CAKE bandwidth (i.e. offering bandwidth when needed but preventing significant bufferbloat?).

richb-hanover commented 2 years ago

Thank you for testing, @richb-hanover. The skipping occurs when the main processing loop cannot keep up with the rate of incoming data. It seems like, with the Archer C7, even just ordinary traffic at your line rate causes huge CPU spikes.

I'm not sure that's the explanation. I changed the code to display $t_start and $other_value, and got these results:

WARNING: encountered response from [ 1.1.1.1 ] 1647651329344444 1647651328837871 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647651329425467 1647651328909500 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647651329628267 1647651329110978 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] 1647651329815810 1647651329279158 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647651329821885 1647651329309592 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.1.1.1 ] 1647651329960633 1647651329436894 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647651330065726 1647651329515942 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.4.4 ] 1647651330078407 1647651329572759 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] 1647651330201532 1647651329677802 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.4.4 ] 1647651330924455 1647651330375886 is > 500ms old. Skipping.

The differences are greater than 500,000, but only slightly. However, I never see 500+ msec lag in the router - the added latency is generally 5-10 msec.

NB: If I create a two-second blast of data, 10 of the '> 500ms old. Skipping.' lines get generated when the ping interval is 0.2.

Did this also occur with the ping interval set to 0.2? How about reducing the reflectors to just 2 and use 0.5?

No. Neither setting the ping interval to 0.2 nor using only two reflectors generates those messages.

Actually, not quite true. Sometimes, either of those combinations causes a single Skipping message to be generated.

Do you have a sense of how these different scripts you are testing are performing in terms of controlling CAKE bandwidth (i.e. offering bandwidth when needed but preventing significant bufferbloat?).

I can't say. (I have a fixed speed at 25 Mbps - the latency always stays between 7 & 15 msec.)

richb-hanover commented 2 years ago

Another thought: the deltas between the values are all slightly greater than 500,000 (Excel is my friend...).

The third column is the delta. If that value is (for some unexplained reason) too large by 500,000, these might be reasonable latencies: 6.57 msec, 15.96 msec, 17.29 msec, etc.

<wild speculation>Does that provide a hint? Could one of the values be coming from a ping sent a half-second before?</wild speculation>

1.64765E+15 1.64765E+15 506570
1.64765E+15 1.64765E+15 515960
1.64765E+15 1.64765E+15 517290
1.64765E+15 1.64765E+15 536660
1.64765E+15 1.64765E+15 512290
1.64765E+15 1.64765E+15 523740
1.64765E+15 1.64765E+15 549780
1.64765E+15 1.64765E+15 505650
1.64765E+15 1.64765E+15 523730
1.64765E+15 1.64765E+15 548570

lynxthecat commented 2 years ago

Thanks @richb-hanover. There is a glitch here, it seems. Perhaps I need to flock the writes.