@dim-geo any thoughts about possible performance optimisations in this bash implementation (my most sophisticated yet). This gives best performance for me out of all branches on my LTE connection.
I'm asking you because you pushed those awk enhancements to the original shell script.
The main processing aspects of this approach are in the main loop inside 'CAKE-autorate.sh' and the tail loop inside 'monitor_reflector_path.sh'.
I ask the same question as I asked about the Lua implementation: Where's the Storage?
That is, I observe that CPU utilization and load average jump up after a burst of traffic, but that they decay gradually back to a baseline over a minute or more.
Why is that? I would have expected that the utilization/load average would drop back to baseline instantly, or at worst, within a second after the traffic ceases.
What keeps the load average high for a period of time? Where does the "work" come from? Are there ping packets queued up in the network stack (or - Horrors - in buffers?) that take time to "work off"? Is there some other processing that gets deferred?
Note: This may be a different question from "optimizing the code". (Hypothetical example) If the code queues three packets per second, but only processes one packet per second, even infinitely fast code won't solve that problem.
Thanks.
BTW, confirming overall efficacy: besides myself, at least one other user (Marctraider) sees good results with this bash implementation, and his use case is pretty challenging - see here:
https://github.com/lynxthecat/sqm-autorate/issues/13#issuecomment-1060008177
The video you sent still shows very high CPU utilisation does it not? I mean it seems to show CPU @ 100%. Is that related to the bash script processes? I can't see the percentage occupied by the bash script processes.
The bash implementation should not equate to more processing under higher load. In terms of its mechanics: for each reflector it sets up a ping process with interval 'ping_reflector_interval' that outputs to a file, and that file is read in with a 'tail -f'. Each incoming ping result is processed, which involves maintaining an array of deltas, touching a delay_file if a sufficient number of samples in the delta array are greater than a threshold, and updating a baseline; then that ping file is reset. A main loop runs every 'main_loop_tick_duration' and checks the load and how many reflectors are delayed, to decide how to change the CAKE bandwidth.
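In sketch form, a single reflector's monitor looks roughly like this (illustrative only - names, thresholds and the RTT parsing below are simplified stand-ins for the real code, which keeps a whole array of deltas rather than testing a single sample):
#!/bin/bash
# Rough, illustrative sketch of one reflector monitor; not the actual code.
reflector=1.1.1.1
ping_file=/tmp/ping_${reflector}
delay_file=/tmp/delayed_${reflector}
baseline_us=20000   # illustrative starting baseline (microseconds)
delay_thr_us=10000  # flag deltas greater than 10ms

> $ping_file
ping -i 0.1 $reflector > $ping_file&

tail -f $ping_file | while read ping_line
do
	# pull the RTT (in microseconds) out of a line like '... time=12.3 ms'
	rtt_us=$(echo "$ping_line" | awk -F'time=' 'NF>1 {split($2,a," "); printf "%d", a[1]*1000}')
	[ -z "$rtt_us" ] && continue
	delta_us=$(( rtt_us - baseline_us ))
	# the real code touches the delay file only when enough recent samples
	# exceed the threshold; one sample suffices for this sketch
	(( delta_us > delay_thr_us )) && touch $delay_file
	# EWMA-style baseline update (illustrative weights)
	baseline_us=$(( ( 9*baseline_us + rtt_us ) / 10 ))
	> $ping_file   # reset the file so it cannot grow
done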
Tail loop is here:
https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/monitor_reflector_path.sh#L71
Main loop is here:
https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/CAKE-autorate.sh#L158
So there isn't actually all that much processing going on. I suspect the 'tail -f' aspect could be optimized somehow, but that would seem to require advanced knowledge of how bash works with pipe buffers etc. The use of 'tail -f' and resetting the file is a hack to avoid the processing of ping results lagging behind the incoming ping results, and to avoid the file growing indefinitely.
It does stop the ping processes if the baseline has been coasting for 60s, and then it restarts them when the load picks up again. There should be no pings and reduced CPU utilisation in this 'sleep state'.
As a thought experiment, imagine:
ping -i 0.1 1.1.1.1 | while read ping_line; do echo $ping_line; sleep 1; done
The processing on the right is slower than the processing on the left, so the output will lag behind. I wanted to avoid this, hence I set up ping to output to a file, with the file read in using 'tail -f' and reset after each read, to ensure the processing does not lag behind and the file does not grow. This is a hack, and perhaps it can be optimized. For example perhaps it should reset the file only every X reads (see the sketch below), but I am not sure. On my own router I am not seeing any skipped pings with the reset implemented as is.
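For instance, something like this (an untested idea - resetting only every tenth read, trading a little file growth for fewer truncations):
#!/bin/bash
# untested variant: truncate the output file only every N reads
n=0
ping -i 0.1 1.1.1.1 > /tmp/ping_output&
tail -f /tmp/ping_output | while read ping_line
do
	echo $ping_line
	(( ++n % 10 == 0 )) && > /tmp/ping_output
done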
The processing on the right is slower than the processing on the left, so the output will lag behind. I wanted to avoid this, hence I set up ping to output to a file, with the file read in using 'tail -f' and reset after each read, to ensure the processing does not lag behind and the file does not grow. This is a hack, and perhaps it can be optimized.
I'm glad to see you have thought about this. That's the kind of thing I was suggesting... Could it be happening anywhere else?
Not sure if it matters, but the router in question that I have is an Intel NUC with an Intel(R) Celeron(R) J4005 CPU, 2.6GHz (Turbo). It must be noted, though, that it runs at 2.6GHz all the time, with most power saving disabled. Not sure if that skews results.
Idle:
Max Load:
I see the CPU usage is indeed quite high under load for the script, but personally I'm not too worried, as I still have some breathing room and the router has no other function.
Still, it could definitely use some optimization.
@Marctraider as you say there is some breathing room, but as the author of this script, to me that high load is totally unacceptable. Where is this load coming from? @richb-hanover I am perplexed because, for the reasons I gave above, I cannot see why the script's work would increase upon high load (save in comparison with the sleep state, where there are no incoming ping results). The ping intervals and main loop interval are fixed, so why would the script occupy more CPU cycles under a high load?
@Marctraider can you confirm that the low load also holds while the ping processes are active? That is, if you just run a Google internet speed test, then stop it, then wait 10 seconds, do you still see high CPU cycles?
By the way, I have 'irqbalance' enabled. That may make a big difference here by spreading things over the cores.
Please can you try uncommenting this line:
https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/functions.sh#L25
And let me know if the sleeps are always positive and what values they are? They are always positive for me, and pretty sizeable. Even with pings every 0.1s, 4 reflectors and main loop duration 200ms.
I manually assign interrupts and process affinities usually (either through taskset or through the CPUAffinity variable in systemd files). I have never managed to get irqbalance to work; somehow it's not acting on anything.
I actually think the OP is right:
There's so much CPU usage it overflows to the second core (basically 1.5x load on my 2 cores), which is also problematic for me as, on full load, the userspace app dealing with the redundant connection also almost fully saturates a core. So I often go over my CPU budget in certain situations.
This high CPU usage stays for at least 30-60 seconds until it stabilizes back to a few % after a bandwidth test.
Also, strangely, I get good performance with iperf3 regardless, but not with openspeedtest. This might be because there are fewer interrupts going on, as the local interface isn't used and the router is iperfing directly to my VPS endpoint; when I run an openspeedtest it obviously goes through a local client, saturating more CPU on the router because the local ethernet interface (eno1) generates even more interrupt CPU cycles.
Does this test script eat up CPU cycles on your system:
#!/bin/bash
ping -i 0.1 1.1.1.1 > /tmp/ping_output&
tail -f /tmp/ping_output | while read ping_line
do
	echo $ping_line
	> /tmp/ping_output   # truncate the file after each read so it cannot grow
done
Also, could you try sourcing 'monitor_reflector_path.sh' using:
. ./monitor_reflector_path.sh
And then just manually launch:
ping_reflector 1.1.1.1&
monitor_reflector_path 1.1.1.1&
Does the latter eat up CPU cycles?
At first glance, basically nothing to ~1% average, but after running it for 30 or more seconds I start to see the CPU usage of the script climbing up gradually (now at 7%). If this keeps up I can only conclude it will eventually max out and start to slow down because there are no more CPU cycles left.
(Result from your initial request; I will test the rest after dinner.)
Ah great. We have a possible reason for the issue now. So just to confirm the simple ping/tail test script results in increased cpu usage with time? This is very revealing.
Maybe it is a truncation issue? What happens if you replace the bare > reset with:
echo "" > /tmp/ping_output
@richb-hanover or anyone else why would my test script above result in cpu usage creeping up with time?
Sorry - I only answered some of your questions. And I realize there are new messages since I wrote this (and forgot to press Submit...) I'll look at the new questions later today.
The video you sent still shows very high CPU utilisation does it not? I mean it seems to show CPU @ 100%. Is that related to the bash script processes? I can't see the percentage occupied by the bash script processes.
Yes - I suspect about half of the CPU usage is the bash scripts.
I note that there are nine (9) CAKE-autorate.sh instances running. Is this expected?
root@LBWO-C7v2:~# top -b -n 1 | head -20
Mem: 54388K used, 69004K free, 1212K shrd, 4504K buff, 15484K cached
CPU: 50% usr 50% sys 0% nic 0% idle 0% io 0% irq 0% sirq
Load average: 1.74 0.65 0.23 1/72 10465
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
19511 19483 root S 2060 2% 17% bash ./CAKE-autorate.sh
19438 19416 root S 2060 2% 17% bash ./CAKE-autorate.sh
1580 1 root S 4376 4% 0% /usr/sbin/hostapd -s -g /var/run/hostapd/global
1581 1 root S 4280 3% 0% /usr/sbin/wpa_supplicant -n -s -g /var/run/wpa_supplicant/global
2071 1 root S 4096 3% 0% /usr/sbin/uhttpd -f -h /www -r LBWO-C7v2 -x /cgi-bin -u /ubus -t 60 -T 30 -k 20 -A 1 -n 3 -N 100 -R -p 0.0.0.0:80 -p [::]:80 -C /etc/uhttpd.crt -K /etc/uhttpd.key -s 0.0.0.0:443 -s [::]:443
1274 1 root S 2316 2% 0% /sbin/rpcd -s /var/run/ubus/ubus.sock -t 30
19344 26560 root S 2080 2% 0% bash ./CAKE-autorate.sh
19357 19347 root S 2060 2% 0% bash ./CAKE-autorate.sh
19382 19369 root S 2060 2% 0% bash ./CAKE-autorate.sh
10465 19511 root R 2060 2% 0% [bash]
19416 19344 root S 2056 2% 0% bash ./CAKE-autorate.sh
19483 19344 root S 2056 2% 0% bash ./CAKE-autorate.sh
19369 19344 root S 2052 2% 0% bash ./CAKE-autorate.sh
19347 19344 root S 2052 2% 0% bash ./CAKE-autorate.sh
1643 1 root S 1800 1% 0% /sbin/netifd
1 0 root S 1632 1% 0% /sbin/procd
ping_reflector_interval=0.1 # (seconds)
main_loop_tick_duration=200 # (milliseconds)
So just to confirm the simple ping/tail test script results in increased cpu usage with time? This is very revealing.
@lynxthecat Confirmed.
What happens if you replace the bare > reset with:
You mean as in so?:
#!/bin/bash
ping -i 0.1 10.0.0.1 > /tmp/ping_output&
tail -f /tmp/ping_output | while read ping_line
do
echo $ping_line
echo "" > /tmp/ping_output
done
In that case it starts with higher initial CPU usage, also growing over time (started around 15%, now already at 40%).
Output:
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
64 bytes from 10.0.0.1: icmp_seq=1857 ttl=64 time=4.79 ms
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
tail: /tmp/ping_output: file truncated
@richb-hanover
I note that there are nine (9) CAKE-autorate.sh instances running. Is this expected?
The main script calls the other scripts (which in their turn run other programs like ping), and these are child processes of the parent process, so I think that is to be expected.
Yes @Marctraider exactly - there are a bunch of child processes created so multiple entries are expected. I am shocked by even just:
#!/bin/bash
ping -i 0.1 10.0.0.1 > /tmp/ping_output&
tail -f /tmp/ping_output | while read ping_line
do
echo $ping_line
echo "" > /tmp/ping_output
done
Resulting in a CPU spike with time. If that does, then the script will also.
Can you confirm that removing the 'echo' line results in CPU usage staying very low? If so, the question is how to deal with the reset / file growth problem in a way that does not result in increased CPU usage over time.
Yes @Marctraider exactly - there are a bunch of child processes created so multiple entries are expected. I am shocked by even just:
#!/bin/bash
ping -i 0.1 10.0.0.1 > /tmp/ping_output&
tail -f /tmp/ping_output | while read ping_line
do
echo $ping_line
echo "" > /tmp/ping_output
done
Resulting in a CPU spike with time. If that does, then the script will also.
Can you confirm that removing the 'echo' line results in CPU usage staying very low? If so, the question is how to deal with the reset / file growth problem in a way that does not result in increased CPU usage over time.
Results in no growing CPU usage, and very little CPU in general (~0.3% avg).
@Marctraider OK great. So the question for me now is how to pipe to a file or fifo in bash without the file growing, skipping lines as necessary.
I have been experimenting with this:
#!/bin/bash
pipe=/tmp/testpipe
mkfifo $pipe
ping -i 0.1 1.1.1.1 > $pipe&
while read ping_line <$pipe
do
	# intent: only echo the previous line if no newer line arrives within 10ms
	# (note: $(read -t 0.01) runs in a subshell and always expands to nothing,
	# so this test is always true)
	if [ -z $(read -t 0.01) ]; then
		echo $last_line
	fi
	last_line=$ping_line
done
Basically I need some mechanism to ensure that processing does not lag behind incoming pings, and that pings are skipped as necessary.
How is this achieved though?
The problem can be summarized like this. If you pipe ping with interval 0.1s to a while loop with sleep 1s, how do you skip pings as necessary to only ever echo the latest ping? To explain, consider this:
#!/bin/bash
ping -i 0.1 1.1.1.1 | while read ping_line
do
echo $ping_line
sleep 1
done
The problem with the above is that processing takes 1s, but pings arrive every 0.1s. How do we skip to the last line, though?
This is the problem we need to resolve.
Hrm.. I will try and think with you on this, but I'm not the best when it comes to scripting .__.
Ideally you'd want to fetch as many pings as possible, without just randomly skipping when not necessary, or you'll lose valuable data points. Something with a counter to see how many are held in the buffer?
Not entirely related to this, but assuming CPU usage is fixed: is it possible to have even lower than 0.1 (100ms) ping intervals with regular ping? Or is that the limit?
Even lower is possible. And since you have your own server, you can go as low as you like without worrying about getting into trouble.
Even lower is possible. And since you have your own server, you can go as low as you like without worrying about getting into trouble.
Yea certainly. I just recall some ping utilities where 0.1 was the limit, IIRC. But I could be mistaken.
So you are saying that basically pings stack up faster than the script can handle, thus starting to buffer and buffer, with more CPU usage as a result?
No it's just I am simulating the problem.
#!/bin/bash
ping -i 0.1 1.1.1.1 | while read ping_line
do
echo $ping_line
sleep 1
done
The sleep 1 is just to simulate a delay in processing the incoming pings. We somehow need to skip to the latest ping available, if that makes sense, such that we only echo the latest available ping each time. Naturally, so long as the processing is fast enough we don't need to skip any pings, but we do need to manage the situation in case of CPU spikes.
I hacked around this by resetting the file within 'tail -f', but this appears to give rise to high CPU usage for some reason I do not understand.
Something like this helps:
#!/bin/bash
file=/tmp/test
# each echo truncates and rewrites the file, so it only ever holds the latest line
ping -i 0.1 1.1.1.1 | while read line; do echo $line > $file; done&
sleep 1
while true
do
cat $file
sleep 1
done
But I need to replace the while loop with something that executes on new data.
So I'm not sure this solves the problem entirely either.
No it's just I am simulating the problem.
#!/bin/bash
ping -i 0.1 1.1.1.1 | while read ping_line
do
echo $ping_line
sleep 1
done
The sleep 1 is just to simulate a delay in processing the incoming pings. We somehow need to skip to the latest ping available, if that makes sense? Such that we only echo the latest available ping each time. Naturally so long as the processing is fast enough then we don't need to skip any pings, but we do need to manage the situation in case of CPU spikes.
I hacked around this by resetting the file within 'tail -f', but this appears to give rise to high CPU usage for some reason I do not understand.
Possible to clear the file with another command instead?
That's what I tried with the echo "" > but you said the cpu issue remained. Maybe we can try something else?
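For reference, a couple of other ways to clear a file that might be worth testing (truncate is in coreutils; whether a given busybox build has it varies):
: > /tmp/ping_output             # shell no-op plus redirection; same effect as a bare >
truncate -s 0 /tmp/ping_output   # explicitly sets the file size to zero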
I'd like to look back at a comment from @Marctraider
Not entirely related to this, but assuming cpu usage is fixed; Is it possible to have even lower than .1 (100ms) of ping intervals with regular ping? Or is that the limit?
As we continue our research, I hope we explore the other direction: how slowly can we send those pings while still maintaining good latency and speed control? Finding a good lower limit to the ping rate would have these advantages:
- It avoids the "tragedy of the commons",
Do we have any information about how rapidly/how much LTE or Cable Modem speeds vary in real life? Maybe we could collect it ourselves...
We could collect long-term data (over the course of hours? maybe even days?) that record how often the CAKE-autorate algorithm needs to adjust the parameters, and the effect those changes made on latency (or the effect on latency, had those changes not been made...)
I think it is almost directly related to how fast your ping interval is, truthfully, since the script's ability to react to bufferbloat is completely dependent on the ICMP responses going back and forth. The smaller the interval between pings, the faster you can act on that first big bufferbloat 'spike'. Sure, you might not mind up to a full second of pure bufferbloat (with an ICMP interval of a full second), but that's AFAIK not what QoS/SQM is about - the point is to contain this to a minimum.
(For a gamer or something, a full second of bufferbloat could mean a full second of delay and subsequently warping/hitching for the duration. VoIP would be pretty heavily affected as well.) For me personally, I would never use this or any script while gaming either, as I play at a somewhat competitive level. As soon as I run a game on my desktop, my router goes into low latency mode and I set a very conservative fixed CAKE configuration which is guaranteed to hold my minimum speed.
What would be the reason that 0.1 is even troublesome? I bet our CPUs (well, most of them) can handle hundreds of pings a second without much trouble. And the payload can be kept to a minimum as well, to keep bandwidth in check.
What you request is basically already in there: lots of options to tweak to minimize ping intervals yet have a decent, stable connection, by sharply decreasing and gently increasing bandwidth until RTT crosses some threshold.
I do like the idea of some auto calibration; I think the Lua version has some mechanism like that.
I think perhaps the solution may be something along these lines:
#!/bin/bash
pipe=/tmp/test
mkfifo $pipe
ping -i 0.1 1.1.1.1 > $pipe&
sleep 1
while read ping_line <$pipe
do
	echo $ping_line
	# attempt to flush any backlog out of the fifo without blocking
	dd if=$pipe iflag=nonblock of=/dev/null
done
But that does not work. I can't work out how to 'reset' the fifo buffer.
As we continue our research, I hope we explore the other direction: how slowly can we send those pings while still maintaining good latency and speed control?
There are two issues at play here: 1) the ping rate must not overwhelm reflectors - I think we can solve that by interleaving the pinging of multiple reflectors, which will allow us to keep the individual per-reflector rate low; 2) the delay in getting new delay estimates, and hence the delay in detecting and responding to congestion.
As a data point, normal unix ping will allow intervals down to 0.2 seconds for non-root users while defaulting to 1 second intervals; that indicates to me that a per-reflector rate somewhere between 0.2-5Hz (5 to 0.2 second intervals) might be acceptable if we select from a sufficiently large set of reflectors. (Or better yet, operate our own reflectors, at which point higher rates are an option - and here I mean that each user of autorate would operate her/his own VPS reflector.)
For 2, I initially thought once per second would be enough, but @lynxthecat's access link demonstrates that might be a bit coarse and result in relatively long delay spikes... Getting this down to 0.1 seconds seems feasible to me.
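In sketch form, the interleaving idea looks something like this (illustrative values): four reflectors each pinged at a relaxed 0.8s interval, staggered by 0.2s, still yield one sample every 0.2s overall:
#!/bin/bash
# illustrative interleaved pinging: per-reflector rate stays low while the
# aggregate sample rate stays high
reflectors=(1.1.1.1 1.0.0.1 8.8.8.8 8.8.4.4)
for reflector in "${reflectors[@]}"
do
	ping -D -i 0.8 $reflector > /tmp/ping_${reflector}&
	sleep 0.2   # stagger the start times so samples interleave
done
wait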
- It avoids the "tragedy of the commons",
I do not think that tragedy actually exists... IMHO this is just a way to dress up "insufficiently strict rule enforcement" in terms that make it appear to be something other than regulatory failure, often pushed by parties that have a vested interest in seeing regulation fail - but I digress. Your point stands though: if the reflectors encounter too high a load, I expect them to rate-limit/de-prioritize ICMP packets (basically enforcing their rules, so not regulatory failure), and the autorate code needs to expect and react to that in a considerate way (changing reflectors and giving the unhappy reflector time to cool off).
Do we have any information about how rapidly/how much LTE or Cable Modem speeds vary in real life? Maybe we could collect it ourselves...
I think to do that one would need to continuously saturate the link for a considerable amount of time and measure the packets traversing the link... just looking at how often the code triggers will not give the "veridical" rate-changes of the link.
I think we want a last in first out buffer that is flushed on every read.
This one issue remains; otherwise everything works well with my bash implementation.
Specifically this form:
#!/bin/bash
ping -i 0.1 1.1.1.1 > /tmp/ping_output&
tail -f /tmp/ping_output | while read ping_line
do
echo $ping_line
> /tmp/ping_output
done
is what is presently implemented. It works, but results in increasing CPU usage with time for some reason I do not understand.
I tried:
#!/bin/bash
> /tmp/ping_output
stdbuf -oL ping -i 0.1 1.1.1.1 > /tmp/ping_output&
tail -f /tmp/ping_output | while read ping_line
do
	# drain any queued lines so ping_line ends up holding the most recent one
	while read -t 0.01 last_line; do ping_line=$last_line; done
	echo $ping_line
done
But this also fails. I am rather stumped by this.
If the fifo can be relied upon to block ping whilst we are processing and not reading, I suppose that may work too.
Again, why not simply run ping in single-shot mode (-c 1) and restart it again?
On an admittedly not too weak dual core arm a9 router I see:
root@turris:~# time /usr/bin/ping -D -c 1 192.168.42.201
PING 192.168.42.201 (192.168.42.201) 56(84) bytes of data.
[1646741801.236544] 64 bytes from 192.168.42.201: icmp_req=1 ttl=64 time=0.321 ms
--- 192.168.42.201 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.321/0.321/0.321/0.000 ms
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
so if the rounding is correct, less than 5-10 milliseconds of "cost" for setting up, calling, and executing ping (to a local target, so we see the rest of ping's static cost better) - so with a desired interval of 100ms there is no need to keep ping running...
EDIT: would be nice to get similar measurements for less beefy routers.
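In sketch form, the one-shot approach would be something like this (illustrative; -W 1 caps the wait for a reply at one second):
#!/bin/bash
# illustrative one-shot loop: one ping per tick means no backlog of unread
# results can ever build up, so nothing needs to be skipped
while true
do
	ping_line=$(ping -D -c 1 -W 1 1.1.1.1 | grep 'time=')
	[ -n "$ping_line" ] && echo "$ping_line"   # process the single sample here
	sleep 0.1
done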
Perhaps I should revert to this. But I haven't quite given up yet.
Seems I've hit a self-imposed brick wall because I stubbornly want the 'tail -f' / ping or fifo block/unblock approach to work. So near, yet so far.
Set against my obsessive nature, since an intellectually satisfying and elegant solution escapes me, I have put the questions here:
https://forum.openwrt.org/t/for-bash-experts-pipe-buffering/122119/24
and here:
Hopefully something will come up, and failing which we can fall back to the one shot ping approach.
@Marctraider does this result in CPU increasing with time for you:
#!/bin/bash
pipe=/tmp/test
mkfifo $pipe
ping -D -i 0.1 1.1.1.1 > $pipe&
# continuously drain the fifo so ping never blocks on a full pipe buffer
cat $pipe > /dev/null&
while true
do
	# head competes with cat for lines: it pulls out whichever line it wins,
	# while stale lines are swallowed by cat instead of accumulating
	head -1 $pipe
done
This may be a possible solution.
@richb-hanover @Marctraider I have pushed new commits to 'CAKE-autorate' based on the bash named pipe solution - does it resolve the CPU usage issue? Now ping results are output to a bash named pipe (fifo), which is continually flushed to /dev/null, but individual values can be pulled out of the fifo using 'head -1' in a while loop. I hope this resolves the CPU issue associated with the 'tail -f' + truncation.
I retested single-shot pings, but calling ping repeatedly in one shot mode also results in a CPU usage spike on my RT3200.
Looks like this fixes it - I see 2-5% usage at best in all situations, both fully loaded and idle.
Great! I will keep testing upcoming days for oddities but it looks like you nailed it. 👍
Alas... Even with this new version, I still see high usage. Here's a YouTube video showing a few percent when the router is idle (and not running CAKE-autorate.sh).
Starting it consumes ~100% of the CPU, and the load average keeps creeping up.
Would it help to get an ArcherC7 for your own testing? (I could think about ways to make that happen)
A few hours after starting the script, having done all sorts of things in the meanwhile, including high-bandwidth streams (4K 100+ Mbps).
I will run it overnight, do some more testing, and see if it still behaves after that.
Starting it consumes ~100% of the CPU, and the load average keeps creeping up.
Interesting, it appears mostly to be SIRQ (if I recognized the color of the bars correctly) and does not reflect in a large amount of CPU% assigned to either bash or the ping processes themselves. I wonder whether that is related to the FIFO, or any other things the script might do?
In case you repeat that test, could you also get the output of tc -s qdisc while the script is running, as well as the output of df -h (to see whether /tmp might be too large)? Finally, it would be interesting to see a speedtest with the script running... I assume that will be atrocious.
Also, I have made a few more significant changes given the outcome of the discussion at:
https://forum.openwrt.org/t/for-bash-experts-pipe-buffering/122119/
Now the while loop for each reflector actually keeps up with every ping (not every other ping), and so I fear this may actually result in a little increased CPU usage, but the increase is associated with actually processing more of the ping results (it should catch all of them, which can be verified by uncommenting the 'echo ping line' in 'monitor_reflector_path').
@Marctraider it would be good to verify that the creeping CPU usage effect has still been ameliorated.
I am hoping there is some low hanging fruit in terms of getting CPU usage down, but I am operating at my limit in terms of bash scripting - it already has quite a few fancy features such as background process management, fifo use, etc.
Do any of you see any such low hanging fruit offering easy reductions to CPU usage? @dim-geo mysteriously helped out with my original 'ash' script and got CPU usage down, so perhaps he has some ideas for this new bash implementation in the 'CAKE-autorate' branch.
@richb-hanover thanks very much for the kind offer about the router. I am definitely keen to try to optimise this but maybe let's see first if we can get it down with a few easy fixes on the code.
And thank you for taking the time to make that video. It certainly looks bad, but I am confused about why the CPU usage shows 100% while the CAKE-autorate process does not come up as having 80% or so. What sort of processor / kernel version does this use, I wonder? Is there a way we can profile the CPU usage to see what is actually eating up the CPU cycles?
About the Archer C7 and SIRQ, is this thread relevant:
https://forum.openwrt.org/t/archer-c7-5ghz-performance-sirq-99/63104/19
See especially this:
https://forum.openwrt.org/t/archer-c7-5ghz-performance-sirq-99/63104/13?u=lynx
The sirqs are almost certainly being generated by network TX or RX. You can see by cat /proc/softirqs. Do it before and after an iperf run, copy/paste the two outputs into an Excel sheet using the text import paste function, and check the difference to see where they're being generated. Depending on the direction of your iperf transfer, most of the softirqs will likely be in either NET_TX or NET_RX.
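In other words, something along these lines:
# capture and compare softirq counters around a test run
cat /proc/softirqs > /tmp/softirqs_before
# ... run the speed test ...
cat /proc/softirqs > /tmp/softirqs_after
diff /tmp/softirqs_before /tmp/softirqs_after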
So could it literally just be the pings that are causing the CPU usage?
Could you test what happens with just launching 4 concurrent background pings with a 0.1s interval?
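e.g. (discarding the output, so only the pinging itself is measured):
for reflector in 1.1.1.1 1.0.0.1 8.8.8.8 8.8.4.4
do
	ping -i 0.1 $reflector > /dev/null&
done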
So 12+ hours later the script still only consumes a few % CPU, so no increase in load over time (the version from yesterday).
I haven't checked the new one yet.
The Archer C7 seems to have a pretty weak CPU and hardware in general, so I don't expect it to have as low a load as I do, but 100% seems excessive. Since I also don't see CPU usage per process in the video, maybe the whole CPU usage readout is buggy? Or it could indeed somehow be hardware/software/network related.
@Marctraider I am very grateful for you confirming. How is the performance in terms of rate control for you? For me, I am very happy with the way it is handling my CAKE rate (which spans from about 20 to about 70 Mbit/s). Are you using 4 samples, with 1 sample counting a reflector as delayed, and the 10ms threshold, or have you tweaked these further?
The new version will likely perform slightly better because the ping frequency (as actually processed) is higher, and also I now interleave pings so they are better spread out. I expect CPU utilisation to increase a little, but hopefully it will remain low. And I certainly don't want to see the CPU creeping up with time like it did before with 'tail -f' + truncation. BTW, on your system is /tmp just ordinary RAM? I am still curious what caused that CPU usage creep.
Also, is your load constant or does it stop? I have been playing around with the ping sleep function. Now, when the base rate is maintained for X seconds, the pingers go to sleep and the minimum rate is set. You can see the pingers have gone to sleep using 'ps | grep ping' - you see 'T' rather than 'S' - or you can use tcpdump and inspect the interface.
I really think with some optimisation/robustification this is looking very promising as a generic solution for all.
You're very welcome.
I haven't done any extensive tweaking and min/maxing, but I'm lucky to have a practical scenario where I can literally drop my internet speed at will, so I can basically 'simulate' a sudden drop in available bandwidth without 'switching' connections, which is really nice and will definitely help with tweaking latency spikes.
So far I'm really happy about its performance and results. Anywhere between 5 and 10% CPU is still acceptable for me, so in my book I'm fine with some additional cycles used if that means more optimization.
As for /tmp, it's the Debian default (so I assume just disk). df -h does not indicate a mountpoint directly at /tmp, so I assume it just goes under / (maybe this was also a problem? I assume memory access is faster). On the other hand, I assume the OS just buffers/caches this and doesn't continuously read/write to disk either way.
root@redundant:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 1.9G 0 1.9G 0% /dev
tmpfs 375M 724K 374M 1% /run
/dev/mmcblk0p2 28G 9.4G 17G 37% /
tmpfs 1.9G 0 1.9G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/mmcblk0p1 511M 3.5M 508M 1% /boot/efi
/dev/sda1 110G 24K 104G 1% /mnt/backup
tmpfs 375M 0 375M 0% /run/user/0
As for load, it looks like I see two processes of CAKE-autorate.sh at an average of 1% and 2% continuously (on a baseline 175Mbit qdisc) right after a bandwidth test. Shortly after that (assuming the pings stop) I just see one process at roughly 1-2%.
I see now that after a while the tc qdisc goes to the set minimum of 90Mbit - this is correct, yes?
I will replace the script with the new version tonight.
It would be good to verify /tmp/CAKE-autorate is on ram for most users. If it would sometimes end up on a disk like an HDD I imagine that might hurt performance a little.
Concerning the minimum: yes. It is designed to go to sleep providing the base load is maintained for ping_sleep_thr seconds - it stops the pinger processes and sets the bandwidth to the minimum (for safety, to deal with any step loads). Then, when the load hits the load threshold, it turns the pingers back on, does not change the rate until the next loop, and then allows decaying back to base or, if the load is sustained, increasing at the normal set rate for load increases. The logic for this made my head spin last night and the main loop inside CAKE-autorate is now a little complicated. It may be that it can be simplified, to help speed up at least the main loop process.
I just feel better sleeping at night knowing my router isn't firing off pings and pointlessly maintaining baselines and processing delays, etc when the connection is not even in use.
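In sketch form, the sleep/wake logic is something like this (illustrative only - not the actual main loop; the thresholds and the load measurement are stand-ins):
#!/bin/bash
# illustrative sketch of the pinger sleep/wake state machine
ping_sleep_thr=60   # seconds at base rate before the pingers sleep
load_thr_pc=50      # percent utilisation that wakes them again
t_at_base=0
asleep=0

ping -D -i 0.1 1.1.1.1 > /dev/null&
ping_pid=$!

while true
do
	load_pc=$(( RANDOM % 100 ))   # stand-in for the real load measurement
	if (( asleep == 0 )) && (( load_pc < load_thr_pc )); then
		(( t_at_base++ ))
		if (( t_at_base >= ping_sleep_thr )); then
			kill -STOP $ping_pid   # pinger now shows as 'T' in ps
			asleep=1
			# the real script also sets CAKE to the minimum bandwidth here
		fi
	elif (( asleep == 1 )) && (( load_pc >= load_thr_pc )); then
		kill -CONT $ping_pid   # pinger resumes and shows as 'S' again
		asleep=0
		t_at_base=0
	elif (( load_pc >= load_thr_pc )); then
		t_at_base=0   # load is up while awake, so reset the base-rate timer
	fi
	sleep 1
done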
This is what you should see (going from sleep to active):
root@OpenWrt:~/CAKE-autorate# ps |grep ping
20171 root 1248 S grep ping
31825 root 764 T /usr/bin/ping -D -i 0.1 1.1.1.1
31838 root 764 T /usr/bin/ping -D -i 0.1 1.0.0.1
31856 root 764 T /usr/bin/ping -D -i 0.1 8.8.8.8
31870 root 764 T /usr/bin/ping -D -i 0.1 8.8.4.4
root@OpenWrt:~/CAKE-autorate# ps |grep ping
20536 root 1248 S grep ping
31825 root 764 S /usr/bin/ping -D -i 0.1 1.1.1.1
31838 root 764 S /usr/bin/ping -D -i 0.1 1.0.0.1
31856 root 764 S /usr/bin/ping -D -i 0.1 8.8.8.8
31870 root 764 S /usr/bin/ping -D -i 0.1 8.8.4.4
The new version handles the fifo buffer slightly differently so I am hoping it is still OK. It also interleaves pings. And on my router I am seeing every ping line captured with 4 reflectors and 0.1s pings, so it looks good.
I have spent quite a bit of time dealing with the problem of router CPU spikes (e.g. owing to VPN or something else) - ideally ping results are skipped if the processing does not keep up, so that it still works with the latest values and not old ones. There may be a need to apply a 'dd flush' on the fifo at certain points, but I am not sure yet.
Can you see any optimizations for this loop:
https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/monitor_reflector_path.sh#L52
Since this is where a lot of the processing is done.
By the way:
echo $(date +%s.%N) $ping_line
here:
https://github.com/lynxthecat/sqm-autorate/blob/CAKE-autorate/monitor_reflector_path.sh#L73
Can be used to verify that the pings are being processed in the right way by inspecting the ping lines and comparing timestamps against the system timestamps. The times should be fairly in sync.
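For example, a processed line would then look something like this (illustrative output), with the leading system timestamp and ping's -D timestamp staying close together so long as processing keeps up:
1646741801.342511 [1646741801.336544] 64 bytes from 1.1.1.1: icmp_seq=42 ttl=58 time=12.3 ms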
root@redundant:~# ps
PID TTY TIME CMD
695957 pts/1 00:00:00 bash
716593 pts/1 00:00:00 ps
root@redundant:~#
No results for pings, but that is not right as I can clearly see the pings going in 'top'; also, my own pings from another script are running. But I'm sure it works fine. I could indeed check with tcpdump, but there are also other pings to the same gateway ongoing at the moment, lol. (Unless this is only visible with the latest branch update.)
Can't see anything in the script to optimize at the moment, but again the usage is really, really low, and my script runs at low intervals at the moment.
Field report for https://github.com/lynxthecat/CAKE-autorate/commit/6069321f5cbeaf577e42b42a6768c1335fefeb3c on my Archer C7v2...
Using reflector_ping_interval=0.1: betterspeedtest.sh made CPU spike to 100% and Load Average go to 2.0+
Using reflector_ping_interval=0.2: betterspeedtest.sh made CPU spike to 85-95% during download and Load Average go to 2.0+
Also: During the download phase of betterspeedtest.sh the script continually writes lines like this, over and over...
WARNING: encountered response from [ 8.8.8.8 ] that is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.4.4 ] that is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] that is > 500ms old. Skipping.
WARNING: encountered response from [ 1.1.1.1 ] that is > 500ms old. Skipping.
This did not happen during the upload phase of the speedtest
Thank you for testing @richb-hanover. The skipping occurs when the main processing loop cannot keep up with the rate of incoming data. It seems like with the Archer C7 even just ordinary traffic at your line rate causes huge CPU spikes. Did this also occur with the ping interval set to 0.2? How about reducing the reflectors to just 2 and using 0.5?
Do you have a sense of how these different scripts you are testing are performing in terms of controlling CAKE bandwidth (i.e. offering bandwidth when needed but preventing significant bufferbloat?).
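For reference, the skip fires when a sample's timestamp lags the current time by more than 500ms, i.e. roughly this check (illustrative, not the exact code; timestamps in microseconds):
# a sample is skipped if it is more than 500ms (500000us) old
t_start_us=$(( $(date +%s%N) / 1000 ))   # current time in microseconds
ping_ts_us=1647651328909500              # example reflector timestamp from the logs
if (( t_start_us - ping_ts_us > 500000 )); then
	echo "WARNING: encountered response from [ 8.8.8.8 ] that is > 500ms old. Skipping."
fi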
Thank you for testing @richb-hanover. The skipping occurs when the main processing loop cannot keep up with the rate of incoming data. It seems like with the Archer C7 even just ordinary traffic at your line rate causes huge CPU spikes.
I'm not sure that's the explanation. I changed the code to display $t_start and $other_value, and got these results:
WARNING: encountered response from [ 1.1.1.1 ] 1647651329344444 1647651328837871 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647651329425467 1647651328909500 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647651329628267 1647651329110978 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] 1647651329815810 1647651329279158 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647651329821885 1647651329309592 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.1.1.1 ] 1647651329960633 1647651329436894 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.8.8 ] 1647651330065726 1647651329515942 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.4.4 ] 1647651330078407 1647651329572759 is > 500ms old. Skipping.
WARNING: encountered response from [ 1.0.0.1 ] 1647651330201532 1647651329677802 is > 500ms old. Skipping.
WARNING: encountered response from [ 8.8.4.4 ] 1647651330924455 1647651330375886 is > 500ms old. Skipping.
The differences are greater than 500,000, but only slightly. However, I never see 500+ msec of lag on the router - the added latency is generally 5-10 msec.
NB: If I create a two-second blast of data, 10 lines of '> 500ms old. Skipping.' get generated when the ping interval is 0.2.
Did this also occur with the ping interval set to 0.2? How about reducing the reflectors to just 2 and using 0.5?
No. Either setting ping interval to 0.2 or using only two reflectors doesn't generate those messages.
Actually, not quite true. Sometimes, either of those combinations causes a single Skipping message to be generated.
Do you have a sense of how these different scripts you are testing are performing in terms of controlling CAKE bandwidth (i.e. offering bandwidth when needed but preventing significant bufferbloat?).
I can't say. (I have fixed speed at 25mbps - the latency always stays between 7 & 15 msec)
Another thought: the deltas between the values are all slightly greater than 500,000 (Excel is my friend...)
The third column is the delta. If that value is (for some unexplained reason) too large by 500,000, the remainders would be reasonable latencies: 506570 - 500000 = 6570 µs, i.e. 6.57 msec, then 15.96 msec, 17.29 msec, etc.
1.64765E+15 1.64765E+15 506570
1.64765E+15 1.64765E+15 515960
1.64765E+15 1.64765E+15 517290
1.64765E+15 1.64765E+15 536660
1.64765E+15 1.64765E+15 512290
1.64765E+15 1.64765E+15 523740
1.64765E+15 1.64765E+15 549780
1.64765E+15 1.64765E+15 505650
1.64765E+15 1.64765E+15 523730
1.64765E+15 1.64765E+15 548570
Thanks @richb-hanover. There is a glitch here it seems. Perhaps I need to flock the writes.
I'm starting a new issue from #16 that began life as a trouble report, but morphed into observations about performance.
At @lynxthecat's suggestion, I adjusted defaults.sh to slow the ping rate. It did improve the utilization: here's a video of htop while idle, then under load, and then recovering from the load: https://youtu.be/pUZ7MpN85K4
PS I can't really tell whether this code addresses latency. I have a fixed-speed fiber link through my ISP at 25/25 mbps, so my link speed doesn't vary much. So I just cannot say whether it makes a difference.