lynxthecat / cake-autorate

Eliminate the excess latency and jitter terrorizing your 4G, 5G, LTE, Starlink or other variable rate connection!
https://www.bufferbloat.net
GNU General Public License v2.0

Optimize code to minimize CPU consumption #154

Closed: rany2 closed this issue 10 months ago

rany2 commented 1 year ago

What I've tried that didn't work:

Any suggestions on what to try next? @moeller0 @lynxthecat

rany2 commented 1 year ago

I have a feeling something about the bash read loop uses a ton of CPU.

moeller0 commented 1 year ago

I think the report was not "large CPU consumption per se" but higher CPU consumption than before (with a somewhat undefined "before" state). I have a hunch that the change to do full regular expression parsing in concurrent_read_integer() might have increased the CPU cost somewhat (while at the same time increasing correctness and robustness), and that the report might compare the state before that change with 2.0. But that is mostly speculation on my part.

rany2 commented 1 year ago

I thought the same, but I've replaced the regex with awk in all functions, including concurrent_read_integer, and got rid of the arithmetic expansion. I'm completely lost as to what this issue might be, but I know the monitor processes are the ones using the CPU.

rany2 commented 1 year ago

I think that by getting rid of the bash arithmetic expansion and regex and replacing them with awk, I ruled out the possibility of that being the issue. It must be something else.

moeller0 commented 1 year ago

I do not believe that going from bash to awk is going to amortize the cost of switching to another binary for short pieces of code like in the parser. @Lynx took great care to handle as much as possible directly in bash, which I assume to be the correct decision. My point is more that fully fledged regular-expression-based input parsing is going to be more expensive than a simple approach that only checks the few things that are really essential via the simplest check possible...

but I know the monitor processes are the ones using the CPU.

These do all the munging of the pinger output into the internal data format autorate uses that essentially abstracts over different delay sources... so are expected to be heavy...

moeller0 commented 1 year ago

I think that by getting rid of the bash arithmetic expansion and regex and replacing them with awk, I ruled out the possibility of that being the issue.

I guess you actually did some measurements to figure out the relative costs of both approaches, so how much does the CPU cost differ between the two, and which version do you intend to carry forward?

rany2 commented 1 year ago

I guess you actually did some measurements to figure out the relative costs of both approaches, so how much does the CPU cost differ between the two, and which version do you intend to carry forward?

There was no difference whatsoever. They both hovered in approximately the same range, ~20% on my machine.

rany2 commented 1 year ago

These do all the munging of the pinger output into the internal data format autorate uses that essentially abstracts over different delay sources... so are expected to be heavy...

Right, but I do not expect something like this to take up 20% of my CPU, especially not on an x86 machine. It just seems so odd that this relatively basic text processing is so resource intensive.

moeller0 commented 1 year ago

OK, so which one will you carry forward?

rany2 commented 1 year ago

I'll keep the bash version for readability. No need to spawn an additional process and increase complexity if bash already supports it and seems to be just as good as awk performance-wise.

rany2 commented 1 year ago

I fear I'm at a deadlock now... at a loss for ideas. My best guess now is that moving away from shell is the only solution to this issue, or possibly trying to add support for another interpreter like zsh (so we could attempt to support both bash and zsh).

I've heard zsh has better performance than bash and less voodoo code.

moeller0 commented 1 year ago

Again, the report was not "absolutely too high CPU cost" but "noticeable increase in CPU cost"; the solution to that seems to me to be figuring out which of the recent changes increased the CPU cost rather than contemplating extreme measures like changing the language of implementation... It was always clear that bash is not the ideal vehicle performance-wise, so a certain high level of CPU cost is acceptable. Even on OpenWrt, bash allows rather rapid development, as no compiler is needed and the "run-time" is available in opkg. The lua-autorate effort embraced a better interpreter, but ran into issues with requiring extra packages and OpenWrt changes that turned out hard to come by, robbing the effort of momentum.

Let's hope @gadolf will report a bit more precise information about the before and after states that might allow us to git-bisect the changes responsible for the added cost...

rany2 commented 1 year ago

I've always intended to try and get this sorted. @Gadolf's report of increased load was not the driver behind this issue; rather, I was always put off by the ~20% CPU utilization.

Edit: I'll attempt to get zsh to work on it; porting to zsh is not as difficult as rewriting everything from scratch... so I think it's worth a shot.

moeller0 commented 1 year ago

I've always intended to try and get this sorted. @Gadolf's report of increased load was not the driver behind this issue; rather, I was always put off by the ~20% CPU utilization.

Well, those are two separate issues: a reported apparent regression, and the generally high cost of implementing a control loop in an interpreter... IMHO the first one is more important to tackle than the second; everyone using autorate has already made peace with the relatively high CPU cost.

Edit: I'll attempt to get zsh to work on it; porting to zsh is not as difficult as rewriting everything from scratch... so I think it's worth a shot.

This appears rather heavy-handed... it is your time though, so best of luck.

P.S.: The traditional approach would be to isolate which step is the costly one and then try to figure out how to optimize or avoid that step.

lynxthecat commented 1 year ago

Isn't there a way to profile the bash code to see which lines are consuming the most cycles? Is it related to the writing and reading of files to pass data between processes? I had wondered at one point about trying to set up a kind of data sharing highway between processes using coproc, but going to such a length would only make sense if the writing and reading of files is indeed causing excessive consumption.

rany2 commented 1 year ago

I believe you could set the PS4 prompt and then create a script to analyze the PS4 output to see which lines took the most time, but with bash the process isn't very easy and I honestly wouldn't bother.
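For illustration, a rough sketch of that PS4-based timing approach (the trace file path is made up, and EPOCHREALTIME needs bash >= 5):

    #!/usr/bin/env bash
    # Timestamp every traced line, then sort the gaps between consecutive lines
    PS4='+ $EPOCHREALTIME ${BASH_SOURCE##*/}:${LINENO}: '
    exec {trace_fd}>/tmp/cake-autorate-trace.log
    BASH_XTRACEFD=${trace_fd}
    set -x

    # ... code under test goes here ...

    set +x

    # Print the traced lines with the largest gap to the following line
    awk '{ if (prev) print $2 - prev_ts, prev; prev = $0; prev_ts = $2 }' \
        /tmp/cake-autorate-trace.log | sort -rn | head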

rany2 commented 1 year ago

It's simple enough with a basic script but I tried with cake-autorate and it's just too tedious.

moeller0 commented 1 year ago

So I guess I am missing something here, but for testing I would simply comment out the lines one by one, or in logical sets (resulting in an ineffective controller, but this is only about figuring out the expensive parts), use static assignments of the necessary variables (those not read due to being commented out), and then see which of the operations is the most expensive.

rany2 commented 1 year ago

@moeller0 I've made some smaller scripts to emulate portions of cake-autorate, and it appears the most expensive part of the script is the file writing operations. This explains why the monitor is the most CPU intensive: it has the most open, truncate, and write operations, and that is what eats up the CPU, not arithmetic/checks/regex. I have a few ideas on how to optimize it which I'll try out.
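A toy microbenchmark along these lines (not code from the repo; it assumes /tmp is tmpfs) makes the gap visible:

    #!/usr/bin/env bash
    # Compare repeated small file writes against pure in-shell arithmetic
    iterations=10000

    time for ((i = 0; i < iterations; i++)); do
        printf '%d\n' "${i}" > /tmp/benchmark_value   # open/truncate/write/close each pass
    done

    time for ((i = 0; i < iterations; i++)); do
        value=$(( i * 2 + 1 ))                        # arithmetic only, no extra syscalls
    done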

rany2 commented 1 year ago

A quick question: what's the point of the large number of pingers? Couldn't we do with just one pinger, because that seems to get rid of the CPU consumption issue altogether? (It's still there, but much, much less so, and the interval could now be set really low.)

moeller0 commented 1 year ago

Interesting; on tmpfs these should not be all that costly, so I guess writing to a real disk would be even more painful...

One obvious change would be to write all reflector-specific values into a single file, so only two file writes instead of six...

Or only write these out before terminating the monitor process if they are only used to persist across stopped reflectors.
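A hypothetical sketch of that batching idea (the field names, helper names and path are illustrative, not taken from the repo):

    #!/usr/bin/env bash
    # Batch all per-reflector fields into one file so each update costs one write
    state_dir=/var/run/cake-autorate

    write_reflector_state()    # reflector baseline_us delta_us ewma_us
    {
        local reflector=${1}
        # One open/truncate/write instead of one file per field
        printf '%s %s %s\n' "${2}" "${3}" "${4}" > "${state_dir}/reflector_${reflector}_state"
    }

    read_reflector_state()     # reflector -> sets baseline_us, delta_us and ewma_us globally
    {
        local reflector=${1}
        read -r baseline_us delta_us ewma_us < "${state_dir}/reflector_${reflector}_state"
    }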

moeller0 commented 1 year ago

GNU ping will only allow querying one reflector, as will busybox ping; only fping and the coming tsping will query multiple reflectors round-robin. So, to be able to do sanity checking with GNU ping, the ability to run multiple concurrent single-reflector pingers seems valuable. However, if you use fping, you should only have one pinger and one ping_monitor process....

Why a large number of reflectors? Measuring against any reflector will measure the queueing level along the whole path; if, e.g., the datacenter housing our instance of 1.1.1.1 has a congested link to the rest of the internet and we only query 1.1.1.1, we would throttle the cake shapers even if the rest of the internet were reachable without congestion. So the theory behind multiple reflectors is that, over a large enough set, if all reflectors indicate congestion then this is most likely caused by the one link we want to control: our access link.

Why interleave pingers instead of querying all at the same time? This is to get decent temporal fidelity without over-taxing individual reflectors. Nobody owes us a veridical ping response, let alone a timely one, and reflectors can and do rate limit and deprioritize ICMP packet generation if the request rate gets too high. Interleaving allows us a high effective sampling rate with acceptable per-reflector rates, but at the cost of a somewhat larger set of reflectors in the active set...

rany2 commented 1 year ago

Thank you for taking the time to reply, much appreciated!

lynxthecat commented 1 year ago

So this confirms my suspicion above that writes and reads are eating up CPU cycles. How about my idea then of setting up an information highway between processes using coproc or similar, providing a common variable space accessible to all processes? One process could set a variable within the common space, and another process could read that variable from the common space by requesting the value for that variable and reading the response. I think this could be set up using coproc.

moeller0 commented 1 year ago

I do not think this is going to work the way you hope. Really the easiest way forward is to try to amortize the file cost over more content, like writing and reading an array with all fields in one go; instantaneously you have shaved off 4 writes. However, I think such tests should not be run on toy versions of autorate but on a production version with just enough changes to avoid the operation in question; microbenchmarks are informative, but come with their own specific challenges...

moeller0 commented 1 year ago

According to the manpage, coprocs are essentially background processes automatically wrapped in a bi-directional FIFO to share information/messages. Probably useful, but not a way to make global variables...

rany2 commented 1 year ago

I've actually made a proof of concept that uses coprocs to have "global" variables between different subshells. It does work; this is what I have in mind to fix this high CPU consumption, actually.
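The rough shape of such a proof of concept (a simplified sketch, not the actual implementation; VARSTORE, var_set and var_get are made-up names) could look like this:

    #!/usr/bin/env bash
    # A single coproc acts as a key=value store; other code talks to it over its pipes.
    # Caveat: bash does not hand the coproc fds to subshells, so real use needs care.

    coproc VARSTORE {
        declare -A store
        while IFS= read -r line; do
            cmd=${line%% *}
            args=${line#* }
            case ${cmd} in
                set) store[${args%%=*}]=${args#*=} ;;
                get) printf '%s=%s\n' "${args}" "${store[${args}]:-}" ;;
            esac
        done
    }

    var_set() { printf 'set %s=%s\n' "${1}" "${2}" >&"${VARSTORE[1]}"; }

    var_get()
    {
        local reply
        printf 'get %s\n' "${1}" >&"${VARSTORE[1]}"
        IFS= read -r reply <&"${VARSTORE[0]}"
        printf '%s\n' "${reply#*=}"
    }

    # Example usage from the parent shell
    var_set dl_shaper_rate_kbps 25000
    var_get dl_shaper_rate_kbps   # prints: 25000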

rany2 commented 1 year ago

Definitely our use will be very abstracted by the lib though :)

lynxthecat commented 1 year ago

It can be done @moeller0. I've seen a fully fledged working example on stackoverflow or a similar site. I've been trying to find it but haven't found it yet. A process could write to set a variable. And another process could read to obtain a variable. The way coprocs are implemented makes this possible for sure.

I do agree we should try to see if we can minimize the amount of such data sharing needed. But since we do need it sometimes, and since I think the coproc mechanism might offer an efficiency gain, it seems worth trying.

If I find the link to the example I saw before I will post it.

moeller0 commented 1 year ago

Well, go wild, but I would feel better about first exploiting the obvious simple things, like simply reducing the write operations to one third... these are really low-hanging fruit if the analysis is correct.

A process could write to set a variable. And another process could read to obtain a variable.

Well that "variable" according to my reading of this: https://www.gnu.org/software/bash/manual/html_node/Coprocesses.html

is just a 'A coprocess is a shell command preceded by the coproc reserved word. A coprocess is executed asynchronously in a subshell, as if the command had been terminated with the ‘&’ control operator, with a two-way pipe established between the executing shell and the coprocess.'

But hey, let my puzzlement not stop you; as I said before, I have too little time to actually help in any meaningful way, so I should stop getting in the way ;)

lynxthecat commented 1 year ago

There's an example of this out there. I wasn't dreaming it! With coprocs you read and write to FIFOs, so you can do whatever you want with sending/receiving data. Another idea I had, by the way, was to use coprocs for all processes, since we often set up FIFOs manually but with coprocs it's all already done for us.

moeller0 commented 1 year ago

With coprocs you read and write to FIFOs, so you can do whatever you want with sending/receiving data.

Well, in my reading you should be able to do the same without coprocs by simply instantiating the FIFOs between the processes manually, but the question is about how many consumers and producers you want to share the data between, I guess.

My goal would be to keep things simple and only optimize/complicate stuff where it is worth doing. As I said, I would first try to simply reduce the number of apparently costly write operations....

rany2 commented 1 year ago

@lynxthecat Just FYI we will need to get rid of our other coprocs to have it work. I ended up implementing it partially (for some files) only to hit this issue: https://unix.stackexchange.com/a/531199

lynxthecat commented 1 year ago

Are you sure that was the issue? Because I thought that from bash 5 onwards multiple coprocs work? I searched again today for some time, but I cannot find the example that I came across earlier of leveraging a coproc to facilitate setting and reading shared variables across different processes.

rany2 commented 1 year ago

I'm sure you can't use more than one coprocess. Anyway, if you want, you could see what I implemented in my master branch; it does work, but the log_stderr thing needs to be dropped as two coprocs cannot run at once.

rany2 commented 1 year ago

Actually, I'm pretty sure we could do the log_stderr thing without coprocesses; not sure why you opted for it. (It's more complex this way...)

lynxthecat commented 1 year ago

Sure - perhaps just a regular backgrounded function with an fd?

Your attempt looks promising. By the way, with:

https://github.com/lynxthecat/cake-autorate/compare/master...rany2:cake-autorate:master_draft#diff-ede52f00cfef420b52c6155b28d9b007efe4f1f29a727adebd00853cfec922caR99

Suppose two processes initiate that at once, how do you ensure that the read gives the correct response and not a response from another get attempt?

rany2 commented 1 year ago

The message response format is key=value, so if the response does not have the key= at the start it will retry. The function keeps calling itself indefinitely until it gets a valid reply.
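Roughly like this, assuming a coproc-backed store such as the hypothetical VARSTORE sketch above (again, the names are made up):

    # Re-request and re-read until the reply line starts with "<key>=",
    # discarding replies that belonged to another get attempt
    var_get()
    {
        local key=${1} reply
        while :; do
            printf 'get %s\n' "${key}" >&"${VARSTORE[1]}"
            IFS= read -r reply <&"${VARSTORE[0]}"
            if [[ ${reply} == "${key}="* ]]; then
                printf '%s\n' "${reply#*=}"
                return 0
            fi
        done
    }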

lynxthecat commented 1 year ago

Ah yes. Sweet. Have you tested to see if that saves cycles vs regular write and read?

rany2 commented 1 year ago

I haven't tested yet, but seeing as we are saving on a ton of syscalls, I'm 90% sure it would.

lynxthecat commented 1 year ago

Looks super cute to me, in any case.

rany2 commented 1 year ago

Thanks!

rany2 commented 1 year ago

@lynxthecat It feels nice to use, but just for info, it doesn't work well. Having a single process managing it forces you to retry far too many times in very rapid succession to get anything close to "realtime", which uses way too many resources. The best approach is to just have it write to a single file.

Of course, you might think we could use a lock, but then we would kind of defeat the whole purpose, unfortunately.

lynxthecat commented 1 year ago

Ah, I see. Seems like we will just have to accept the computational expense of the present strategy of writes to /tmp/ plus read and re-read with checks.

@moeller0 had the cool idea of sending information from one process to another by passing it in the form of a FIFO line, like 'COMMAND, X', to be read in a while loop. Perhaps there may be an application for that idea one way or another in this or other aspects of cake-autorate.
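For illustration, a minimal sketch of that 'COMMAND, X' pattern (the FIFO path and command names are made up):

    #!/usr/bin/env bash
    # One process writes command lines into a FIFO; another consumes them in a while-read loop
    fifo=/tmp/cake-autorate-commands.fifo
    mkfifo "${fifo}"

    consumer()
    {
        local command argument
        while IFS=', ' read -r command argument; do
            case ${command} in
                SET_SHAPER_RATE) echo "would set shaper rate to ${argument} Kbit/s" ;;
                TERMINATE)       break ;;
            esac
        done < "${fifo}"
    }

    consumer &

    # Producer: keep the write end open for the whole exchange
    {
        printf 'SET_SHAPER_RATE, 25000\n'
        printf 'TERMINATE, 0\n'
    } > "${fifo}"

    wait
    rm -f "${fifo}"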

rany2 commented 1 year ago

@lynxthecat that's what my coproc test did in this case, but the issue is that if multiple processes were writing at once then the result is mangled. So it cannot work without a locking mechanism and I don't think the lock could work with anything other than some file-based locking option.

rany2 commented 1 year ago

FYI, I think concurrent_read_integer is flawed. Instead of a regex check it should check for a known start and end sequence; your current implementation would allow reading part of a number that wasn't written completely.

rany2 commented 1 year ago

So basically any file that you write to would start and end with xxx, and concurrent_read_integer would check whether it starts and ends with xxx and strip it.
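Something like this sketch of the sentinel idea (not the shipped concurrent_read_integer; the helper names are made up):

    # Writes wrap the value in a known marker; reads retry until both markers are present
    write_value()   # path value
    {
        printf 'xxx%sxxx\n' "${2}" > "${1}"
    }

    read_value()    # path -> prints value, retrying on incomplete writes
    {
        local raw=
        while :; do
            [[ -f ${1} ]] && read -r raw < "${1}"
            if [[ ${raw} == xxx*xxx ]]; then
                raw=${raw#xxx}
                printf '%s\n' "${raw%xxx}"
                return 0
            fi
            sleep 0.01   # partially written or missing; try again shortly
        done
    }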

lynxthecat commented 1 year ago

Or like a checksum? There was a detailed discussion about how a read in this context will either be null or a good value owing to the underlying mechanisms at play in bash - see:

https://forum.openwrt.org/t/cake-w-adaptive-bandwidth/135379/1842

So I feel fairly confident that the present read and re-read with checks mechanism will do the right thing? There may still be the potential for optimisation though?

rany2 commented 1 year ago

Either way it's easier to just add some known start and stop boundary so you can check for it. Less expensive than a regex match IMO.

lynxthecat commented 1 year ago

In extremis, if colo on the OpenWrt forum is correct, then the regex could just be replaced with:

[[ -n ${value} ]] && value=$(( value ))

But is the following regex really super costly(?):

if [[ ${value} =~ ^([-])?([0]+)?([0-9]+)$ ]]; then
    # Strip out any leading zeros and employ arithmetic context
    value=$((${BASH_REMATCH[1]}${BASH_REMATCH[3]}))

Here is a comparison on my RT3200:

root@OpenWrt:~# time for ((i=1; i<=10000; i++)); do [[ ${value} =~ ^([-])?([0]+)?([0-9]+)$ ]] && value=$((${BASH_REMATCH[1]}${BASH_REMATCH[3]})); done

real    0m1.889s
user    0m0.985s
sys     0m0.313s

root@OpenWrt:~# time for ((i=1; i<=10000; i++)); do [[ -n ${value} ]] && value=$(( value )); done

real    0m0.321s
user    0m0.253s
sys     0m0.006s
sys     0m0.006s