Open avih opened 1 month ago
Well, that was an interesting one. Some code introduced in 2012 was buggy and resulted in a user-visible problem. A workaround was added in 2013 and the actual problem was fixed in 2018, but the workaround was left in place. Until today.
Details are in the commit message.
There are prereleases: anything after PRE-5442 has the fix.
Thanks.
So can I assume that `waitpid` is called very frequently (every command?) in the sh command loop, and that when it was implemented with a timeout of 1 ms it basically gave up its time slice, which would take anywhere between 1 ms and 15 ms (depending on the current system's load and timeBeginPeriod value), so each loop iteration of 2-3 commands took ~30 ms?
For what it's worth, for me the latest pre-release (5442) is still about x1.5 slower with a background process than without bg jobs, but at least it's not x10000 (x10K) slower like before. Still, a x1.5 slowdown is not negligible.
I'm guessing much of the time is spent in the two `ckmalloc` calls in each `waitpid_child`. Maybe keep the allocations between invocations and only enlarge them as needed? Or some other solution which wouldn't require such frequent (additional) mallocs?
Or maybe realloc only if `pid_nr` changed from the previous call? (And free if it's 0 and there is an allocation.)
Or, with more potential implications and complications, abort early in waitpid unless some X ms have elapsed since the last time it was entered (where X is 1 or 2 ms for instance).
Yes, `waitpid` is called every time a command is executed.
I've tried to improve matters with these changes:

- Avoiding one call to `ckmalloc` by getting rid of the `pidlist` array: the PID of the exiting process can be found from its handle.
- Keeping a static `proclist` array between calls. It will grow as big as it needs to be and is never freed. In most cases it'll probably only ever have a few elements.

Making measurements of this stuff in my Windows virtual machines is a pain. Not only does Windows randomly run things when least expected, the timings are affected by what's happening on the host and other VMs. I don't much trust the raw numbers so the overhead figures are only quoted to the nearest percentage point.
Background tasks      1     2     3     10
------------------------------------------
PRE-5442             19%   23%   28%
Drop pidlist         16%   21%   25%
Static proclist      13%   17%   21%
Single job loop      12%   17%   21%   46%
Those measurements are from my 32-bit Windows 8.1 VM. For 64-bit:
Background tasks      1     2     3
-----------------------------------
PRE-5442             37%   41%   46%
PRE-5443             31%   35%   38%
In both cases the figures show a slight improvement. As was probably to be expected, the overhead increases with the number of background tasks, by about 4% for each.
Thanks for the info and effort.
Unfortunately, it doesn't look like it helped much. With pre-release busybox-w64-PRE-5443-gdde99c5ae.exe and 1 bg job it's still x1.5 slower:
R:\>busybox-w64-PRE-5443-gdde99c5ae.exe sh
R:/ $ . ./time.sh
R:/ $ xtime loop 100000
619 ms
R:/ $ xtime loop 100000
616 ms
R:/ $ xtime loop 100000
617 ms
R:/ $ xtime loop 100000
621 ms
R:/ $ xtime loop 100000
613 ms
R:/ $ sleep 10&
[1] 13732
R:/ $ xtime loop 100000
894 ms
R:/ $ xtime loop 100000
893 ms
R:/ $ xtime loop 100000
897 ms
5442 pre-release:
R:\>busybox-w64-PRE-5442-gbb8f6b688.exe sh
R:/ $ . ./time.sh
R:/ $ xtime loop 100000
603 ms
R:/ $ xtime loop 100000
604 ms
R:/ $ xtime loop 100000
603 ms
R:/ $ sleep 10&
[1] 16576
R:/ $ xtime loop 100000
924 ms
R:/ $ xtime loop 100000
922 ms
R:/ $ xtime loop 100000
936 ms
R:/ $ xtime loop 100000
604 ms
[1]+ Done sleep 10
R:/ $ xtime loop 100000
605 ms
R:/ $
So without bg jobs 5443 is 2% slower, and with 1 bg job it's about 3% faster compared to 5442. These are small differences, but they're consistent when switching back and forth between the builds.
Apologies for not measuring before guessing that the mallocs might be responsible for much of the overhead. I would not have suggested the improvement had I known these were the expected gains.
Side note 1: my own (clang + ucrt) build takes about 400 ms for the 100K loop, while the pre-release takes ~600 ms. I'm guessing this is either due to a lack of optimization in the pre-release builds, or maybe ucrt is simply faster (I have noticed this in general, e.g. `strstr` is much faster with ucrt).
Side note 2: to measure more consistently without being subject to CPU frequency throttling, it's useful to set the maximum CPU percentage to 99% (Power Options, Change plan settings, Advanced, Processor power management, Maximum processor state). This prevents switching to the "turbo" frequency, which is not sustainable for long durations. It's also possible to do this from the command line as admin (and similarly restore it to 100 once you're done). This changes the "plugged in" value ("set ac value index"):
powercfg /setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMAX 99
powercfg /setactive SCHEME_CURRENT
Making measurements of this stuff in my Windows virtual machines is a pain.
Aren't you using busybox-w32 on Windows regularly-ish?
I assumed you must be, or else what motivation do you have to work on this?...
(but I do very very much appreciate that you do work on it)
I would not have suggested to improve if I knew these are the expected gains/changes.
Sure, they aren't much, but they are something. And the code is now more compact, so on the whole it's an improvement.
Aren't you using busybox-w32 on Windows regularly-ish?
Only to test it.
I assumed you must be, or else what motivation do you have to work on this?...
It's fun.
Busybox is very fun :-) Without busybox for Windows I probably would never have learned about the shell. It's great fun creating shell scripts, and I think busybox is superior to bash for a lot of things; busybox is missing just a few things compared to bash, but for the rest it's a lot better.
Test case: this happens not only with a `sleep` bg job. It also happens with notepad and, I think, anything else.