Open avih opened 1 month ago
Well, that was an interesting one. Some code introduced in 2012 was buggy and resulted in a user-visible problem. A workaround was added in 2013 and the actual problem was fixed in 2018, but the workaround was left in place. Until today.
Details are in the commit message.
There are prereleases: anything after PRE-5442 has the fix.
Thanks.
So can I assume that `waitpid` is called very frequently (every command?) in the sh command loop, and that when it was implemented with a timeout of 1 ms it basically gave up its time slice, which would take anywhere between 1 ms and 15 ms (depending on the current system's load and timeBeginPeriod value), so each loop iteration of 2-3 commands took ~30 ms?
For what it's worth, for me the latest pre-release (5442) is still about x1.5 slower with a background process than without bg jobs, but at least it's not x10000 (x10K) slower like before. Still, a x1.5 slowdown is not negligible.
I'm guessing much of the time is spent in the two `ckmalloc` calls in each `waitpid_child`. Maybe keep the allocations between invocations and only enlarge them as needed? Or some other solution which wouldn't require such frequent (additional) mallocs?
Or maybe realloc only if `pid_nr` changed from the previous call? (And free if it's 0 and there is an allocation.)
Or, with more potential implications and complications, abort early in waitpid unless some X ms have elapsed since the last time it was entered (where X is 1 or 2 ms for instance).
Yes, `waitpid` is called every time a command is executed.
I've tried to improve matters with these changes:

- Avoiding one call to `ckmalloc` by getting rid of the `pidlist` array: the PID of the exiting process can be found from its handle.
- Keeping a static `proclist` array between calls. It will grow as big as it needs to be and is never freed. In most cases it'll probably only ever have a few elements.

Making measurements of this stuff in my Windows virtual machines is a pain. Not only does Windows randomly run things when least expected, the timings are affected by what's happening on the host and other VMs. I don't much trust the raw numbers so the overhead figures are only quoted to the nearest percentage point.
Background tasks      1     2     3     10
------------------------------------------
PRE-5442             19%   23%   28%
Drop pidlist         16%   21%   25%
Static proclist      13%   17%   21%
Single job loop      12%   17%   21%   46%
Those measurements are from my 32-bit Windows 8.1 VM. For 64-bit:
Background tasks      1     2     3
-----------------------------------
PRE-5442             37%   41%   46%
PRE-5443             31%   35%   38%
In both cases the figures show a slight improvement. As was probably to be expected, the overhead increases with the number of background tasks, by about 4% for each.
Thanks for the info and effort.
Unfortunately, it doesn't look like it helped much. With pre-release busybox-w64-PRE-5443-gdde99c5ae.exe and 1 bg job it's still x1.5 slower:
R:\>busybox-w64-PRE-5443-gdde99c5ae.exe sh
R:/ $ . ./time.sh
R:/ $ xtime loop 100000
619 ms
R:/ $ xtime loop 100000
616 ms
R:/ $ xtime loop 100000
617 ms
R:/ $ xtime loop 100000
621 ms
R:/ $ xtime loop 100000
613 ms
R:/ $ sleep 10&
[1] 13732
R:/ $ xtime loop 100000
894 ms
R:/ $ xtime loop 100000
893 ms
R:/ $ xtime loop 100000
897 ms
5442 pre-release:
R:\>busybox-w64-PRE-5442-gbb8f6b688.exe sh
R:/ $ . ./time.sh
R:/ $ xtime loop 100000
603 ms
R:/ $ xtime loop 100000
604 ms
R:/ $ xtime loop 100000
603 ms
R:/ $ sleep 10&
[1] 16576
R:/ $ xtime loop 100000
924 ms
R:/ $ xtime loop 100000
922 ms
R:/ $ xtime loop 100000
936 ms
R:/ $ xtime loop 100000
604 ms
[1]+ Done sleep 10
R:/ $ xtime loop 100000
605 ms
R:/ $
So without bg jobs 5443 is 2% slower, and with 1 bg job it's about 3% faster compared to 5442. These are small differences, but they're consistent when switching back and forth between the builds.
Apologies for not measuring before guessing that the mallocs might be responsible for much of the overhead. I would not have suggested the improvement had I known these were the expected gains.
Side note 1: my own (clang + ucrt) build takes about 400 ms for the 100K loop, while the pre-release takes ~600 ms. I'm guessing this is either due to a lack of optimization in the pre-release builds, or maybe ucrt is simply faster (I have noticed this in general, e.g. `strstr` is much faster with ucrt).
Side note 2: to measure more consistently without being subject to CPU frequency throttling, it's useful to set the maximum CPU percentage to 99% (Power Options, Change plan settings, Advanced, Processor power management, Maximum processor state). This prevents switching to the "turbo" frequency, which is not sustainable for long durations. It's also possible to do this from the command line as admin (and similarly restore it to 100 once you're done). This changes the "plugged in" value ("set ac value index"):
powercfg /setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMAX 99
powercfg /setactive SCHEME_CURRENT
Making measurements of this stuff in my Windows virtual machines is a pain.
Aren't you using busybox-w32 on Windows regularly-ish?
I assumed you must be, or else what motivation do you have to work on this?...
(but I do very very much appreciate that you do work on it)
I would not have suggested to improve if I knew these are the expected gains/changes.
Sure, they aren't much, but they are something. And the code is now more compact, so on the whole it's an improvement.
Aren't you using busybox-w32 on Windows regularly-ish?
Only to test it.
I assumed you must be, or else what motivation do you have to work on this?...
It's fun.
Busybox is very fun :-) Without busybox for Windows I probably would never have learned about the shell. It's great fun creating shell scripts, and I think busybox is superior to bash for a lot of things; busybox is missing just a few things compared to bash, but for the rest it's a lot better.
Test case: this happens not only with a `sleep` bg job. It also happens with notepad and, I think, anything else.