Closed garoto closed 3 years ago
The slow part is shuf -i "1500000000-$(date +%s)" -n 5
. In my Windows 8.1 VM it takes about 14s; BusyBox shuf
on Linux takes about 7s; coreutils shuf
on Linux takes milliseconds.
It's the price we pay for BusyBox being small: it really does generate all those lines and shuffle them before throwing away all but five. Coreutils just selects five values from the range.
I'll have a look at what it would take to make BusyBox shuf
cleverer.
It's the price we pay for BusyBox being small: it really does generate all those lines and shuffle them before throwing away all but five. Coreutils just selects five values from the range.
Interesting. Thanks for looking into it. Just tried the oneliner using a recent-ish PortableGit shuf.exe
binary and it performs much better indeed:
root@crap-pc:R:/ # time sh -c 'for i in $(J:/Misc/PortableGit-2.29.2.3-64-bit/usr/bin/shuf.exe -i "1500000000-$(date +%s)" -n 5) ; do date -d @$i +%Y%m%d%H%M.%S; done'
201904210844.40
201711241649.18
202004130214.09
201709281604.52
201812020153.40
real 0m 0.14s
user 0m 0.00s
sys 0m 0.01s
I've improved performance of BusyBox shuf
by stopping shuffling once the required number of lines has been processed. On Linux this reduces the time for the sample command from 6.9s to 0.4s. Processing 5 lines is much faster than 115 million!
This isn't as highly optimised as coreutils shuf
but I'd like to get the change accepted upstream. I've submitted a patch that increases the size of the BusyBox binary by 39 bytes. We'll see if upstream considers this a sensible trade-off.
Awesome and much kudos to you for all your work on this busybox port!
On a not-so-related-note, when I try to apply your patch as posted to the busybox mailinglist to either busybox src tree and to your own fork tree, it always fails:
root@buildbox:~/src/busybox # git fetch
root@buildbox:~/src/busybox # git slog|head -n1
cbfdeba66 10-Mar-2021 16:31 +0100 Denys Vlasenko (HEAD -> master, origin/master, origin/HEAD) hush: make LINENO selectable without BASH-COMPAT
root@buildbox:~/src/busybox # git am < shuf3.patch
Applying:
error: patch failed: coreutils/shuf.c:39
error: coreutils/shuf.c: patch does not apply
Patch failed at 0001
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
root@buildbox:~/src/busybox # git am --abort
root@buildbox:~/src/busybox # patch --verbose -p1 -i shuf3.patch
Hmm... Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/coreutils/shuf.c b/coreutils/shuf.c
|index fdbd3e9b2..839d1b80f 100644
|--- a/coreutils/shuf.c
|+++ b/coreutils/shuf.c
--------------------------
patching file coreutils/shuf.c
Using Plan A...
Hunk #1 succeeded at 39 with fuzz 2.
Hunk #2 FAILED at 50.
Hunk #3 FAILED at 69.
Hunk #4 FAILED at 130.
3 out of 4 hunks FAILED -- saving rejects to file coreutils/shuf.c.rej
done
root@buildbox:~/src/busybox # cat shuf3.patch
diff --git a/coreutils/shuf.c b/coreutils/shuf.c
index fdbd3e9b2..839d1b80f 100644
--- a/coreutils/shuf.c
+++ b/coreutils/shuf.c
@@ -39,8 +39,10 @@
/*
* Use the Fisher-Yates shuffle algorithm on an array of lines.
+ * If the required number of output lines is less than the total
+ * we can stop shuffling early.
*/
-static void shuffle_lines(char **lines, unsigned numlines)
+static void shuffle_lines(char **lines, unsigned numlines, unsigned outlines)
{
unsigned i;
unsigned r;
@@ -48,7 +50,7 @@ static void shuffle_lines(char **lines, unsigned numlines)
srand(monotonic_us());
- for (i = numlines-1; i > 0; i--) {
+ for (i = numlines-1; i > 0 && outlines > 0; i--, outlines--) {
r = rand();
/* RAND_MAX can be as small as 32767 */
if (i > RAND_MAX)
@@ -67,7 +69,7 @@ int shuf_main(int argc, char **argv)
char *opt_i_str, *opt_n_str, *opt_o_str;
unsigned i;
char **lines;
- unsigned numlines;
+ unsigned numlines, outlines;
char eol;
opts = getopt32(argv, "^"
@@ -128,24 +130,24 @@ int shuf_main(int argc, char **argv)
fclose_if_not_stdin(fp);
}
- if (numlines != 0)
- shuffle_lines(lines, numlines);
+ outlines = numlines;
+ if (opts & OPT_n) {
+ outlines = xatou(opt_n_str);
+ if (outlines > numlines)
+ outlines = numlines;
+ }
+
+ if (numlines != 0 && outlines != 0)
+ shuffle_lines(lines, numlines, outlines);
if (opts & OPT_o)
xmove_fd(xopen(opt_o_str, O_WRONLY|O_CREAT|O_TRUNC), STDOUT_FILENO);
- if (opts & OPT_n) {
- unsigned maxlines;
- maxlines = xatou(opt_n_str);
- if (numlines > maxlines)
- numlines = maxlines;
- }
-
eol = '\n';
if (opts & OPT_z)
eol = '\0';
- for (i = 0; i < numlines; i++) {
+ for (i = numlines-outlines; i < numlines; i++) {
if (opts & OPT_i)
printf("%u%c", (unsigned)(uintptr_t)lines[i], eol);
else
root@buildbox:~/src/busybox # uname -a
Linux buildbox 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64 GNU/Linux
root@buildbox:~/src/busybox # lsb
lsblk lsb_release
root@buildbox:~/src/busybox # lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux bullseye/sid
Release: testing
Codename: bullseye
root@buildbox:~/src/busybox #
Got any hints of what I might be doing wrong?
- for (i = numlines-1; i > 0; i--) {
+ for (i = numlines-1; i > 0 && outlines > 0; i--, outlines--) {
If that's what's really in your file (and it isn't an artefact introduced by getting the text into GitHub) it looks as though you might have saved it as HTML rather than text. Maybe.
If that's what's really in your file [...]
Yep, guessing that is it. I 'generated' (ctrl+c & ctrl+v) the patch file by copying and pasting it from the html rendered output from the mailing list web interface.
Sorry for the noise.
Ugh, sorry for being a nuisance, but can you make your shuf patch avaiable in an ASCII format for me to test, since I can't figure out how to extract the relevant bits from the html rendition of the maililing list.
Hmm, the mailing list archive doesn't make it easy to get a pure text copy of a message. I did it in Firefox by selecting 'Save Page As...' from the right click menu and changing the format to 'Text Files'. Alternatively it's possible to download all the messages for the month as a gzipped mbox file.
Anyway, to make it easier for anyone who wants to test this I've put the patch and a binary built by applying it on top of current git HEAD (commit 35f7c5e6e8519ca16363c799fb1b112edf1eb12b) in the prerelease directory of my website.
The contents of the prerelese directory are transitory, things may vanish from there without notice.
It's been a while and upstream BusyBox have shown no interest in my patch, so I took another look at it. I've managed to make it even smaller, reducing the bloat from 39 bytes to 20 bytes with no material impact on performance.
I've submitted the revised patch to the upstream mailing list. There's a copy in the prerelease directory along with a new binary built by applying the patch to current git HEAD (commit c6bb78bf30f17aeac45e4bf9ea7eb1856fb5cdb7).
Upstream BusyBox accepted the second version of my patch.
In my Windows 8.1 VM the command shuf -i "1500000000-$(date +%s)" -n 5
now takes about 17s with the current release (slightly longer than the time reported above, due to the passage of time). In the latest prerelease it takes 0.25s.
real 0m 0.32s
on my potato rig. Glad your patch was approved upstream, thank you for working on this for this long.
The improved version is in the latest release.
Windows 8.1 + cmd.exe + busybox64 v1.34.0-FRP-3902-g61e53aa93
My nix shell knowledge is mediocre at best, so let me know if my oneliner syntax is the problem.