shlomif opened this issue 4 years ago
Yikes, that is pretty bad. I'll look into it.
On Mon, 30 Mar 2020 05:02:04 -0700 Alex Lyon notifications@github.com wrote:
Yikes, that is pretty bad. I'll look into it.
Thanks!
You might want to try again now. Performance is ~16-17 times worse than GNU right now (at least for me).
When #1547 is merged, performance will be ~6-7 times worse (as measured on my desktop).
@Arcterus : thanks! The performance looks much better on uutils' git master HEAD:
[shlomif@localhost coreutils]$ time (seq 2 "$(( 10 ** 6 ))" | ./target/release/coreutils factor | md5sum)
4cfd4f52505c4e3852c373b8b2e8a628 -
( seq 2 "$(( 10 ** 6 ))" | ./target/release/coreutils factor | md5sum; ) 1.46s user 0.05s system 112% cpu 1.341 total
[shlomif@localhost coreutils]$ time (seq 2 "$(( 10 ** 6 ))" | /usr/bin/factor | md5sum)
4cfd4f52505c4e3852c373b8b2e8a628 -
( seq 2 "$(( 10 ** 6 ))" | /usr/bin/factor | md5sum; ) 0.26s user 0.10s system 152% cpu 0.236 total
[shlomif@localhost coreutils]$
@shlomif You are welcome :)
I left the bug open through my performance-related PRs to factor, as it's not yet as fast as the GNU implementation, but I'm hoping to get there... and beyond! :smiling_imp:
Another issue with the current factor implementation is that it only supports numbers up to 2⁶⁴ - 1. GNU factor supports numbers up to 2¹²⁷ - 1 if compiled without the GNU Multiple Precision (GMP) library or arbitrary-precision numbers if compiled with it. This can be tested by running this command:
factor 9223372036854775807 18446744073709551615 170141183460469231731687303715884105727 340282366920938463463374607431768211455
which runs factor on 2⁶³ - 1, 2⁶⁴ - 1, 2¹²⁷ - 1 and 2¹²⁸ - 1.
GNU factor without GMP produces:
$ factor 9223372036854775807 18446744073709551615 170141183460469231731687303715884105727 340282366920938463463374607431768211455
9223372036854775807: 7 7 73 127 337 92737 649657
18446744073709551615: 3 5 17 257 641 65537 6700417
170141183460469231731687303715884105727: 170141183460469231731687303715884105727
factor: ‘340282366920938463463374607431768211455’ is too large
GNU factor with GMP produces:
$ factor 9223372036854775807 18446744073709551615 170141183460469231731687303715884105727 340282366920938463463374607431768211455
9223372036854775807: 7 7 73 127 337 92737 649657
18446744073709551615: 3 5 17 257 641 65537 6700417
170141183460469231731687303715884105727: 170141183460469231731687303715884105727
340282366920938463463374607431768211455: 3 5 17 257 641 65537 274177 6700417 67280421310721
and uutils' factor produces:
$ uu-factor 9223372036854775807 18446744073709551615 170141183460469231731687303715884105727 340282366920938463463374607431768211455
factor: warning: 170141183460469231731687303715884105727: number too large to fit in target type
factor: warning: 340282366920938463463374607431768211455: number too large to fit in target type
9223372036854775807: 7 7 73 127 337 60247241209
18446744073709551615: 3 5 17 257 641 65537 6700417
This was first documented in #201.
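The 2⁶⁴ - 1 ceiling above corresponds to Rust's native u64 range; as a quick illustration (standard-library behavior, not uutils code), widening to u128 is what would be needed to cover GNU factor's non-GMP range:

```rust
fn main() {
    // 2^64 - 1 is the largest value a u64 can hold...
    assert!("18446744073709551615".parse::<u64>().is_ok());
    // ...so 2^64 (and anything above) fails to parse into one.
    assert!("18446744073709551616".parse::<u64>().is_err());
    // A u128 covers up to 2^128 - 1, including GNU's 2^127 - 1 non-GMP limit.
    assert!("170141183460469231731687303715884105727".parse::<u128>().is_ok());
    assert!("340282366920938463463374607431768211455".parse::<u128>().is_ok());
    println!("ok");
}
```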
@nbraud Thanks for your work improving the performance! I am getting incorrect results for some numbers, for example:
$ factor 10425511
10425511: 2441 4271
$ uu-factor 10425511
10425511: 10425511
You can see a bunch by running this in Bash:
diff <(seq 2 $(( 10 ** 8 )) | factor) <(seq 2 $(( 10 ** 8 )) | uu-factor)
If you have time, would you mind tracking down which commit introduced the problem? If not, I’ll find it later today or tomorrow.
On 2020/06/22 at 7:13 AM, Teal Dulcet notifications@github.com wrote:
Another issue with the current factor implementation is that it only supports numbers up to 2⁶⁴ - 1. GNU factor supports numbers up to 2¹²⁷ - 1 if compiled without the GNU Multiple Precision (GMP) library or arbitrary-precision numbers if compiled with it.
I'm aware, and it's on my TODO list, but I do not believe it makes much sense to implement support for larger numbers while the current implementation would be far too slow to meaningfully use there.
Please open an issue about it, though, as it's definitely worth tracking :)
If you have time, would you mind tracking down which commit introduced the problem? If not, I’ll find it later today or tomorrow.
FYI, @Arcterus, I found the bug and opened a new PR. (Update: fixed and merged)
The merged commit effb94b0 (thanks primarily to @nbraud) moves the needle on this code's performance from about 1/200th to about 1/3rd the speed of GNU factor for very large sets of numbers (e.g., factoring the first 10,000,000 integers). For a smaller set of numbers, it now operates at perceptually the same speed as GNU factor (likely dominated by startup and I/O). The latest benchmark comparison is included in the discussion of #1571.
There are obviously more performance gains to be had by optimizing the algorithms, but the effort grows steeply for ever-smaller improvements. If someone wants to champion massaging this into a world-record breaker, I'm all for it. Otherwise, I'm happy with the current code state and performance and think this is a good place to stop.
@shlomif , your comments/experience/thoughts? @uutils/members ?
Hi @rivy ! Thanks for the heads up.
The performance ratio is somewhat worse here:
[shlomif@telaviv1 coreutils]$ time ( seq 2 10000000 | target/release/coreutils factor > /dev/null )
( seq 2 10000000 | target/release/coreutils factor > /dev/null; ) 12.72s user 0.13s system 101% cpu 12.620 total
[shlomif@telaviv1 coreutils]$ time ( seq 2 10000000 | factor > /dev/null )
( seq 2 10000000 | factor > /dev/null; ) 2.82s user 0.17s system 107% cpu 2.776 total
[shlomif@telaviv1 coreutils]$ perl -E 'say 12.72 / 2.776'
4.5821325648415
[shlomif@telaviv1 coreutils]$ inxi -CSG
System:
Host: telaviv1.shlomifish.org Kernel: 5.9.3-desktop-1.mga8 x86_64 bits: 64
Desktop: KDE Plasma 5.20.2 Distro: Mageia 8 mga8
CPU:
Info: Dual Core model: Intel Core i3-2100 bits: 64 type: MT MCP
L2 cache: 3072 KiB
Speed: 1596 MHz min/max: 1600/3100 MHz Core speeds (MHz): 1: 1596 2: 1596
3: 1596 4: 1597
Graphics:
Device-1: Intel 2nd Generation Core Processor Family Integrated Graphics
driver: i915 v: kernel
Display: server: Mageia X.org 1.20.9 driver: intel
resolution: 1920x1080~60Hz
OpenGL: renderer: Mesa DRI Intel HD Graphics 2000 (SNB GT1)
v: 3.3 Mesa 20.2.1
[shlomif@telaviv1 coreutils]$
Anyway, I am fine with closing this ticket because my interest in uutils is primarily academic. A 4.5 ratio is much more acceptable than a 200x one, so good job - and thanks.
I can confirm @shlomif's results. For numbers less than 10⁴, it is about 2.4× slower; for 10⁵, about 3.5×; for 10⁶, about 4.5×; for 10⁷, about 4.3×; and for 10⁸, about 3.2× slower. For example, here is a benchmark of the system GNU factor (factor), a locally compiled version of GNU factor (./factor) and uutils' factor (./uu-factor):
Benchmark #1: seq 0 1000000 | factor
Time (x̅ mean ± σ std dev): 0.3190s ± 0.0051s [User: 0.2982s, System: 0.0504s]
Range (min … x̃ median … max): 0.311s … 0.322s … 0.324s CPU: 109.3%, 5 runs
Benchmark #2: seq 0 1000000 | ./factor
Time (x̅ mean ± σ std dev): 0.3080s ± 0.0088s [User: 0.2640s, System: 0.0706s]
Range (min … x̃ median … max): 0.295s … 0.308s … 0.318s CPU: 108.6%, 5 runs
Benchmark #3: seq 0 1000000 | ./uu-factor
Time (x̅ mean ± σ std dev): 1.3884s ± 0.0332s [User: 1.3698s, System: 0.0276s]
Range (min … x̃ median … max): 1.347s … 1.383s … 1.444s CPU: 100.6%, 5 runs
Summary
#2 ‘seq 0 1000000 | ./factor’ ran
1.036 ± 0.034 times (103.6%) faster than #1 ‘seq 0 1000000 | factor’
4.508 ± 0.168 times (450.8%) faster than #3 ‘seq 0 1000000 | ./uu-factor’
(This is similar to @rivy's benchmarks, but using my Bash port of hyperfine, which outputs more info.)
However, if you go back 10⁶ from 2⁶⁴, it is 5.8× slower:
Benchmark #1: seq 18446744073708551615 18446744073709551615 | factor
Time (x̅ mean ± σ std dev): 58.2510s ± 0.2072s [User: 58.1622s, System: 2.4806s]
Range (min … x̃ median … max): 58.040s … 58.163s … 58.525s CPU: 104.1%, 5 runs
Benchmark #2: seq 18446744073708551615 18446744073709551615 | ./factor
Time (x̅ mean ± σ std dev): 58.0552s ± 0.1607s [User: 58.0212s, System: 2.5270s]
Range (min … x̃ median … max): 57.800s … 58.063s … 58.307s CPU: 104.3%, 5 runs
Benchmark #3: seq 18446744073708551615 18446744073709551615 | ./uu-factor
Time (x̅ mean ± σ std dev): 336.7276s ± 0.4459s [User: 336.7042s, System: 0.3726s]
Range (min … x̃ median … max): 335.909s … 336.824s … 337.240s CPU: 100.1%, 5 runs
Summary
#2 ‘seq 18446744073708551615 18446744073709551615 | ./factor’ ran
1.003 ± 0.005 times (100.3%) faster than #1 ‘seq 18446744073708551615 18446744073709551615 | factor’
5.800 ± 0.018 times (580.0%) faster than #3 ‘seq 18446744073708551615 18446744073709551615 | ./uu-factor’
Since it seems to be slower for larger numbers, it follows that the ratio will be even worse once #1559 is fixed. While the performance has definitely improved a lot, I think this issue should at least be left open until it is no more than around two times slower in all cases, particularly after arbitrary-precision numbers are supported.
I'm happy to leave this as an open issue to encourage more refinement, with a plan to revisit next year.
PRs are welcome!
Since it seems to be slower for larger numbers, it follows that the ratio will be even worse once #1559 is fixed.
Yes, that's a large part of why I didn't bother implementing support for arbitrarily-large numbers yet, since it would be unusably slow anyhow.
While the performance has definitely improved a lot, I think this issue should at least be left open until it is no more than two times slower in all cases, particularly after arbitrary-precision numbers are supported.
I don't have strong opinions about what the performance threshold should be before closing this, but I have some branches implementing better algorithms, so there's a clear direction forward for closing this.
Unfortunately, I was pretty broken for the last half-year and didn't have a workstation to run builds and benchmarks on (so far it's all been on my laptop, which is very slow and has thermal issues that make precise benchmarking mostly impossible), but things should hopefully be looking up this year.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
it's not yet as fast as the GNU implementation, but I'm hoping to get there... and beyond! 😈
This C++ library allegedly has better performance than GNU factor and it does have good documentation on the algorithms the author used to achieve that (both on the README and in the code), so it may be a good reference for potential improvements here: https://github.com/hurchalla/factoring. It also supports up to 128-bit numbers.
FYI, if you want to make up a 4× performance difference, doing so is pretty easy here, as https://en.algorithmica.org/hpc/algorithms/factorization/ has a good overview for fast factoring of relatively small numbers in Rust. The key thing missing here is the Pollard-Brent algorithm, which lets you compute the GCD much less often and should yield a roughly 5× speedup (you could also improve the gcd computation, but that is a smaller detail).
For performance on >64 bits, you would probably need ECM if you want top-of-the-line performance, but that gets complicated fairly quickly.
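For illustration, here is a minimal sketch of Pollard's rho with Brent's improvement: differences are multiplied together and the GCD is taken once per batch rather than once per iteration. The batch size, starting point, and the factor_one retry wrapper are my own assumptions; this is not uutils' implementation, and a real version would sit behind a primality test.

```rust
fn mul_mod(a: u64, b: u64, m: u64) -> u64 {
    // Widen to u128 so a * b cannot overflow before the reduction.
    ((a as u128 * b as u128) % m as u128) as u64
}

fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

/// Brent's variant of Pollard's rho: accumulate |x - y| products in `q`
/// and take one gcd per batch instead of one per iteration.
/// Expects `n` odd and composite; may return `n` on a rare full collision.
fn pollard_brent(n: u64, x0: u64, c: u64) -> u64 {
    let f = |x: u64| (mul_mod(x, x, n) + c) % n;
    let m = 128; // gcd is taken once per up-to-m steps
    let (mut y, mut r, mut q, mut g) = (x0, 1u64, 1u64, 1u64);
    let (mut x, mut ys) = (y, y);
    while g == 1 {
        x = y;
        for _ in 0..r {
            y = f(y);
        }
        let mut k = 0;
        while k < r && g == 1 {
            ys = y;
            for _ in 0..m.min(r - k) {
                y = f(y);
                q = mul_mod(q, x.abs_diff(y), n);
            }
            g = gcd(q, n);
            k += m;
        }
        r *= 2;
    }
    if g == n {
        // The batch jumped past the factor; replay it one gcd at a time.
        loop {
            ys = f(ys);
            g = gcd(x.abs_diff(ys), n);
            if g > 1 {
                break;
            }
        }
    }
    g
}

/// Retry with a different polynomial constant until a proper factor appears.
fn factor_one(n: u64) -> u64 {
    for c in 1.. {
        let g = pollard_brent(n, 2, c);
        if g != n {
            return g;
        }
    }
    unreachable!()
}

fn main() {
    // 10425511 = 2441 * 4271, the misfactored number reported earlier.
    let n = 10425511;
    let d = factor_one(n);
    println!("{} = {} * {}", n, d, n / d);
}
```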
has a good overview for fast factoring of relatively small numbers in Rust
Yes, that looks like a good resource, although note that his example code is in C++ not Rust. While uutils should definitely implement the Pollard-Brent rho algorithm as GNU has, I am not sure it would lead to a 5× speed up. I ran his factor code and it was orders of magnitude slower than GNU factor, although he does not have a complete implementation of all his methods combined, so it is hard to directly compare the performance. It looks like uutils already uses the faster binary GCD algorithm.
There is more information about all the algorithms used by GNU factor here: https://www.maizure.org/projects/decoded-gnu-coreutils/factor.html.
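For reference, the binary GCD mentioned above replaces division with shifts and subtraction; this is a minimal sketch of the classic algorithm (an illustration of the technique, not uutils' actual code):

```rust
/// Binary GCD (Stein's algorithm): uses only shifts, subtraction and
/// comparisons, avoiding the hardware division of the Euclidean method.
fn binary_gcd(mut a: u64, mut b: u64) -> u64 {
    if a == 0 {
        return b;
    }
    if b == 0 {
        return a;
    }
    // Factor out the powers of two common to both operands.
    let shift = (a | b).trailing_zeros();
    a >>= a.trailing_zeros();
    loop {
        // Dropping b's factors of two leaves gcd(a, b) unchanged, as a is odd.
        b >>= b.trailing_zeros();
        if a > b {
            std::mem::swap(&mut a, &mut b);
        }
        b -= a; // both odd, so the difference is even
        if b == 0 {
            break;
        }
    }
    a << shift
}

fn main() {
    assert_eq!(binary_gcd(48, 36), 12);
    assert_eq!(binary_gcd(17, 13), 1);
    assert_eq!(binary_gcd(0, 7), 7);
    println!("ok");
}
```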
This C++ library allegedly has better performance than GNU factor
Since it has now been almost 3 years, I thought I should update the benchmarks from https://github.com/uutils/coreutils/issues/1456#issuecomment-721688400 and include more programs for comparison. For integers less than 10⁴, it is now about 14.8× slower; for 10⁵, about 7.6×; for 10⁶, about 6.0×; for 10⁷, about 5.7×; and for 10⁸, about 4.2× slower:
$ ./time.sh -i -r 5 'seq 2 10000000 | '{factor,./gnu/factor,./uu/factor,./factoring/gcc_example,./factoring/clang_example,'./numbers -p'}
Benchmark #1: seq 2 10000000 | factor
Time (mean ± σ std dev): 2.2944s ± 0.0112s [User: 2.0142s, System: 0.4204s]
Range (min … median … max): 2.281s … 2.291s … 2.313s CPU: 106.1%, 5 runs
Benchmark #2: seq 2 10000000 | ./gnu/factor
Time (mean ± σ std dev): 2.2020s ± 0.0087s [User: 1.8820s, System: 0.4534s]
Range (min … median … max): 2.188s … 2.201s … 2.213s CPU: 106.1%, 5 runs
Benchmark #3: seq 2 10000000 | ./uu/factor
Time (mean ± σ std dev): 12.6390s ± 0.0296s [User: 10.2822s, System: 2.4632s]
Range (min … median … max): 12.591s … 12.637s … 12.684s CPU: 100.8%, 5 runs
Benchmark #4: seq 2 10000000 | ./factoring/gcc_example
Time (mean ± σ std dev): 15.2668s ± 0.1457s [User: 12.6782s, System: 2.7610s]
Range (min … median … max): 15.093s … 15.291s … 15.512s CPU: 101.1%, 5 runs
Benchmark #5: seq 2 10000000 | ./factoring/clang_example
Time (mean ± σ std dev): 15.2658s ± 0.0899s [User: 12.7338s, System: 2.7064s]
Range (min … median … max): 15.172s … 15.212s … 15.419s CPU: 101.1%, 5 runs
Benchmark #6: seq 2 10000000 | ./numbers -p
Time (mean ± σ std dev): 19.5844s ± 0.2708s [User: 17.0100s, System: 2.7530s]
Range (min … median … max): 19.241s … 19.561s … 20.072s CPU: 100.9%, 5 runs
Summary
#2 ‘seq 2 10000000 | ./gnu/factor’ ran
1.042 ± 0.007 times (4.2%) faster than #1 ‘seq 2 10000000 | factor’
5.740 ± 0.026 times (474.0%) faster than #3 ‘seq 2 10000000 | ./uu/factor’
6.933 ± 0.072 times (593.3%) faster than #4 ‘seq 2 10000000 | ./factoring/gcc_example’
6.933 ± 0.049 times (593.3%) faster than #5 ‘seq 2 10000000 | ./factoring/clang_example’
8.894 ± 0.128 times (789.4%) faster than #6 ‘seq 2 10000000 | ./numbers -p’
Overall, uutils' factor seems to have gotten significantly slower over the past 3 years. Considering that GNU factor has largely not changed in over 10 years (since the rewrite in 2012) and uutils' factor has not changed in two years (since @nbraud left), my best guess is that this is due to compiler improvements that have benefited GNU factor more than uutils' factor.
If you go back 10⁵ from 2⁶³, it is now 5.2× slower than GNU factor, but 20.9× slower than the EPR Factoring Library I mentioned above:
$ ./time.sh -i -r 5 "seq $(bc <<<'(2^63) - (10^5) - 1') $(bc <<<'(2^63) - 1') | "{factor,./gnu/factor,./uu/factor,./factoring/gcc_example,./factoring/clang_example}
Benchmark #1: seq 9223372036854675807 9223372036854775807 | factor
Time (mean ± σ std dev): 3.2782s ± 0.0088s [User: 3.2758s, System: 0.0080s]
Range (min … median … max): 3.267s … 3.276s … 3.294s CPU: 100.2%, 5 runs
Benchmark #2: seq 9223372036854675807 9223372036854775807 | ./gnu/factor
Time (mean ± σ std dev): 3.1176s ± 0.0144s [User: 3.0974s, System: 0.0146s]
Range (min … median … max): 3.104s … 3.114s … 3.145s CPU: 99.8%, 5 runs
Benchmark #3: seq 9223372036854675807 9223372036854775807 | ./uu/factor
Time (mean ± σ std dev): 16.2826s ± 0.1737s [User: 16.2178s, System: 0.0360s]
Range (min … median … max): 16.014s … 16.267s … 16.557s CPU: 99.8%, 5 runs
Benchmark #4: seq 9223372036854675807 9223372036854775807 | ./factoring/gcc_example
Time (mean ± σ std dev): 1.1940s ± 0.6380s [User: 0.8254s, System: 0.0332s]
Range (min … median … max): 0.873s … 0.875s … 2.470s CPU: 71.9%, 5 runs
Benchmark #5: seq 9223372036854675807 9223372036854775807 | ./factoring/clang_example
Time (mean ± σ std dev): 0.7786s ± 0.0031s [User: 0.7162s, System: 0.0510s]
Range (min … median … max): 0.773s … 0.779s … 0.782s CPU: 98.5%, 5 runs
Summary
#5 ‘seq 9223372036854675807 9223372036854775807 | ./factoring/clang_example’ ran
4.210 ± 0.020 times (321.0%) faster than #1 ‘seq 9223372036854675807 9223372036854775807 | factor’
4.004 ± 0.025 times (300.4%) faster than #2 ‘seq 9223372036854675807 9223372036854775807 | ./gnu/factor’
20.913 ± 0.238 times (1,991.3%) faster than #3 ‘seq 9223372036854675807 9223372036854775807 | ./uu/factor’
1.534 ± 0.819 times (53.4%) faster than #4 ‘seq 9223372036854675807 9223372036854775807 | ./factoring/gcc_example’
Due to #1559, uutils' factor of course cannot currently handle numbers over 64 bits. However, if I continue to compare GNU factor to the EPR Factoring Library, the latter is 3.5× faster for 64-bit numbers:
$ ./time.sh -i -r 5 "seq $(bc <<<'2^64') $(bc <<<'(2^64) + (10^5)') | "{factor,./gnu/factor,./factoring/gcc_example,./factoring/clang_example}
Benchmark #1: seq 18446744073709551616 18446744073709651616 | factor
Time (mean ± σ std dev): 4.1524s ± 0.0236s [User: 4.1490s, System: 0.0100s]
Range (min … median … max): 4.113s … 4.157s … 4.178s CPU: 100.2%, 5 runs
Benchmark #2: seq 18446744073709551616 18446744073709651616 | ./gnu/factor
Time (mean ± σ std dev): 3.9936s ± 0.0552s [User: 3.9614s, System: 0.0268s]
Range (min … median … max): 3.950s … 3.971s … 4.101s CPU: 99.9%, 5 runs
Benchmark #3: seq 18446744073709551616 18446744073709651616 | ./factoring/gcc_example
Time (mean ± σ std dev): 1.2992s ± 0.0106s [User: 1.2544s, System: 0.0296s]
Range (min … median … max): 1.287s … 1.299s … 1.317s CPU: 98.8%, 5 runs
Benchmark #4: seq 18446744073709551616 18446744073709651616 | ./factoring/clang_example
Time (mean ± σ std dev): 1.1244s ± 0.0041s [User: 1.0612s, System: 0.0454s]
Range (min … median … max): 1.118s … 1.124s … 1.131s CPU: 98.4%, 5 runs
Summary
#4 ‘seq 18446744073709551616 18446744073709651616 | ./factoring/clang_example’ ran
3.693 ± 0.025 times (269.3%) faster than #1 ‘seq 18446744073709551616 18446744073709651616 | factor’
3.552 ± 0.051 times (255.2%) faster than #2 ‘seq 18446744073709551616 18446744073709651616 | ./gnu/factor’
1.155 ± 0.010 times (15.5%) faster than #3 ‘seq 18446744073709551616 18446744073709651616 | ./factoring/gcc_example’
28.8× faster for 96-bit numbers:
$ ./time.sh -i -r 5 'seq 79228162514264337593543940336 79228162514264337593543962335 | '{factor,./gnu/factor,./factoring/gcc_example,./factoring/clang_example}
Benchmark #1: seq 79228162514264337593543940336 79228162514264337593543962335 | factor
Time (mean ± σ std dev): 152.4726s ± 1.5660s [User: 152.4634s, System: 0.0122s]
Range (min … median … max): 149.630s … 152.627s … 154.343s CPU: 100.0%, 5 runs
Benchmark #2: seq 79228162514264337593543940336 79228162514264337593543962335 | ./gnu/factor
Time (mean ± σ std dev): 150.3186s ± 0.2998s [User: 150.3032s, System: 0.0060s]
Range (min … median … max): 149.911s … 150.330s … 150.812s CPU: 100.0%, 5 runs
Benchmark #3: seq 79228162514264337593543940336 79228162514264337593543962335 | ./factoring/gcc_example
Time (mean ± σ std dev): 7.7510s ± 0.0475s [User: 7.7174s, System: 0.0114s]
Range (min … median … max): 7.690s … 7.731s … 7.814s CPU: 99.7%, 5 runs
Benchmark #4: seq 79228162514264337593543940336 79228162514264337593543962335 | ./factoring/clang_example
Time (mean ± σ std dev): 5.2026s ± 0.0816s [User: 5.1606s, System: 0.0220s]
Range (min … median … max): 5.089s … 5.184s … 5.310s CPU: 99.6%, 5 runs
Summary
#4 ‘seq 79228162514264337593543940336 79228162514264337593543962335 | ./factoring/clang_example’ ran
29.307 ± 0.549 times (2,830.7%) faster than #1 ‘seq 79228162514264337593543940336 79228162514264337593543962335 | factor’
28.893 ± 0.457 times (2,789.3%) faster than #2 ‘seq 79228162514264337593543940336 79228162514264337593543962335 | ./gnu/factor’
1.490 ± 0.025 times (49.0%) faster than #3 ‘seq 79228162514264337593543940336 79228162514264337593543962335 | ./factoring/gcc_example’
and a whopping 334× faster than GNU factor for 127-bit numbers:
$ ./time.sh -i -r 5 "seq $(bc <<<'(2^127) - (10^2) - 1') $(bc <<<'(2^127) - 1') | "{factor,./gnu/factor,./factoring/gcc_example,./factoring/clang_example}
Benchmark #1: seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | factor
Time (mean ± σ std dev): 143.1562s ± 0.5082s [User: 143.1556s, System: 0.0000s]
Range (min … median … max): 142.371s … 143.422s … 143.732s CPU: 100.0%, 5 runs
Benchmark #2: seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./gnu/factor
Time (mean ± σ std dev): 142.3252s ± 0.7053s [User: 141.9832s, System: 0.0060s]
Range (min … median … max): 141.651s … 141.996s … 143.588s CPU: 99.8%, 5 runs
Benchmark #3: seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./factoring/gcc_example
Time (mean ± σ std dev): 0.9602s ± 0.7591s [User: 0.5456s, System: 0.0080s]
Range (min … median … max): 0.507s … 0.564s … 2.473s CPU: 57.7%, 5 runs
Benchmark #4: seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./factoring/clang_example
Time (mean ± σ std dev): 0.4254s ± 0.1519s [User: 0.3314s, System: 0.0080s]
Range (min … median … max): 0.312s … 0.356s … 0.719s CPU: 79.8%, 5 runs
Summary
#4 ‘seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./factoring/clang_example’ ran
336.521 ± 120.198 times (33,552.1%) faster than #1 ‘seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | factor’
334.568 ± 119.506 times (33,356.8%) faster than #2 ‘seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./gnu/factor’
2.257 ± 1.958 times (125.7%) faster than #3 ‘seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./factoring/gcc_example’
Considering that GNU factor seems to be around 5 times faster than uutils' factor, if this library is over 300 times faster than GNU factor, that means it would be at least 1,500 times faster than uutils' factor for 128-bit numbers. I also found that for some numbers with large 64-bit factors it is almost 2,000 times faster than GNU factor, which means it would be at least 10,000 times faster than uutils' factor! 🤯 Note that this library is slower than both GNU factor and uutils' factor for smaller numbers under 32 bits, as can be seen from the first benchmark above, but I filed an issue about this: https://github.com/hurchalla/factoring/issues/1.
For reference, the above benchmarks include the system GNU factor (factor), a locally compiled version of GNU factor with -O3 and LTO enabled (./gnu/factor), uutils' factor built with the "release" profile (./uu/factor), a simple wrapper program I wrote for the EPR Factoring Library adapted from the author's example, compiled with both GCC (./factoring/gcc_example) and Clang (./factoring/clang_example), and lastly my Numbers Tool program, which includes a partial C++ port of GNU factor (./numbers -p).
I tested several other programs and libraries, but nothing else seemed to have competitive performance in this range. The algorithms used by the yafu program (Yet Another Factorization Utility) may be useful for numbers above 64 bits (up to 160 digits, or around 512 bits), but it was hard to benchmark: it has extremely verbose output, which is informative (similar to the undocumented ---debug option of GNU factor) but significantly affects performance when factoring a large range of numbers.
Nice investigation. How did you compile the rust coreutils ?
Nice investigation. How did you compile the rust coreutils ?
Thanks. This is the main command I used: make PROFILE=release. The full list of commands is on the README for my tool (click "Instructions" to show them). From reading your Cargo.toml file, it looks like this should enable LTO. I am open to suggestions for additional flags to further improve the resulting performance.
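For anyone replicating the build, these are the standard Cargo release-profile settings that usually matter for this kind of benchmark; a generic sketch of knobs to experiment with, not necessarily what the project's Cargo.toml already sets:

```toml
[profile.release]
opt-level = 3      # the default for release builds, listed for completeness
lto = true         # cross-crate link-time optimization
codegen-units = 1  # one codegen unit: better optimization, slower compiles
```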
ok, so, it should be ok. I am surprised that we see such a big difference.
Compilers haven't improved that much (and clang and rust are both based on LLVM).
thanks
I compiled GNU coreutils with the default GCC and uutils, of course, with the default rustc/LLVM. I suppose I could compile GNU coreutils with Clang/LLVM as well, but I figured it would be more optimized for its own compiler/toolchain. For reference, I used GCC 11.4, Clang 14.0 and cargo/rustc 1.72.
I no longer have access to the Intel Xeon E3 server I used for the benchmarking in 2020, so I used an Intel Core i7-9750 system this time. It is probably faster than the old system (at least for single-threaded programs), but I would not expect this to change the GNU vs. uutils performance ratio. Anyway, maybe someone else should build GNU and uutils coreutils themselves and then run hyperfine or my Benchmarking Tool to see if they can reproduce the results.
After building uutils using cargo build --release on Fedora 32 x86-64: a ~200 times performance loss is very bad.