uutils / coreutils

Cross-platform Rust rewrite of the GNU coreutils
https://uutils.github.io/
MIT License

uutils' factor is much slower than GNU's coreutils factor #1456

Open shlomif opened 4 years ago

shlomif commented 4 years ago

After building uutils using cargo build --release on Fedora 32 x86-64:

[shlomif@localhost coreutils]$ time (seq 2 "$(( 10 ** 6 ))" | ./target/release/uutils factor | md5sum)
4cfd4f52505c4e3852c373b8b2e8a628  -
( seq 2 "$(( 10 ** 6 ))" | ./target/release/uutils factor | md5sum; )  48.35s user 4.08s system 108% cpu 48.240 total
[shlomif@localhost coreutils]$ time (seq 2 "$(( 10 ** 6 ))" | /usr/bin/factor | md5sum)
4cfd4f52505c4e3852c373b8b2e8a628  -
( seq 2 "$(( 10 ** 6 ))" | /usr/bin/factor | md5sum; )  0.25s user 0.10s system 160% cpu 0.221 total

A ~200 times performance loss is very bad.

Arcterus commented 4 years ago

Yikes, that is pretty bad. I'll look into it.

shlomif commented 4 years ago

On Mon, 30 Mar 2020 05:02:04 -0700 Alex Lyon notifications@github.com wrote:

Yikes, that is pretty bad. I'll look into it.

Thanks!

Arcterus commented 4 years ago

You might want to try again now. Performance is ~16-17 times worse than GNU right now (at least for me).

Arcterus commented 4 years ago

When #1547 is merged, performance will be ~6-7 times worse (as measured on my desktop).

shlomif commented 4 years ago

@Arcterus : thanks! The performance looks much better on uutils' git master HEAD:

[shlomif@localhost coreutils]$ time (seq 2 "$(( 10 ** 6 ))" | ./target/release/coreutils factor | md5sum)
4cfd4f52505c4e3852c373b8b2e8a628  -
( seq 2 "$(( 10 ** 6 ))" | ./target/release/coreutils factor | md5sum; )  1.46s user 0.05s system 112% cpu 1.341 total
[shlomif@localhost coreutils]$ time (seq 2 "$(( 10 ** 6 ))" | /usr/bin/factor | md5sum)                                                       
4cfd4f52505c4e3852c373b8b2e8a628  -
( seq 2 "$(( 10 ** 6 ))" | /usr/bin/factor | md5sum; )  0.26s user 0.10s system 152% cpu 0.236 total
[shlomif@localhost coreutils]$ 
nbraud commented 4 years ago

@shlomif You are welcome :)

I left the bug open through my performance-related PRs to factor, as it's not yet as fast as the GNU implementation, but I'm hoping to get there... and beyond! :smiling_imp:

tdulcet commented 4 years ago

Another issue with the current factor implementation is that it only supports numbers up to 2⁶⁴ - 1. GNU factor supports numbers up to 2¹²⁷ - 1 if compiled without the GNU Multiple Precision (GMP) library or arbitrary-precision numbers if compiled with it. This can be tested by running this command:

factor 9223372036854775807 18446744073709551615 170141183460469231731687303715884105727 340282366920938463463374607431768211455

which factors 2⁶³ - 1, 2⁶⁴ - 1, 2¹²⁷ - 1 and 2¹²⁸ - 1.

GNU factor without GMP produces:

$ factor 9223372036854775807 18446744073709551615 170141183460469231731687303715884105727 340282366920938463463374607431768211455
9223372036854775807: 7 7 73 127 337 92737 649657
18446744073709551615: 3 5 17 257 641 65537 6700417
170141183460469231731687303715884105727: 170141183460469231731687303715884105727
factor: ‘340282366920938463463374607431768211455’ is too large

GNU factor with GMP produces:

$ factor 9223372036854775807 18446744073709551615 170141183460469231731687303715884105727 340282366920938463463374607431768211455
9223372036854775807: 7 7 73 127 337 92737 649657
18446744073709551615: 3 5 17 257 641 65537 6700417
170141183460469231731687303715884105727: 170141183460469231731687303715884105727
340282366920938463463374607431768211455: 3 5 17 257 641 65537 274177 6700417 67280421310721

and uutils' factor produces:

$ uu-factor 9223372036854775807 18446744073709551615 170141183460469231731687303715884105727 340282366920938463463374607431768211455
factor: warning: 170141183460469231731687303715884105727: number too large to fit in target type
factor: warning: 340282366920938463463374607431768211455: number too large to fit in target type
9223372036854775807: 7 7 73 127 337 60247241209
18446744073709551615: 3 5 17 257 641 65537 6700417

This was first documented in #201.
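
For context, the "number too large to fit in target type" warnings above appear to be the standard Rust integer-parsing error: anything above u64::MAX fails to parse into a 64-bit integer. A minimal illustrative sketch (hypothetical example, not the uutils code):

fn main() {
    // u64::MAX is 18446744073709551615 (2⁶⁴ - 1); larger values cannot be parsed into a u64.
    let inputs = [
        "18446744073709551615",                    // 2⁶⁴ - 1: fits
        "170141183460469231731687303715884105727", // 2¹²⁷ - 1: does not fit
    ];
    for s in &inputs {
        match s.parse::<u64>() {
            Ok(n) => println!("{n}: parsed as u64"),
            // For the second input this prints the same
            // "number too large to fit in target type" message shown above.
            Err(e) => println!("{s}: {e}"),
        }
    }
}

Parsing into u128 would raise the limit to 2¹²⁸ - 1, but as discussed below, the factoring itself would still be far too slow for numbers that large.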

tdulcet commented 4 years ago

@nbraud Thanks for your work improving the performance! I am getting incorrect results for some numbers, for example:

$ factor 10425511
10425511: 2441 4271
$ uu-factor 10425511
10425511: 10425511

You can see a bunch by running this in Bash:

diff <(seq 2 $(( 10 ** 8 )) | factor) <(seq 2 $(( 10 ** 8 )) | uu-factor)
Arcterus commented 4 years ago

If you have time, would you mind tracking down which commit introduced the problem? If not, I’ll find it later today or tomorrow.

nbraud commented 4 years ago

Another issue with the current factor implementation is that it only supports numbers up to 2⁶⁴ - 1. GNU factor supports numbers up to 2¹²⁷ - 1 if compiled without the GNU Multiple Precision (GMP) library or arbitrary-precision numbers if compiled with it.

I'm aware, and it's on my TODO list, but I do not believe it makes much sense to implement support for larger numbers while the current implementation is far too slow to be meaningfully usable there.

Please open an issue about it, though, as it's definitely worth tracking :)

nbraud commented 4 years ago

If you have time, would you mind tracking down which commit introduced the problem? If not, I’ll find it later today or tomorrow.

FYI, @Arcterus, I found the bug and opened a new PR. (Update: fixed and merged)

rivy commented 3 years ago

The merged commit effb94b0 (thanks primarily to @nbraud) moves the needle on this code's performance from about 1/200th to about 1/3rd the speed of GNU factor for very large sets of numbers (e.g., factoring the first 10,000,000 integers). For a smaller set of numbers, it now operates at perceptually the same speed as GNU factor (likely dominated by startup and I/O). The latest benchmark comparison is included within the discussion of #1571.

There are obviously more performance gains to be had by optimizing the algorithms, but the effort grows steeply for increasingly small improvements. If someone wants to champion massaging this into a world-record breaker, I'm all for it. But otherwise, I'm happy with this code state and performance and think this is a good place to stop.

@shlomif, your comments/experience/thoughts? @uutils/members?

shlomif commented 3 years ago

Hi @rivy ! Thanks for the heads up.

The performance ratio is somewhat worse here:

[shlomif@telaviv1 coreutils]$ time ( seq 2 10000000 | target/release/coreutils factor > /dev/null )
( seq 2 10000000 | target/release/coreutils factor > /dev/null; )  12.72s user 0.13s system 101% cpu 12.620 total
[shlomif@telaviv1 coreutils]$ time ( seq 2 10000000 | factor > /dev/null )
( seq 2 10000000 | factor > /dev/null; )  2.82s user 0.17s system 107% cpu 2.776 total
[shlomif@telaviv1 coreutils]$ perl -E 'say 12.72 / 2.776'
4.5821325648415
[shlomif@telaviv1 coreutils]$ inxi -CSG
System:
  Host: telaviv1.shlomifish.org Kernel: 5.9.3-desktop-1.mga8 x86_64 bits: 64 
  Desktop: KDE Plasma 5.20.2 Distro: Mageia 8 mga8 
CPU:
  Info: Dual Core model: Intel Core i3-2100 bits: 64 type: MT MCP 
  L2 cache: 3072 KiB 
  Speed: 1596 MHz min/max: 1600/3100 MHz Core speeds (MHz): 1: 1596 2: 1596 
  3: 1596 4: 1597 
Graphics:
  Device-1: Intel 2nd Generation Core Processor Family Integrated Graphics 
  driver: i915 v: kernel 
  Display: server: Mageia X.org 1.20.9 driver: intel 
  resolution: 1920x1080~60Hz 
  OpenGL: renderer: Mesa DRI Intel HD Graphics 2000 (SNB GT1) 
  v: 3.3 Mesa 20.2.1 
[shlomif@telaviv1 coreutils]$  

Anyway, I am fine with closing this ticket because my interest in uutils is primarily academic. A 4.5× ratio is much more acceptable than a 200× one, so good job - and thanks.

tdulcet commented 3 years ago

I can confirm @shlomif's results. For numbers less than 10⁴, it is about 2.4× slower, for 10⁵ about 3.5× slower, for 10⁶ about 4.5× slower, for 10⁷ about 4.3× slower and for 10⁸ about 3.2× slower. For example, here is a benchmark of the system GNU factor (factor), a locally compiled version of GNU factor (./factor) and uutils' factor (./uu-factor):

Benchmark #1: seq 0 1000000 | factor
  Time (x̅ mean ± σ std dev):      0.3190s ±  0.0051s          [User: 0.2982s, System: 0.0504s]
  Range (min … x̃ median … max):   0.311s …  0.322s …  0.324s   CPU: 109.3%, 5 runs

Benchmark #2: seq 0 1000000 | ./factor
  Time (x̅ mean ± σ std dev):      0.3080s ±  0.0088s          [User: 0.2640s, System: 0.0706s]
  Range (min … x̃ median … max):   0.295s …  0.308s …  0.318s   CPU: 108.6%, 5 runs

Benchmark #3: seq 0 1000000 | ./uu-factor
  Time (x̅ mean ± σ std dev):      1.3884s ±  0.0332s          [User: 1.3698s, System: 0.0276s]
  Range (min … x̃ median … max):   1.347s …  1.383s …  1.444s   CPU: 100.6%, 5 runs

Summary
  #2 ‘seq 0 1000000 | ./factor’ ran
    1.036 ± 0.034 times (103.6%) faster than #1 ‘seq 0 1000000 | factor’
    4.508 ± 0.168 times (450.8%) faster than #3 ‘seq 0 1000000 | ./uu-factor’

(This is similar to @rivy's benchmarks, but using my Bash port of hyperfine, which outputs more info.)

However, if you go backwards 10⁶ from 2⁶⁴, it is 5.8× slower:

Benchmark #1: seq 18446744073708551615 18446744073709551615 | factor
  Time (x̅ mean ± σ std dev):     58.2510s ±  0.2072s          [User: 58.1622s, System: 2.4806s]
  Range (min … x̃ median … max):  58.040s … 58.163s … 58.525s   CPU: 104.1%, 5 runs

Benchmark #2: seq 18446744073708551615 18446744073709551615 | ./factor
  Time (x̅ mean ± σ std dev):     58.0552s ±  0.1607s          [User: 58.0212s, System: 2.5270s]
  Range (min … x̃ median … max):  57.800s … 58.063s … 58.307s   CPU: 104.3%, 5 runs

Benchmark #3: seq 18446744073708551615 18446744073709551615 | ./uu-factor
  Time (x̅ mean ± σ std dev):     336.7276s ±  0.4459s          [User: 336.7042s, System: 0.3726s]
  Range (min … x̃ median … max):  335.909s … 336.824s … 337.240s   CPU: 100.1%, 5 runs

Summary
  #2 ‘seq 18446744073708551615 18446744073709551615 | ./factor’ ran
    1.003 ± 0.005 times (100.3%) faster than #1 ‘seq 18446744073708551615 18446744073709551615 | factor’
    5.800 ± 0.018 times (580.0%) faster than #3 ‘seq 18446744073708551615 18446744073709551615 | ./uu-factor’

Since it seems to be slower for larger numbers, it follows that the ratio will be even worse once #1559 is fixed. While the performance has definitely improved a lot, I think this issue should at least be left open until it is no more than around two times slower in all cases, particularly after arbitrary-precision numbers are supported.

rivy commented 3 years ago

I'm happy to leave this as an open issue to encourage more refinement, with a plan to revisit next year.

PRs are welcome!

nbraud commented 3 years ago

Since it seems to be slower for larger numbers, it follows that the ratio will be even worse once #1559 is fixed.

Yes, that's a large part of why I didn't bother implementing support for arbitrarily-large numbers yet, since it would be unusably slow anyhow.

While the performance has definitely improved a lot, I think this issue should at least be left open until it is no more than two times slower in all cases, particularly after arbitrary-precision numbers are supported.

I don't have strong opinions about what the performance threshold should be before closing this, but I have some branches implementing better algorithms, so there is a clear direction forward for closing it.

Unfortunately, I was pretty broken for the last half-year and didn't have a workstation to run builds and benchmarks on (so far it has all been on my laptop, which is very slow and has thermal issues that make precise benchmarking mostly impossible), but things should hopefully be looking up this year.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tdulcet commented 1 year ago

it's not yet as fast as the GNU implementation, but I'm hoping to get there... and beyond! 😈

This C++ library allegedly has better performance than GNU factor, and it has good documentation on the algorithms the author used to achieve that (both in the README and in the code), so it may be a good reference for potential improvements here: https://github.com/hurchalla/factoring. It also supports up to 128-bit numbers.

oscardssmith commented 1 year ago

FYI, if you want to make up a 4× performance difference, doing so is pretty easy here, as https://en.algorithmica.org/hpc/algorithms/factorization/ has a good overview for fast factoring of relatively small numbers in Rust. The key thing missing here is the Pollard-Brent algorithm, which lets you compute the GCD much less often and should yield a roughly 5× speedup. (You could also improve the GCD computation, but that is a smaller detail.)
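
For reference, here is a rough stand-alone sketch of Pollard's rho with Brent's improvement, i.e. batching the |x - y| products and taking a GCD with n only once per batch. It is illustrative only (the helper names are made up here), not the algorithmica or uutils code; a real factorizer would first strip small primes by trial division and check primality with something like Miller-Rabin.

fn gcd(mut a: u64, mut b: u64) -> u64 {
    // Plain Euclidean GCD for brevity; a binary GCD avoids the hardware division here.
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

fn mul_mod(a: u64, b: u64, m: u64) -> u64 {
    // Widen to u128 so the product cannot overflow.
    ((a as u128 * b as u128) % m as u128) as u64
}

/// Try to find a nontrivial factor of an odd composite n; retry with another c on failure.
fn pollard_brent(n: u64, c: u64) -> Option<u64> {
    // Iteration function x -> x^2 + c (mod n), using u128 arithmetic to avoid overflow.
    let f = |x: u64| ((mul_mod(x, x, n) as u128 + c as u128) % n as u128) as u64;
    let m: u64 = 128; // batch size: number of steps between GCD calls
    let (mut y, mut r, mut q, mut g) = (2u64, 1u64, 1u64, 1u64);
    let (mut x, mut ys) = (y, y);
    while g == 1 {
        x = y;
        for _ in 0..r {
            y = f(y);
        }
        let mut k = 0;
        while k < r && g == 1 {
            ys = y;
            for _ in 0..m.min(r - k) {
                y = f(y);
                // Accumulate |x - y| into a running product instead of taking a GCD every step.
                q = mul_mod(q, x.abs_diff(y), n);
            }
            g = gcd(q, n);
            k += m;
        }
        r *= 2;
    }
    if g == n {
        // The batch overshot past a factor; back up one step at a time from ys.
        loop {
            ys = f(ys);
            g = gcd(x.abs_diff(ys), n);
            if g > 1 {
                break;
            }
        }
    }
    if g != n { Some(g) } else { None }
}

fn main() {
    // 10425511 = 2441 * 4271, the composite reported earlier in this thread;
    // this should print one of its two factors.
    println!("{:?}", pollard_brent(10_425_511, 1));
}

A complete factor implementation would call something like this recursively on the returned factor and its cofactor until everything left is prime.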

For performance beyond 64 bits, you would probably need ECM (the elliptic curve method) if you want top-of-the-line performance, but that gets complicated fairly quickly.

tdulcet commented 1 year ago

has a good overview for fast factoring of relatively small numbers in Rust

Yes, that looks like a good resource, although note that his example code is in C++, not Rust. While uutils should definitely implement the Pollard-Brent rho algorithm as GNU factor does, I am not sure it would lead to a 5× speedup. I ran his factor code and it was orders of magnitude slower than GNU factor, although he does not have a complete implementation combining all his methods, so it is hard to directly compare the performance. It looks like uutils already uses the faster binary GCD algorithm.
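
For reference, here is a minimal sketch of the binary (Stein's) GCD, which replaces the Euclidean algorithm's divisions with shifts and subtractions; an illustrative version only, not the uutils code:

fn binary_gcd(mut a: u64, mut b: u64) -> u64 {
    if a == 0 {
        return b;
    }
    if b == 0 {
        return a;
    }
    // Factor out the common power of two, then keep a odd.
    let shift = (a | b).trailing_zeros();
    a >>= a.trailing_zeros();
    loop {
        b >>= b.trailing_zeros();
        if a > b {
            std::mem::swap(&mut a, &mut b);
        }
        b -= a; // odd - odd is even, so the next shift always makes progress
        if b == 0 {
            return a << shift;
        }
    }
}

fn main() {
    assert_eq!(binary_gcd(12, 18), 6);
    assert_eq!(binary_gcd(17, 5), 1);
    // 92737 and 649657 are the two large prime factors of 2⁶³ - 1 from the output above.
    println!("{}", binary_gcd(92737 * 3, 649657 * 3)); // prints 3
}

Avoiding the hardware division in the Euclidean algorithm's modulo step is the "smaller detail" GCD improvement mentioned in the comment above.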

There is more information about all the algorithms used by GNU factor here: https://www.maizure.org/projects/decoded-gnu-coreutils/factor.html.

tdulcet commented 1 year ago

This C++ library allegedly has better performance than GNU factor

Since it has now been almost 3 years, I thought I should update the benchmarks from https://github.com/uutils/coreutils/issues/1456#issuecomment-721688400 and include more programs for comparison. For integers less than 10⁴, it is now about 14.8× slower, for 10⁵ about 7.6× slower, for 10⁶ about 6.0× slower, for 10⁷ about 5.7× slower and for 10⁸ about 4.2× slower:

$ ./time.sh -i -r 5 'seq 2 10000000 | '{factor,./gnu/factor,./uu/factor,./factoring/gcc_example,./factoring/clang_example,'./numbers -p'}
Benchmark #1: seq 2 10000000 | factor
  Time (mean ± σ std dev):      2.2944s ±  0.0112s          [User: 2.0142s, System: 0.4204s]
  Range (min … median … max):   2.281s …  2.291s …  2.313s   CPU: 106.1%, 5 runs

Benchmark #2: seq 2 10000000 | ./gnu/factor
  Time (mean ± σ std dev):      2.2020s ±  0.0087s          [User: 1.8820s, System: 0.4534s]
  Range (min … median … max):   2.188s …  2.201s …  2.213s   CPU: 106.1%, 5 runs

Benchmark #3: seq 2 10000000 | ./uu/factor
  Time (mean ± σ std dev):     12.6390s ±  0.0296s          [User: 10.2822s, System: 2.4632s]
  Range (min … median … max):  12.591s … 12.637s … 12.684s   CPU: 100.8%, 5 runs

Benchmark #4: seq 2 10000000 | ./factoring/gcc_example
  Time (mean ± σ std dev):     15.2668s ±  0.1457s          [User: 12.6782s, System: 2.7610s]
  Range (min … median … max):  15.093s … 15.291s … 15.512s   CPU: 101.1%, 5 runs

Benchmark #5: seq 2 10000000 | ./factoring/clang_example
  Time (mean ± σ std dev):     15.2658s ±  0.0899s          [User: 12.7338s, System: 2.7064s]
  Range (min … median … max):  15.172s … 15.212s … 15.419s   CPU: 101.1%, 5 runs

Benchmark #6: seq 2 10000000 | ./numbers -p
  Time (mean ± σ std dev):     19.5844s ±  0.2708s          [User: 17.0100s, System: 2.7530s]
  Range (min … median … max):  19.241s … 19.561s … 20.072s   CPU: 100.9%, 5 runs

Summary
  #2 ‘seq 2 10000000 | ./gnu/factor’ ran
    1.042 ± 0.007 times (4.2%) faster than #1 ‘seq 2 10000000 | factor’
    5.740 ± 0.026 times (474.0%) faster than #3 ‘seq 2 10000000 | ./uu/factor’
    6.933 ± 0.072 times (593.3%) faster than #4 ‘seq 2 10000000 | ./factoring/gcc_example’
    6.933 ± 0.049 times (593.3%) faster than #5 ‘seq 2 10000000 | ./factoring/clang_example’
    8.894 ± 0.128 times (789.4%) faster than #6 ‘seq 2 10000000 | ./numbers -p’

Overall, uutils' factor seems to have gotten significantly slower relative to GNU factor over the past 3 years. Considering that GNU factor has largely not changed in over 10 years (since the rewrite in 2012) and uutils' factor has not changed in two years (since @nbraud left), my best guess is that this is due to compiler improvements that have benefited GNU factor more than uutils' factor.

If you go backwards 10⁵ from 2⁶³, it is now 5.2× slower than GNU factor, but 20.9× slower than that EPR Factoring Library I mentioned above:

$ ./time.sh -i -r 5 "seq $(bc <<<'(2^63) - (10^5) - 1') $(bc <<<'(2^63) - 1') | "{factor,./gnu/factor,./uu/factor,./factoring/gcc_example,./factoring/clang_example}
Benchmark #1: seq 9223372036854675807 9223372036854775807 | factor
  Time (mean ± σ std dev):      3.2782s ±  0.0088s          [User: 3.2758s, System: 0.0080s]
  Range (min … median … max):   3.267s …  3.276s …  3.294s   CPU: 100.2%, 5 runs

Benchmark #2: seq 9223372036854675807 9223372036854775807 | ./gnu/factor
  Time (mean ± σ std dev):      3.1176s ±  0.0144s          [User: 3.0974s, System: 0.0146s]
  Range (min … median … max):   3.104s …  3.114s …  3.145s   CPU:  99.8%, 5 runs

Benchmark #3: seq 9223372036854675807 9223372036854775807 | ./uu/factor
  Time (mean ± σ std dev):     16.2826s ±  0.1737s          [User: 16.2178s, System: 0.0360s]
  Range (min … median … max):  16.014s … 16.267s … 16.557s   CPU:  99.8%, 5 runs

Benchmark #4: seq 9223372036854675807 9223372036854775807 | ./factoring/gcc_example
  Time (mean ± σ std dev):      1.1940s ±  0.6380s          [User: 0.8254s, System: 0.0332s]
  Range (min … median … max):   0.873s …  0.875s …  2.470s   CPU:  71.9%, 5 runs

Benchmark #5: seq 9223372036854675807 9223372036854775807 | ./factoring/clang_example
  Time (mean ± σ std dev):      0.7786s ±  0.0031s          [User: 0.7162s, System: 0.0510s]
  Range (min … median … max):   0.773s …  0.779s …  0.782s   CPU:  98.5%, 5 runs

Summary
  #5 ‘seq 9223372036854675807 9223372036854775807 | ./factoring/clang_example’ ran
    4.210 ± 0.020 times (321.0%) faster than #1 ‘seq 9223372036854675807 9223372036854775807 | factor’
    4.004 ± 0.025 times (300.4%) faster than #2 ‘seq 9223372036854675807 9223372036854775807 | ./gnu/factor’
   20.913 ± 0.238 times (1,991.3%) faster than #3 ‘seq 9223372036854675807 9223372036854775807 | ./uu/factor’
    1.534 ± 0.819 times (53.4%) faster than #4 ‘seq 9223372036854675807 9223372036854775807 | ./factoring/gcc_example’

Due to #1559, uutils' factor of course cannot currently be tested with numbers over 64 bits. However, if I continue to compare GNU factor to the EPR Factoring Library, the library is 3.5× faster for 64-bit numbers:

$ ./time.sh -i -r 5 "seq $(bc <<<'2^64') $(bc <<<'(2^64) + (10^5)') | "{factor,./gnu/factor,./factoring/gcc_example,./factoring/clang_example}
Benchmark #1: seq 18446744073709551616 18446744073709651616 | factor
  Time (mean ± σ std dev):      4.1524s ±  0.0236s          [User: 4.1490s, System: 0.0100s]
  Range (min … median … max):   4.113s …  4.157s …  4.178s   CPU: 100.2%, 5 runs

Benchmark #2: seq 18446744073709551616 18446744073709651616 | ./gnu/factor
  Time (mean ± σ std dev):      3.9936s ±  0.0552s          [User: 3.9614s, System: 0.0268s]
  Range (min … median … max):   3.950s …  3.971s …  4.101s   CPU:  99.9%, 5 runs

Benchmark #3: seq 18446744073709551616 18446744073709651616 | ./factoring/gcc_example
  Time (mean ± σ std dev):      1.2992s ±  0.0106s          [User: 1.2544s, System: 0.0296s]
  Range (min … median … max):   1.287s …  1.299s …  1.317s   CPU:  98.8%, 5 runs

Benchmark #4: seq 18446744073709551616 18446744073709651616 | ./factoring/clang_example
  Time (mean ± σ std dev):      1.1244s ±  0.0041s          [User: 1.0612s, System: 0.0454s]
  Range (min … median … max):   1.118s …  1.124s …  1.131s   CPU:  98.4%, 5 runs

Summary
  #4 ‘seq 18446744073709551616 18446744073709651616 | ./factoring/clang_example’ ran
    3.693 ± 0.025 times (269.3%) faster than #1 ‘seq 18446744073709551616 18446744073709651616 | factor’
    3.552 ± 0.051 times (255.2%) faster than #2 ‘seq 18446744073709551616 18446744073709651616 | ./gnu/factor’
    1.155 ± 0.010 times (15.5%) faster than #3 ‘seq 18446744073709551616 18446744073709651616 | ./factoring/gcc_example’

28.8× faster for 96-bit numbers:

$ ./time.sh -i -r 5 'seq 79228162514264337593543940336 79228162514264337593543962335 | '{factor,./gnu/factor,./factoring/gcc_example,./factoring/clang_example}
Benchmark #1: seq 79228162514264337593543940336 79228162514264337593543962335 | factor
  Time (mean ± σ std dev):     152.4726s ±  1.5660s          [User: 152.4634s, System: 0.0122s]
  Range (min … median … max):  149.630s … 152.627s … 154.343s   CPU: 100.0%, 5 runs

Benchmark #2: seq 79228162514264337593543940336 79228162514264337593543962335 | ./gnu/factor
  Time (mean ± σ std dev):     150.3186s ±  0.2998s          [User: 150.3032s, System: 0.0060s]
  Range (min … median … max):  149.911s … 150.330s … 150.812s   CPU: 100.0%, 5 runs

Benchmark #3: seq 79228162514264337593543940336 79228162514264337593543962335 | ./factoring/gcc_example
  Time (mean ± σ std dev):      7.7510s ±  0.0475s          [User: 7.7174s, System: 0.0114s]
  Range (min … median … max):   7.690s …  7.731s …  7.814s   CPU:  99.7%, 5 runs

Benchmark #4: seq 79228162514264337593543940336 79228162514264337593543962335 | ./factoring/clang_example
  Time (mean ± σ std dev):      5.2026s ±  0.0816s          [User: 5.1606s, System: 0.0220s]
  Range (min … median … max):   5.089s …  5.184s …  5.310s   CPU:  99.6%, 5 runs

Summary
  #4 ‘seq 79228162514264337593543940336 79228162514264337593543962335 | ./factoring/clang_example’ ran
   29.307 ± 0.549 times (2,830.7%) faster than #1 ‘seq 79228162514264337593543940336 79228162514264337593543962335 | factor’
   28.893 ± 0.457 times (2,789.3%) faster than #2 ‘seq 79228162514264337593543940336 79228162514264337593543962335 | ./gnu/factor’
    1.490 ± 0.025 times (49.0%) faster than #3 ‘seq 79228162514264337593543940336 79228162514264337593543962335 | ./factoring/gcc_example’

and a whopping 334× faster than GNU factor for 127-bit numbers:

$ ./time.sh -i -r 5 "seq $(bc <<<'(2^127) - (10^2) - 1') $(bc <<<'(2^127) - 1') | "{factor,./gnu/factor,./factoring/gcc_example,./factoring/clang_example}
Benchmark #1: seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | factor
  Time (mean ± σ std dev):     143.1562s ±  0.5082s          [User: 143.1556s, System: 0.0000s]
  Range (min … median … max):  142.371s … 143.422s … 143.732s   CPU: 100.0%, 5 runs

Benchmark #2: seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./gnu/factor
  Time (mean ± σ std dev):     142.3252s ±  0.7053s          [User: 141.9832s, System: 0.0060s]
  Range (min … median … max):  141.651s … 141.996s … 143.588s   CPU:  99.8%, 5 runs

Benchmark #3: seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./factoring/gcc_example
  Time (mean ± σ std dev):      0.9602s ±  0.7591s          [User: 0.5456s, System: 0.0080s]
  Range (min … median … max):   0.507s …  0.564s …  2.473s   CPU:  57.7%, 5 runs

Benchmark #4: seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./factoring/clang_example
  Time (mean ± σ std dev):      0.4254s ±  0.1519s          [User: 0.3314s, System: 0.0080s]
  Range (min … median … max):   0.312s …  0.356s …  0.719s   CPU:  79.8%, 5 runs

Summary
  #4 ‘seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./factoring/clang_example’ ran
  336.521 ± 120.198 times (33,552.1%) faster than #1 ‘seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | factor’
  334.568 ± 119.506 times (33,356.8%) faster than #2 ‘seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./gnu/factor’
    2.257 ± 1.958 times (125.7%) faster than #3 ‘seq 170141183460469231731687303715884105627 170141183460469231731687303715884105727 | ./factoring/gcc_example’

Considering that GNU factor seems to be around 5 times faster than uutils' factor, if this library is over 300 times faster than GNU factor, that means it would be at least 1,500 times faster than uutils' factor for 128-bit numbers. I also found that for some numbers with large 64-bit factors it is an astonishing almost 2,000 times faster than GNU factor, which means it would be at least 10,000 times faster than uutils' factor! 🤯 Note that this library is slower than both GNU factor and uutils' factor for smaller numbers under 32 bits, as can be seen from the first benchmark above, but I filed an issue about this: https://github.com/hurchalla/factoring/issues/1.

For reference, the above benchmarks include the system GNU factor (factor), a locally compiled version of GNU factor with -O3 and LTO enabled (./gnu/factor), uutils' factor built with the "release" profile (./uu/factor), a simple wrapper program I wrote for the EPR Factoring Library adapted from the author's example compiled with both GCC (./factoring/gcc_example) and Clang (./factoring/clang_example), and lastly my Numbers Tool program which includes a partial C++ port of GNU factor (./numbers -p).

I tested several other programs and libraries, but nothing else seemed to have competitive performance in this range. The algorithms used by the yafu program (Yet Another Factorization Utility) may be useful for numbers above 64 bits (up to 160 digits, or around 512 bits), but it was hard to benchmark, as it has extremely verbose output which is informative (similar to the undocumented ---debug option of GNU factor), but significantly affects the performance when factoring a large range of numbers.

sylvestre commented 1 year ago

Nice investigation. How did you compile the Rust coreutils?

tdulcet commented 1 year ago

Nice investigation. How did you compile the Rust coreutils?

Thanks. This is the main command I used: make PROFILE=release. The full list of commands is on the README for my tool (click "Instructions" to show them). From reading your Cargo.toml file, it looks like this should enable LTO. I am open to suggestions for additional flags to further improve the resulting performance.

sylvestre commented 1 year ago

OK, so it should be fine. I am surprised that we see such a big difference.

Compilers haven't improved that much (and Clang and Rust are both based on LLVM).

thanks

tdulcet commented 1 year ago

I compiled GNU coreutils with the default GCC and uutils of course with the default rustc/LLVM. I suppose I could compile GNU coreutils with Clang/LLVM as well, but I figured it would be more optimized for their own compiler/toolchain. For reference, I used GCC 11.4, Clang 14.0 and cargo/rustc 1.72.

I no longer have access to the Intel Xeon E3 server I used in 2020 for the benchmarking, so I used an Intel Core i7-9750 system this time. It is probably faster than the old system (for single-threaded programs at least), but I would not think this would make any difference in the GNU vs uutils performance ratio. Anyway, maybe someone else should build GNU and uutils coreutils themselves and then run hyperfine or my Benchmarking Tool to see if they can reproduce the results.
I no longer has access to the Intel Xeon E3 server I used in 2020 for the benchmarking, so I used an Intel Core i7-9750 system this time. It is probably faster than the old system (for single threaded programs at least), but I would not think this would make any difference in the GNU vs uutils performance ratio. Anyway, maybe someone else should build GNU and uutils coreutils themselves and then run hyperfine or my Benchmarking Tool to see if they can reproduce the results.