primesearch / Mlucas

Ⓜ️ Ernst Mayer's Mlucas and Mfactor programs for GIMPS
https://mersenneforum.org/mayer/README.html
GNU General Public License v3.0

Error: `Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry` #19

Open tdulcet opened 5 months ago

tdulcet commented 5 months ago

Latest Mlucas v21.0.1, AVX2 build, Assignment: PRP=1,2,700001,-1

Output:

$ ./Mlucas -cpu 0

    Mlucas 21.0.1

    https://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
System total RAM = 15893, free RAM = 15174
INFO: 15174 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
HWLOC Version = 2.5.0;
        Hardware topology: 7 levels, 1 sockets, 6 cores, 12 logical processors (threads)
INFO: Build uses AVX2 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 12 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 1 cores: 0.
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
NTHREADS = 1
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 looking for worktodo.txt file...
 worktodo.txt file found...reading next assignment...
 worktodo.txt entry: PRP=1,2,700001,-1

INFO: Maximum recommended exponent for FFT length (96 Kdbl) = 1983260; p[ = 700001]/pmax_rec = 0.3529547311.
Initial DWT-multipliers chain length = [long] in carry step.
INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
mers_mod_square: Init threadpool of 1 threads
Using 1 threads in carry step
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
<snip>
INFO: Maximum recommended exponent for FFT length (96 Kdbl) = 1983260; p[ = 700001]/pmax_rec = 0.3529547311.
Initial DWT-multipliers chain length = [hiacc] in carry step.
 INFO: restart file p700001 found...reading...
INFO: Maximum recommended exponent for FFT length (96 Kdbl) = 1983260; p[ = 700001]/pmax_rec = 0.3529547311.
Initial DWT-multipliers chain length = [hiacc] in carry step.
 INFO: restart file p700001 found...reading...
INFO: Maximum recommended exponent for FFT length (96 Kdbl) = 1983260; p[ = 700001]/pmax_rec = 0.3529547311.
Initial DWT-multipliers chain length = [hiacc] in carry step.
 INFO: restart file p700001 found...reading...
ERROR: at line 5668 of file ../src/Mlucas.c
Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry = 1, most sig. word =             -21.0000

.stat file:

INFO: primary restart file p700001 not found...looking for secondary...
INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
M700001: using FFT length 96K = 98304 8-byte floats, initial residue shift count = 51327
This gives an average    7.120778401692708 bits per digit
The test will be done in form of a 3-PRP test.
Using complex FFT radices        24         8        16        16
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
<snip>
Restarting M700001 at iteration = 380000. Res64: 03980C61BBAC8628, residue shift count = 460455
M700001: using FFT length 96K = 98304 8-byte floats, initial residue shift count = 460455
This gives an average    7.120778401692708 bits per digit
The test will be done in form of a 3-PRP test.
[2024-04-07 04:21:14] M700001 Iter# = 390000 [55.71% complete] clocks = 00:00:13.071 [  1.3072 msec/iter] Res64: D0BA77D74E6BBBEA. AvgMaxErr = 0.000000004. MaxErr = 0.000000006. Residue shift count = 644618.
Restarting M700001 at iteration = 390000. Res64: D0BA77D74E6BBBEA, residue shift count = 644618
M700001: using FFT length 96K = 98304 8-byte floats, initial residue shift count = 644618
This gives an average    7.120778401692708 bits per digit
The test will be done in form of a 3-PRP test.
[2024-04-07 04:21:27] M700001 Iter# = 400000 [57.14% complete] clocks = 00:00:12.907 [  1.2908 msec/iter] Res64: 6BAE11EC9CF55E94. AvgMaxErr = 0.000000004. MaxErr = 0.000000006. Residue shift count = 28904.
Restarting M700001 at iteration = 400000. Res64: 6BAE11EC9CF55E94, residue shift count = 28904
M700001: using FFT length 96K = 98304 8-byte floats, initial residue shift count = 28904
This gives an average    7.120778401692708 bits per digit
The test will be done in form of a 3-PRP test.
[2024-04-07 04:21:40] M700001 Iter# = 410000 [58.57% complete] clocks = 00:00:12.994 [  1.2995 msec/iter] Res64: 5754D94558C49ED9. AvgMaxErr = 0.000000004. MaxErr = 0.000000007. Residue shift count = 285098.

xanthe-cat commented 5 months ago

I ran the same PRP on my M1/ASIMD build, where it selected a much smaller FFT of 36K; are you able to use a smaller FFT than 96K? First, p700001.stat:

INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
M700001: using FFT length 36K = 36864 8-byte floats, initial residue shift count = 51327
This gives an average   18.988742404513889 bits per digit
The test will be done in form of a 3-PRP test.
Using complex FFT radices        36        32        16
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
[2024-04-09 08:21:22] M700001 Iter# = 10000 [ 1.43% complete] clocks = 00:00:04.149 [  0.4150 msec/iter] Res64: F73F55AC8F92C1F0. AvgMaxErr = 0.036662518. MaxErr = 0.062500000. Residue shift count = 470088.
...
[2024-04-09 08:24:04] M700001 Iter# = 390000 [55.71% complete] clocks = 00:00:04.214 [  0.4214 msec/iter] Res64: D0BA77D74E6BBBEA. AvgMaxErr = 0.036756567. MaxErr = 0.062500000. Residue shift count = 644618.
[2024-04-09 08:24:08] M700001 Iter# = 400000 [57.14% complete] clocks = 00:00:04.188 [  0.4188 msec/iter] Res64: 6BAE11EC9CF55E94. AvgMaxErr = 0.036776587. MaxErr = 0.062500000. Residue shift count = 28904.
[2024-04-09 08:24:13] M700001 Iter# = 410000 [58.57% complete] clocks = 00:00:04.225 [  0.4225 msec/iter] Res64: 5754D94558C49ED9. AvgMaxErr = 0.036733561. MaxErr = 0.054687500. Residue shift count = 285098.
...
[2024-04-09 08:26:20] M700001 Iter# = 700000 [100.00% complete] clocks = 00:00:04.651 [  0.4652 msec/iter] Res64: 3D70083B9439BA98. AvgMaxErr = 0.036834389. MaxErr = 0.062500000. Residue shift count = 51327.
[2024-04-09 08:26:20] M700001 Iter# = 700001 [100.00% complete] clocks = 00:00:00.000 [  0.7658 msec/iter] Res64: C8C0467CC5E32F55. AvgMaxErr = 0.027343750. MaxErr = 0.027343750. Residue shift count = 102654.
M700001 is not prime. Program: E21.0.1. Final residue shift count = 102654.
If using the manual results submission form at mersenne.org, paste the following JSON-formatted results line:
{"status":"C", "exponent":700001, "worktype":"PRP-3", "res64":"32C007D4F98B0542", "residue-type":1, "fft-length":36864, "shift-count":102654, "error-code":"00000000", "program":{"name":"Mlucas", "version":"21.0.1"}, "timestamp":"2024-04-08 22:26:20 UTC"}

Your run appears to have the 10000-iteration restart bug, which I might flag separately as an issue. The output from Mlucas looked like:

cxc@192-168-1-3 obj_asimd % ./Mlucas         

    Mlucas 21.0.1

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
INFO: 16384 MB of available system RAM detected.
CPU Family = ARM Embedded ABI, OS = OS X, 64-bit Version, compiled with Gnu-C-compatible [llvm/clang], Version 14.0.0 (clang-1400.0.29.202).
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation. 
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 8 available processor cores.
INFO: testing FFT radix tables...
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 looking for worktodo.txt file...
 worktodo.txt file found...reading next assignment...
 worktodo.txt entry: PRP=1,2,700001,-1,75,0

INFO: Maximum recommended exponent for FFT length (36 Kdbl) = 759433; p[ = 700001]/pmax_rec = 0.9217416151.
Initial DWT-multipliers chain length = [long] in carry step.
INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
mers_mod_square: Init threadpool of 1 threads
Using 1 threads in carry step
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
M700001 is not prime. Program: E21.0.1. Final residue shift count = 102654.
If using the manual results submission form at mersenne.org, paste the following JSON-formatted results line:
{"status":"C", "exponent":700001, "worktype":"PRP-3", "res64":"32C007D4F98B0542", "residue-type":1, "fft-length":36864, "shift-count":102654, "error-code":"00000000", "program":{"name":"Mlucas", "version":"21.0.1"}, "timestamp":"2024-04-08 22:26:20 UTC"}

tdulcet commented 5 months ago

> I ran the same PRP on my M1/ASIMD build where it selected a much smaller FFT of 36K; are you able to use a smaller FFT than 96K?

Thanks for testing it. When I passed the -fft 36K option, Mlucas still used 96K, but I was able to fudge the ms/iter speeds in mlucas.cfg so that the 36K FFT length appeared faster. This caused it to use 36K, which did work as expected:

$ ./Mlucas -cpu 0:3

    Mlucas 21.0.1

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
System total RAM = 15893, free RAM = 15326
INFO: 15326 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
HWLOC Version = 2.5.0;
        Hardware topology: 7 levels, 1 sockets, 6 cores, 12 logical processors (threads)
INFO: Build uses AVX2 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 12 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 4 cores: 0.1.2.3.
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
NTHREADS = 4
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 looking for worktodo.txt file...
 worktodo.txt file found...reading next assignment...
 worktodo.txt entry: PRP=1,2,700001,-1

INFO: Maximum recommended exponent for FFT length (36 Kdbl) = 759433; p[ = 700001]/pmax_rec = 0.9217416151.
Initial DWT-multipliers chain length = [long] in carry step.
INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
mers_mod_square: Init threadpool of 4 threads
Using 4 threads in carry step
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
M700001 is not prime. Program: E21.0.1. Final residue shift count = 102654.
If using the manual results submission form at mersenne.org, paste the following JSON-formatted results line:
{"status":"C", "exponent":700001, "worktype":"PRP-3", "res64":"32C007D4F98B0542", "residue-type":1, "fft-length":36864, "shift-count":102654, "error-code":"00000000", "program":{"name":"Mlucas", "version":"21.0.1"}, "timestamp":"2024-04-09 09:43:00 UTC"}

The bug must be related to using a larger-than-optimal FFT length. That should work, but in the meantime maybe Mlucas should not try to use FFT lengths that are more than some multiple of the optimal length, even if they are faster.
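
To put numbers on this, here is a minimal sketch in C of the "bits per digit" figure from the .stat files, plus the kind of guard suggested above. The function names, the `reference_len` parameter, and the `max_ratio` cutoff are all hypothetical; nothing like `fft_length_allowed()` is known to exist in Mlucas.

```c
#include <stdbool.h>
#include <stdint.h>

/* Average number of bits of the exponent p carried by each FFT word.
   n_doubles is the FFT length in 8-byte floats (e.g. 96K = 98304). */
double bits_per_digit(uint32_t p, uint32_t n_doubles)
{
    return (double)p / (double)n_doubles;
}

/* Hypothetical guard (an assumption, not Mlucas code): accept an FFT
   length only if it is at most max_ratio times a reference length,
   e.g. the smallest length known to handle the exponent. */
bool fft_length_allowed(uint32_t n_doubles, uint32_t reference_len,
                        double max_ratio)
{
    return (double)n_doubles <= max_ratio * (double)reference_len;
}
```

For p = 700001, `bits_per_digit()` gives about 7.12 at 96K and about 18.99 at 36K, matching the two .stat files. With an illustrative `max_ratio` of 2.0 and 36K as the reference, the 96K length would be rejected, since 98304 / 36864 is roughly 2.67.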

xanthe-cat commented 4 months ago

I have a question about something which might be tangentially related to this problem; do you know how the code throttles the variable which decides how to chain multiplications together? In the Mlucas standard output there are lines such as:

Initial DWT-multipliers chain length = [long] in carry step.

If things are not going well, one of Ernst’s tricks is to change the chain length; your output above soon changes to:

Initial DWT-multipliers chain length = [hiacc] in carry step.

Usually [long] is the fastest mode, though Ernst has three further settings to more carefully multiply, [medium], [short], and dialling things up to eleven, [hiacc]. I presume that last setting is an abbreviation for “high accuracy”. One of my problems (trying to use a teensy FFT for the Suyama test of $F_{13}$) is that it seems to do a whole lot of mod-square calculations (eight thousand or so) fine, and then as it tries to perform the final one it drops the ball with this carry error. Since the Suyama call is a separate part of the codebase, I would like to tell Mlucas to use the [hiacc] setting for that one mod-square operation, but I don’t see how that can even be specified.

tdulcet commented 4 months ago

In my example, I do not believe it should be using [hiacc], as the ROE is already extremely low (MaxErr = 0.000000006) due to the excessively large FFT length, so I suspect that this is a separate issue caused by #21.

Anyway, to answer your question, the "chain length" is not something that can generally be specified on a per-iteration basis. When Mlucas detects that the ROE is too high, it first tries to resolve the issue by switching to a more accurate chain-length setting, before finally resorting to increasing the FFT length, which is of course much more costly in terms of performance. When it does change the chain length, it restarts the test from the last savefile, which means that it loses up to 10K iterations by default. Considering that the entire F13 test has fewer than 10K iterations, it would probably be easiest to force the test to use [hiacc] from the start. In that case, just adjust the logic as needed here: https://github.com/primesearch/Mlucas/blob/18398583da19f270eed22a036c27d9b6beb9973d/src/Mlucas.c#L1351-L1360 For example, you could add a USE_SHORT_CY_CHAIN = USE_SHORT_CY_CHAIN_MAX; line above the fprintf() call.
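
The escalation described above can be modelled roughly as follows. This is a sketch under stated assumptions, not the actual Mlucas logic: only the names USE_SHORT_CY_CHAIN_MAX and the four bracketed labels come from this thread; the enum values, the one-step-at-a-time policy, and both functions are hypothetical.

```c
/* Settings in the order listed in this thread: [long], [medium],
   [short], [hiacc]. The numeric values are an assumption. */
enum { CY_CHAIN_LONG = 0, CY_CHAIN_MEDIUM, CY_CHAIN_SHORT, CY_CHAIN_HIACC };

/* Name from the thread; the value chosen here is an assumption. */
static const int USE_SHORT_CY_CHAIN_MAX = CY_CHAIN_HIACC;

typedef enum { RETRY_SAME_FFT, RETRY_LARGER_FFT } roe_action;

/* Hypothetical response to an excessive ROE: step the chain setting
   toward [hiacc] and retry from the last savefile; once [hiacc] is
   already in use, fall back to a larger FFT length. */
roe_action on_excessive_roe(int *use_short_cy_chain)
{
    if (*use_short_cy_chain < USE_SHORT_CY_CHAIN_MAX) {
        (*use_short_cy_chain)++;
        return RETRY_SAME_FFT;
    }
    return RETRY_LARGER_FFT;
}

/* Label as it would appear inside the brackets in the log line. */
const char *cy_chain_name(int setting)
{
    static const char *names[] = { "long", "medium", "short", "hiacc" };
    if (setting < 0 || setting > CY_CHAIN_HIACC)
        return "unknown";
    return names[setting];
}
```

In this model, forcing [hiacc] from the start amounts to initialising the setting to USE_SHORT_CY_CHAIN_MAX before the first iteration, so the first excessive ROE would go straight to a larger FFT length without losing iterations to chain-length restarts.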