primesearch / Mlucas

Ⓜ️ Ernst Mayer's Mlucas and Mfactor programs for the Great Internet Mersenne Prime Search (GIMPS)
https://mersenneforum.org/mayer/README.html
GNU General Public License v3.0

Assertion failed: For completed S1 expect ilo == maxiter == PM1_S1_PROD_BITS #31

Open proski opened 3 weeks ago

proski commented 3 weeks ago

I'm using the mlucas.sh script from Distributed-Computing-Scripts to run Mlucas on aarch64 Linux (Fedora 40 Asahi). Initially, 8 processes were started. I stopped and restarted the processes several times using the jobs.sh script created in the obj directory (I didn't use crontab). At some point, I noticed that only 4 processes were running and the other 4 refused to start. The ones that failed were the processes that had reached stage 2. Their log files show an assertion error:

 INFO: restart file p124130143 found...reading...
ERROR: at line 1776 of file ../src/Mlucas.c
Assertion failed: For completed S1 expect ilo == maxiter == PM1_S1_PROD_BITS!

I'm attaching the full log. Mlucas.out.txt

Also note that the log contains "Stage 2: No factor found." but I'm not sure whether that is the final result.

I can reproduce the error by running ../Mlucas -core '4' -maxalloc 11.25 in the Mlucas/obj/run4 directory.

proski commented 3 weeks ago

Adding a debug print statement immediately before the ASSERT gives some useful information.

Core 4: ilo = 1490000, maxiter = 1443032, PM1_S1_PROD_BITS = 1443032
Core 5: ilo = 1490000, maxiter = 1443032, PM1_S1_PROD_BITS = 1443032
Core 6: ilo = 1630000, maxiter = 1587485, PM1_S1_PROD_BITS = 1587485
Core 7: ilo = 1490000, maxiter = 1443032, PM1_S1_PROD_BITS = 1443032
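
For anyone wanting to repeat this, a print of roughly the following shape would produce output in that format. This is only a minimal sketch, not the statement actually added to Mlucas.c: the helper name format_s1_state, the unsigned types, and the "Core N:" prefix are assumptions made for illustration, and the real variables may be declared differently.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Mlucas): formats the three values
 * that the failing ASSERT compares. The unsigned types are an assumption. */
int format_s1_state(char *buf, size_t len, int core,
                    unsigned ilo, unsigned maxiter, unsigned s1_prod_bits)
{
    /* snprintf never writes more than len bytes and always
     * NUL-terminates when len > 0. */
    return snprintf(buf, len,
                    "Core %d: ilo = %u, maxiter = %u, PM1_S1_PROD_BITS = %u",
                    core, ilo, maxiter, s1_prod_bits);
}
```

Writing into a buffer rather than printing directly keeps the sketch easy to check. In practice the formatted line would be sent to stderr or the run's log file just before the assertion.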

tdulcet commented 3 weeks ago

Thank you for the bug report and the detailed information. From reading your provided output file, I believe I found the cause of the issue. In one of the restarts, there was not enough available RAM for stage 2, so it decided to use a larger B1 value instead:

pm1_set_bounds: Stage 2 needs at least 24+5 buffers ... each buffer needs 52 MB; avail-RAM allows 21 such.
pm1_set_bounds: Insufficient free memory for Stage 2 ... will run only Stage 1.
Setting default p-1 stage bounds b1 = 1300000, b2_start = 0, b2 = 0.

However, in the following restart there was enough available RAM for stage 2, but that assert statement was not expecting the larger B1 value from the savefile:

pm1_set_bounds: Stage 2 needs at least 24+5 buffers ... each buffer needs 52 MB; avail-RAM allows 55 such.
Setting default p-1 stage bounds b1 = 1000000, b2_start = 1000000, b2 = 33000000.
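
The arithmetic behind these two log excerpts can be sketched as follows. The constants come straight from the log lines (24+5 buffers at 52 MB each). The function names and the simple "available buffers must cover the minimum" rule are assumptions for illustration, not the actual pm1_set_bounds logic.

```c
#include <stdbool.h>

/* Values taken from the pm1_set_bounds log lines for this exponent. */
enum { S2_MIN_BUFFERS = 24 + 5, BUFFER_MB = 52 };

/* Minimum memory, in MB, that stage 2 needs at the logged buffer size. */
int stage2_min_mb(void)
{
    return S2_MIN_BUFFERS * BUFFER_MB;
}

/* Assumed decision rule: stage 2 is attempted only when the number of
 * buffers that fit in available RAM reaches the minimum. */
bool stage2_fits(int avail_buffers)
{
    return avail_buffers >= S2_MIN_BUFFERS;
}
```

Under these assumptions stage 2 needs about 29 × 52 = 1508 MB per worker. The restart that could fit only 21 buffers (about 1092 MB) falls short, while the restart that could fit 55 buffers (about 2860 MB) clears it. That matches the two different outcomes in the log.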

Note the B1 = 1300000 (above) vs 1000000 (below). The solution might be to change this assert statement to ignore the B1 value from the savefile (the ilo variable):

ASSERT(HERE, maxiter == PM1_S1_PROD_BITS, "For completed S1 expect maxiter == PM1_S1_PROD_BITS!");
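
To show the effect of that change on the values reported above, here is a minimal standalone sketch. The two function names and the unsigned types are hypothetical. The strict version simply encodes the condition stated in the assertion message, and is not a copy of the Mlucas source.

```c
#include <stdbool.h>

/* Condition as worded in the original assertion message:
 * all three values must agree. */
bool s1_check_strict(unsigned ilo, unsigned maxiter, unsigned s1_prod_bits)
{
    return ilo == maxiter && maxiter == s1_prod_bits;
}

/* Suggested relaxation: ignore ilo (the value read from the savefile)
 * and compare only maxiter against PM1_S1_PROD_BITS. */
bool s1_check_relaxed(unsigned maxiter, unsigned s1_prod_bits)
{
    return maxiter == s1_prod_bits;
}
```

With the debug values from cores 4 to 7 (for example ilo = 1490000, maxiter = 1443032, PM1_S1_PROD_BITS = 1443032), the strict check fails and the relaxed one passes. Whether ignoring ilo is safe for the rest of the restart logic is a separate question that this sketch does not answer.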

However, the larger issue here is that my Mlucas install script is designed to maximize throughput for LL, PRP and P-1 stage 1 tests. That tuning does not carry over to P-1 stage 2, where performance depends largely on the amount of available RAM. If you were planning to do P-1 tests exclusively with Mlucas, I would recommend using only 1 or 2 workers/runs. This would of course decrease the throughput for stage 1, but the performance gains in stage 2 would likely more than make up for that. It should also eliminate the possibility of workers not having enough available RAM for stage 2. An alternative would be to increase the -maxalloc percentage so that each worker can use slightly more RAM.

> Also note that the log contains Stage 2: No factor found. but I'm not sure it's the final result.

Yes, Mlucas performs several GCDs during stage 2 by default. You could check the .stat file to see the percentage complete and more information.

proski commented 3 weeks ago

Thank you for your quick response! I just wanted to test Mlucas and chose a test type I hadn't done before (P-1). I'm going to complete the assignments already started and then try tests that don't require so much memory.