primesearch / Mlucas

Ⓜ️ Ernst Mayer's Mlucas and Mfactor programs for GIMPS
https://mersenneforum.org/mayer/README.html
GNU General Public License v3.0

Segmentation fault for "small" and "medium" self-tests on AMD 7950X #16

Open Hermann-SW opened 8 months ago

Hermann-SW commented 8 months ago

I was asked to run "./Mlucas -s tiny" here: https://github.com/primesearch/Mlucas/issues/15#issuecomment-2027020472

I wanted to open this issue to report that the "small" and "medium" self-tests dump core.

hermann@7950x:~/Mlucas/obj$ uname -a
Linux 7950x 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
hermann@7950x:~/Mlucas/obj$ head -1 /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.4 LTS"
hermann@7950x:~/Mlucas/obj$ lscpu |grep "Model "
Model name:                         AMD Ryzen 9 7950X 16-Core Processor
hermann@7950x:~/Mlucas/obj$ 
hermann@7950x:~/Mlucas/obj$ ./Mlucas -s small

    Mlucas 20.1.1

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
System total RAM = 31190, free RAM = 29671
INFO: 29671 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
INFO: Build uses AVX512 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 worktodo.txt file not found...using user-supplied command-line exponent p = 5152643
INFO: Maximum recommended exponent for FFT length (256 Kdbl) = 5178863; p[ = 5152643]/pmax_rec = 0.9949371126.
Initial DWT-multipliers chain length = [hiacc] in carry step.
M5152643: using FFT length 256K = 262144 8-byte floats, initial residue shift count = 1159220
This gives an average   19.655773162841797 bits per digit
Using complex FFT radices       256        16        32
mers_mod_square: Init threadpool of 1 threads
Using 1 threads in carry step
M5152643 Roundoff warning on iteration        2, maxerr =   0.420149805045
M5152643 Roundoff warning on iteration        6, maxerr =   0.414123535156
M5152643 Roundoff warning on iteration       10, maxerr =   0.406272888184
M5152643 Roundoff warning on iteration       11, maxerr =   0.411499023438
Segmentation fault (core dumped)
hermann@7950x:~/Mlucas/obj$ 

and

hermann@7950x:~/Mlucas/obj$ ./Mlucas -s medium

    Mlucas 20.1.1

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
System total RAM = 31190, free RAM = 29671
INFO: 29671 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
INFO: Build uses AVX512 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 worktodo.txt file not found...using user-supplied command-line exponent p = 39003229
INFO: Maximum recommended exponent for FFT length (2048 Kdbl) = 39606917; p[ = 39003229]/pmax_rec = 0.9847580159.
Initial DWT-multipliers chain length = [short] in carry step.
M39003229: using FFT length 2048K = 2097152 8-byte floats, initial residue shift count = 31137256
This gives an average   18.598188877105713 bits per digit
Using complex FFT radices      1024        32        32
mers_mod_square: Init threadpool of 1 threads
Using 1 threads in carry step
Segmentation fault (core dumped)
hermann@7950x:~/Mlucas/obj$ 

tdulcet commented 8 months ago

Thanks for the bug report! I recall Ken also had issues with a newer AMD CPU, but we were not able to fully debug the problem. It looks like you are using the AVX512 build mode with GCC 11.4. Do the "teensy", "tiny", "large" and "huge" self-tests work as expected?

Would you mind running Mlucas with GDB so we can see exactly where it is seg faulting? If you have GDB installed, for the "small" self-test just run: gdb -args ./Mlucas -s small, type r (run) to start Mlucas, then bt (backtrace) to show the stack trace after it crashes. Then repeat this for the "medium" self-test.

In addition, could you try building it with Clang so we can rule out any compiler differences? If you have Clang installed, just remove the existing obj directory, run: export CC=clang, and then rerun the makemake.sh script as you did before: bash makemake.sh use_hwloc.

Hermann-SW commented 8 months ago

How do I start the 5 self-tests?

hermann@7950x:~/Mlucas/obj$ gdb -args ./Mlucas -s small
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
--Type <RET> for more, q to quit, c to continue without paging--
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./Mlucas...
(gdb) r
Starting program: /home/hermann/Mlucas/obj/Mlucas -s small
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

    Mlucas 20.1.1

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
System total RAM = 31190, free RAM = 29625
INFO: 29625 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
INFO: Build uses AVX512 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
[Detaching after vfork from child process 2394]
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 worktodo.txt file not found...using user-supplied command-line exponent p = 5152643
INFO: Maximum recommended exponent for FFT length (256 Kdbl) = 5178863; p[ = 5152643]/pmax_rec = 0.9949371126.
Initial DWT-multipliers chain length = [hiacc] in carry step.
M5152643: using FFT length 256K = 262144 8-byte floats, initial residue shift count = 1159220
This gives an average   19.655773162841797 bits per digit
Using complex FFT radices       256        16        32
[New Thread 0x7ffff2b7a640 (LWP 2395)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7ffff1cb9640 (LWP 2396)]
Using 1 threads in carry step
M5152643 Roundoff warning on iteration        2, maxerr =   0.420149805045
M5152643 Roundoff warning on iteration        6, maxerr =   0.414123535156
M5152643 Roundoff warning on iteration       10, maxerr =   0.406272888184
M5152643 Roundoff warning on iteration       11, maxerr =   0.411499023438

Thread 3 "Mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff1cb9640 (LWP 2396)]
0x000055555572efa7 in cy256_process_chunk (targ=<optimized out>) at ../src/radix256_main_carry_loop.h:472
472             AVX_cmplx_carry_fast_pow2_wtsinit_X16(add1,add2,add3, itmp, half_arr,sign_mask, n_minus_sil,n_minus_silp1,sinwt,sinwtm1, sse_bw,sse_nm1)
(gdb) bt
#0  0x000055555572efa7 in cy256_process_chunk (targ=<optimized out>)
    at ../src/radix256_main_carry_loop.h:472
#1  0x00005555558e3062 in worker_thr_routine (data=0x555555a5ea80)
    at ../src/threadpool.c:452
#2  0x00007ffff7c94ac3 in start_thread (arg=<optimized out>)
    at ./nptl/pthread_create.c:442
#3  0x00007ffff7d26850 in clone3 ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) 

tdulcet commented 8 months ago

Thanks for the additional information! That is very helpful.

"How do I start the 5 self-tests?"

To start the self-tests, just pass each self-test value to the -s option:

./Mlucas -s teensy
./Mlucas -s tiny
./Mlucas -s small
./Mlucas -s medium
./Mlucas -s large
./Mlucas -s huge

If you want to run them in GDB instead, just prefix each command with gdb -args.
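
The six runs listed above could also be scripted. The following is a minimal sketch, not part of Mlucas: it assumes Python 3 on Linux and a binary at ./Mlucas, and the names SELF_TESTS, describe_exit and run_self_tests are hypothetical helpers introduced only for illustration. It relies on the fact that, on POSIX systems, subprocess reports a process killed by a signal as a negative return code.

```python
import signal
import subprocess

# Self-test names taken from the command list in this thread.
SELF_TESTS = ["teensy", "tiny", "small", "medium", "large", "huge"]


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX a negative code means "terminated by signal -returncode".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit code {returncode}"


def run_self_tests(binary: str = "./Mlucas", tests=SELF_TESTS) -> dict:
    """Run each self-test in turn and collect its exit status (assumed CLI: -s NAME)."""
    results = {}
    for name in tests:
        proc = subprocess.run(
            [binary, "-s", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[name] = describe_exit(proc.returncode)
    return results
```

Calling run_self_tests() from the obj directory would return a dictionary mapping each self-test name to its status. A run that ends in a segmentation fault should show up as "killed by SIGSEGV".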

Hermann-SW commented 8 months ago

I ran all 6: teensy/tiny/large/huge complete (huge took 15 min at around 100% CPU). small dumps core; the bt is in my previous comment.

hermann@7950x:~/Mlucas/obj$ gdb -args ./Mlucas -s medium
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
--Type <RET> for more, q to quit, c to continue without paging--

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./Mlucas...
(gdb) r
Starting program: /home/hermann/Mlucas/obj/Mlucas -s medium
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

    Mlucas 20.1.1

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
System total RAM = 31190, free RAM = 29308
INFO: 29308 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
INFO: Build uses AVX512 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
[Detaching after vfork from child process 3784]
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...

           Mlucas selftest running.....

/****************************************************************************/

User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 worktodo.txt file not found...using user-supplied command-line exponent p = 39003229
INFO: Maximum recommended exponent for FFT length (2048 Kdbl) = 39606917; p[ = 39003229]/pmax_rec = 0.9847580159.
Initial DWT-multipliers chain length = [short] in carry step.
M39003229: using FFT length 2048K = 2097152 8-byte floats, initial residue shift count = 31137256
This gives an average   18.598188877105713 bits per digit
Using complex FFT radices      1024        32        32
[New Thread 0x7fffda2f8640 (LWP 3785)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7fffd7eeb640 (LWP 3786)]
Using 1 threads in carry step

Thread 3 "Mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7eeb640 (LWP 3786)]
0x00005555555f1d8d in cy1024_process_chunk (targ=<optimized out>) at ../src/radix1024_main_carry_loop.h:275
275             AVX_cmplx_carry_fast_pow2_wtsinit_X16(add1,add2,add3, itmp, half_arr,sign_mask, n_minus_sil,n_minus_silp1,sinwt,sinwtm1, sse_bw,sse_nm1)
(gdb) bt
#0  0x00005555555f1d8d in cy1024_process_chunk (targ=<optimized out>) at ../src/radix1024_main_carry_loop.h:275
#1  0x00005555558e3062 in worker_thr_routine (data=0x555555a338f0) at ../src/threadpool.c:452
#2  0x00007ffff7c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#3  0x00007ffff7d26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) 

tdulcet commented 8 months ago

Thanks for testing them all. Yes, the larger self-tests do take progressively longer, but 15 minutes is actually quite fast for the huge self-test when run single threaded.

It looks like both the small and medium self-tests are seg faulting in the same inline assembly. @ldesnogu - Since you are our resident inline assembly expert, do you have any insights as to why this is seg faulting on AMD CPUs?

ldesnogu commented 8 months ago

Alas, I have no access to an AMD machine. You can identify the faulting instruction by running under gdb and doing "disassemble $pc".

Hermann-SW commented 8 months ago

hermann@7950x:~/Mlucas/obj$ gdb -args ./Mlucas -s medium
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Thread 3 "Mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7eeb640 (LWP 2248)]
0x00005555555f1d8d in cy1024_process_chunk (targ=<optimized out>) at ../src/radix1024_main_carry_loop.h:275
275             AVX_cmplx_carry_fast_pow2_wtsinit_X16(add1,add2,add3, itmp, half_arr,sign_mask, n_minus_sil,n_minus_silp1,sinwt,sinwtm1, sse_bw,sse_nm1)
(gdb) disassemble $pc
Dump of assembler code for function cy1024_process_chunk:
   0x00005555555eb220 <+0>: endbr64 
   0x00005555555eb224 <+4>: lea    0x8(%rsp),%r10
   0x00005555555eb229 <+9>: and    $0xffffffffffffffc0,%rsp
   0x00005555555eb22d <+13>:    push   -0x8(%r10)
   0x00005555555eb231 <+17>:    push   %rbp
   0x00005555555eb232 <+18>:    mov    %rsp,%rbp
   0x00005555555eb235 <+21>:    push   %r15
   0x00005555555eb237 <+23>:    push   %r14
   0x00005555555eb239 <+25>:    push   %r13
   0x00005555555eb23b <+27>:    push   %r12
   0x00005555555eb23d <+29>:    push   %r10
   0x00005555555eb23f <+31>:    push   %rbx
   0x00005555555eb240 <+32>:    sub    $0xbc0,%rsp
   0x00005555555eb247 <+39>:    mov    %rdi,-0xb98(%rbp)
   0x00005555555eb24e <+46>:    mov    %fs:0x28,%rax
   0x00005555555eb257 <+55>:    mov    %rax,-0x38(%rbp)
   0x00005555555eb25b <+59>:    mov    0x3a1eeb(%rip),%eax        # 0x55555598d14c <USE_SHORT_CY_CHAIN>
--Type <RET> for more, q to quit, c to continue without paging--
   0x00005555555eb261 <+65>:    movl   $0x10,-0xa98(%rbp)
   0x00005555555eb26b <+75>:    test   %eax,%eax
   0x00005555555eb26d <+77>:    je     0x5555555eb285 <cy1024_process_chunk+101>
   0x00005555555eb26f <+79>:    cmp    $0x1,%eax
   0x00005555555eb272 <+82>:    sete   %al
   0x00005555555eb275 <+85>:    movzbl %al,%eax
   0x00005555555eb278 <+88>:    lea    0x4(,%rax,4),%eax
   0x00005555555eb27f <+95>:    mov    %eax,-0xa98(%rbp)
   0x00005555555eb285 <+101>:   mov    -0xb98(%rbp),%r12
   0x00005555555eb28c <+108>:   lea    0x353fd5(%rip),%r13        # 0x55555593f268
   0x00005555555eb293 <+115>:   xor    %edx,%edx
   0x00005555555eb295 <+117>:   mov    $0xc91,%edi
   0x00005555555eb29a <+122>:   mov    %r13,%rcx
   0x00005555555eb29d <+125>:   mov    0x4(%r12),%ebx
   0x00005555555eb2a2 <+130>:   vmovsd 0x18(%r12),%xmm5
   0x00005555555eb2a9 <+137>:   vmovsd 0x58(%r12),%xmm0
   0x00005555555eb2b0 <+144>:   mov    0x24(%r12),%eax
   0x00005555555eb2b5 <+149>:   mov    %ebx,-0xacc(%rbp)
--Type <RET> for more, q to quit, c to continue without paging--
   0x00005555555eb2bb <+155>:   mov    0x8(%r12),%ebx
   0x00005555555eb2c0 <+160>:   mov    %eax,-0xa04(%rbp)
   0x00005555555eb2c6 <+166>:   mov    %ebx,%esi
   0x00005555555eb2c8 <+168>:   vmovsd %xmm5,-0xbb0(%rbp)
   0x00005555555eb2d0 <+176>:   vmovsd 0x50(%r12),%xmm5
   0x00005555555eb2d7 <+183>:   shl    $0xa,%esi
   0x00005555555eb2da <+186>:   mov    %esi,-0xb58(%rbp)
   0x00005555555eb2e0 <+192>:   mov    0xc(%r12),%esi
   0x00005555555eb2e5 <+197>:   mov    %esi,-0xb64(%rbp)
   0x00005555555eb2eb <+203>:   mov    0x10(%r12),%esi
   0x00005555555eb2f0 <+208>:   vmovsd %xmm5,-0xb48(%rbp)
   0x00005555555eb2f8 <+216>:   mov    %esi,-0xa30(%rbp)
   0x00005555555eb2fe <+222>:   mov    0x20(%r12),%esi
   0x00005555555eb303 <+227>:   mov    %esi,-0xb90(%rbp)
   0x00005555555eb309 <+233>:   mov    0x28(%r12),%esi
   0x00005555555eb30e <+238>:   mov    %esi,-0xb8c(%rbp)
   0x00005555555eb314 <+244>:   mov    0x2c(%r12),%esi
   0x00005555555eb319 <+249>:   mov    %esi,-0xae0(%rbp)
--Type <RET> for more, q to quit, c to continue without paging--
   0x00005555555eb31f <+255>:   mov    0x30(%r12),%esi
   0x00005555555eb324 <+260>:   mov    %esi,-0xb68(%rbp)
   0x00005555555eb32a <+266>:   mov    0x34(%r12),%esi
   0x00005555555eb32f <+271>:   mov    %esi,-0xb54(%rbp)
   0x00005555555eb335 <+277>:   mov    0x38(%r12),%esi
   0x00005555555eb33a <+282>:   mov    %esi,-0xb28(%rbp)
   0x00005555555eb340 <+288>:   mov    0x3c(%r12),%esi
   0x00005555555eb345 <+293>:   mov    %esi,-0xb74(%rbp)
   0x00005555555eb34b <+299>:   mov    0x40(%r12),%esi
   0x00005555555eb350 <+304>:   mov    %esi,-0xaa0(%rbp)
   0x00005555555eb356 <+310>:   vmovsd %xmm0,-0x900(%rbp)
   0x00005555555eb35e <+318>:   mov    0x78(%r12),%r15
   0x00005555555eb363 <+323>:   mov    0x80(%r12),%r14
   0x00005555555eb36b <+331>:   vmovsd 0x347db5(%rip),%xmm5        # 0x555555933128
   0x00005555555eb373 <+339>:   mov    0x60(%r12),%rsi
   0x00005555555eb378 <+344>:   vmovq  0x343b80(%rip),%xmm1        # 0x55555592ef00
   0x00005555555eb380 <+352>:   vmovsd (%r15),%xmm0
   0x00005555555eb385 <+357>:   vfmadd132sd (%r14),%xmm5,%xmm0
--Type <RET> for more, q to quit, c to continue without paging--
   0x00005555555eb38a <+362>:   vmovsd 0x343ade(%rip),%xmm5        # 0x55555592ee70
   0x00005555555eb392 <+370>:   mov    %rsi,-0xa28(%rbp)
   0x00005555555eb399 <+377>:   mov    0x68(%r12),%rsi
   0x00005555555eb39e <+382>:   mov    %rsi,-0xb80(%rbp)
   0x00005555555eb3a5 <+389>:   mov    0x70(%r12),%rsi
   0x00005555555eb3aa <+394>:   mov    %rsi,-0xac0(%rbp)
   0x00005555555eb3b1 <+401>:   lea    0x357f08(%rip),%rsi        # 0x5555559432c0
   0x00005555555eb3b8 <+408>:   vandpd %xmm1,%xmm0,%xmm0
   0x00005555555eb3bc <+412>:   vcomisd %xmm0,%xmm5
   0x00005555555eb3c0 <+416>:   seta   %dl
   0x00005555555eb3c3 <+419>:   call   0x5555559228a0 <ASSERT>
   0x00005555555eb3c8 <+424>:   vmovsd 0x8(%r15),%xmm0
   0x00005555555eb3ce <+430>:   xor    %edx,%edx
   0x00005555555eb3d0 <+432>:   mov    %r13,%rcx
   0x00005555555eb3d3 <+435>:   vmovsd 0x347d4d(%rip),%xmm5        # 0x555555933128
   0x00005555555eb3db <+443>:   vmovq  0x343b1d(%rip),%xmm1        # 0x55555592ef00
   0x00005555555eb3e3 <+451>:   lea    0x357ed6(%rip),%rsi        # 0x5555559432c0
   0x00005555555eb3ea <+458>:   mov    $0xc92,%edi
--Type <RET> for more, q to quit, c to continue without paging--
   0x00005555555eb3ef <+463>:   vfmadd132sd 0x8(%r14),%xmm5,%xmm0
   0x00005555555eb3f5 <+469>:   vmovsd 0x343a73(%rip),%xmm5        # 0x55555592ee70
   0x00005555555eb3fd <+477>:   vandpd %xmm1,%xmm0,%xmm0
   0x00005555555eb401 <+481>:   vcomisd %xmm0,%xmm5
   0x00005555555eb405 <+485>:   seta   %dl
   0x00005555555eb408 <+488>:   call   0x5555559228a0 <ASSERT>
   0x00005555555eb40d <+493>:   mov    0x88(%r12),%rsi
   0x00005555555eb415 <+501>:   lea    (%rbx,%rbx,1),%edx
   0x00005555555eb418 <+504>:   mov    %r12,-0xb98(%rbp)
   0x00005555555eb41f <+511>:   mov    0x3ff12f(%rip),%eax        # 0x5555559ea554 <DAT_BITS>
   0x00005555555eb425 <+517>:   mov    0x3ff125(%rip),%ecx        # 0x5555559ea550 <PAD_BITS>
   0x00005555555eb42b <+523>:   mov    %rsi,-0xb88(%rbp)
   0x00005555555eb432 <+530>:   mov    0x90(%r12),%rsi
   0x00005555555eb43a <+538>:   shrx   %eax,%edx,%edi
   0x00005555555eb43f <+543>:   shlx   %ecx,%edi,%edi
   0x00005555555eb444 <+548>:   add    %edx,%edi
   0x00005555555eb446 <+550>:   mov    %rsi,-0xaa8(%rbp)
   0x00005555555eb44d <+557>:   mov    0x98(%r12),%rsi
--Type <RET> for more, q to quit, c to continue without paging--

ldesnogu commented 8 months ago

I'm afraid you didn't scroll enough. The place where the segfault occurs would start with '=>'

For instance:

   0x000055555563dbcd <+381>: mov %ebx,-0x40(%rsp)
   0x000055555563dbd1 <+385>: lea (%r11,%rdx,8),%rdi
=> 0x000055555563dbd5 <+389>: movsd (%r8),%xmm0
   0x000055555563dbda <+394>: movslq %ebx,%rdx
   0x000055555563dbdd <+397>: mov -0x74(%rsp),%ebx
   0x000055555563dbe1 <+401>: movsd (%rdi),%xmm1

And when typing that comment, I noticed the '=>' disappeared from the quote... Anyway you need to dump until 0x00005555555f1d8d.

Hermann-SW commented 8 months ago

I found no "=>", but learned how to query the (big) offset:

Thread 3 "Mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7eeb640 (LWP 2444)]
0x00005555555f1d8d in cy1024_process_chunk (targ=<optimized out>) at ../src/radix1024_main_carry_loop.h:275
275             AVX_cmplx_carry_fast_pow2_wtsinit_X16(add1,add2,add3, itmp, half_arr,sign_mask, n_minus_sil,n_minus_silp1,sinwt,sinwtm1, sse_bw,sse_nm1)
(gdb) display/i $ps
1: x/i $ps
   0x10286: <error: Cannot access memory at address 0x10286>
(gdb) display/i $pc
2: x/i $pc
=> 0x5555555f1d8d <cy1024_process_chunk+27501>: vmovaps -0x30(%rbx),%zmm5
(gdb) 
...
   0x00005555555f1d79 <+27481>: mov    -0x9c8(%rbp),%rbx
   0x00005555555f1d80 <+27488>: vmovaps (%rax),%zmm4
   0x00005555555f1d86 <+27494>: vmovaps 0x40(%rax),%zmm20
=> 0x00005555555f1d8d <+27501>: vmovaps -0x30(%rbx),%zmm5
   0x00005555555f1d97 <+27511>: vmovaps -0x70(%rbx),%zmm21
   0x00005555555f1da1 <+27521>: vpermq %zmm5,%zmm3,%zmm5
   0x00005555555f1da7 <+27527>: vpermq %zmm21,%zmm3,%zmm21
   0x00005555555f1dad <+27533>: kshiftrw $0x8,%k1,%k3
   0x00005555555f1db3 <+27539>: kshiftrw $0x8,%k2,%k4
   0x00005555555f1db9 <+27545>: vbroadcastsd 0x1000(%rdi),%zmm17
   0x00005555555f1dc3 <+27555>: vbroadcastsd 0x1008(%rdi),%zmm18
   0x00005555555f1dcd <+27565>: vaddpd %zmm30,%zmm30,%zmm8{%k1}
   0x00005555555f1dd3 <+27571>: vaddpd %zmm30,%zmm30,%zmm24{%k3}
   0x00005555555f1dd9 <+27577>: vaddpd %zmm31,%zmm31,%zmm9{%k2}
   0x00005555555f1ddf <+27583>: vaddpd %zmm31,%zmm31,%zmm25{%k4}
   0x00005555555f1de5 <+27589>: vmulpd %zmm4,%zmm17,%zmm1
   0x00005555555f1deb <+27595>: vmulpd %zmm20,%zmm17,%zmm17
   0x00005555555f1df1 <+27601>: vmulpd %zmm5,%zmm18,%zmm2
   0x00005555555f1df7 <+27607>: vmulpd %zmm21,%zmm18,%zmm18
   0x00005555555f1dfd <+27613>: vmulpd %zmm8,%zmm1,%zmm1
   0x00005555555f1e03 <+27619>: vmulpd %zmm24,%zmm17,%zmm17
   0x00005555555f1e09 <+27625>: vmulpd %zmm9,%zmm2,%zmm2
   0x00005555555f1e0f <+27631>: vmulpd %zmm25,%zmm18,%zmm18
   0x00005555555f1e15 <+27637>: vmovaps %zmm1,(%rdi)
   0x00005555555f1e1b <+27643>: vmovaps %zmm17,0x40(%rdi)
   0x00005555555f1e22 <+27650>: vmovaps %zmm2,0x80(%rdi)
   0x00005555555f1e29 <+27657>: vmovaps %zmm18,0xc0(%rdi)
   0x00005555555f1e30 <+27664>: mov    -0x918(%rbp),%rax
   0x00005555555f1e37 <+27671>: mov    -0x908(%rbp),%rbx
   0x00005555555f1e3e <+27678>: vmovaps (%rax),%zmm6
   0x00005555555f1e44 <+27684>: vmovaps (%rbx),%zmm7
   0x00005555555f1e4a <+27690>: vpaddd %zmm6,%zmm0,%zmm0
   0x00005555555f1e50 <+27696>: vpandd %zmm7,%zmm0,%zmm0
   0x00005555555f1e56 <+27702>: vmovaps %zmm30,%zmm8
   0x00005555555f1e5c <+27708>: vmovaps %zmm30,%zmm24
   0x00005555555f1e62 <+27714>: vmovaps %zmm31,%zmm9
   0x00005555555f1e68 <+27720>: vmovaps %zmm31,%zmm25
   0x00005555555f1e6e <+27726>: mov    -0x9f0(%rbp),%rcx
--Type <RET> for more, q to quit, c to continue without paging--

ldesnogu commented 8 months ago

That's indeed much easier :-)

Now I'd like to see the register contents: 'i r'. I find it odd that a non-multiple of 64 (the vector length in bytes) is used as the offset. I'm not familiar enough with x86, but it's possible this is an unaligned access causing a fault.

Hermann-SW commented 8 months ago

3: x/i $pc
=> 0x5555555f1d8d <cy1024_process_chunk+27501>: vmovaps -0x30(%rbx),%zmm5
(gdb) i r
rax            0x5555559eef80      93824997060480
rbx            0x555d354b77f8      93858814457848
rcx            0x7fffd76e6380      140736807723904
rdx            0x7fffd76e6400      140736807724032
rsi            0x1020304050607     283686952306183
rdi            0x7fffd76961c0      140736807395776
rbp            0x7fffd7eead30      0x7fffd7eead30
rsp            0x7fffd7eea140      0x7fffd7eea140
r8             0xc                 12
r9             0x80                128
r10            0xf3e               3902
r11            0xc0                192
r12            0xc                 12
r13            0x7fffdc08f080      140736884961408
r14            0x7fffdc1f2280      140736886416000
r15            0x555555a00d80      93824997133696
rip            0x5555555f1d8d      0x5555555f1d8d <cy1024_process_chunk+27501>
eflags         0x10286             [ PF SF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x4101d475e22355b8  4684258690212255160
k1             0x0                 0
k2             0x0                 0
k3             0x0                 0
k4             0x41038e61a6ad0108  4684744587454775560
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0
(gdb) 

ldesnogu commented 8 months ago

The address is definitely not aligned then. But according to Intel documentation this should cause no issue.

Can you please try to dump memory around 0x555d354b77f8? You can play with the 'x/100x address' command. For instance 'x/64x 0x555d354b77f8' and 'x/64x 0x555d354b77b8'
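
The alignment question can be checked by hand from the register dump. The sketch below is illustrative only: it takes the rbx and rax values from the 'i r' output above, applies the -0x30 displacement of the faulting vmovaps, and reports the remainder modulo 64, the byte width of a zmm register. The helper names are hypothetical.

```python
VECTOR_BYTES = 64  # one zmm register holds 64 bytes (8 doubles)


def effective_address(base: int, displacement: int) -> int:
    """Address accessed by an operand of the form displacement(%base)."""
    return base + displacement


def misalignment(address: int, alignment: int = VECTOR_BYTES) -> int:
    """Bytes past the previous aligned boundary; 0 means aligned."""
    return address % alignment


# Register values copied from the gdb 'i r' output in this thread.
rbx = 0x555D354B77F8
rax = 0x5555559EEF80

faulting = effective_address(rbx, -0x30)  # vmovaps -0x30(%rbx),%zmm5
print(hex(faulting), misalignment(faulting))  # 0x555d354b77c8 8
print(hex(rax), misalignment(rax))            # 0x5555559eef80 0
```

Under these values the faulting load targets an address 8 bytes past a 64-byte boundary, whereas the earlier loads through rax are aligned. This only checks alignment; it says nothing about whether the address is mapped.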

ldesnogu commented 8 months ago

Hmm, I might have looked in the wrong place regarding alignment enforcement: it looks like movaps does need aligned data. https://stackoverflow.com/questions/62176908/unaligned-vector-pointers-oddities-avx512

ldesnogu commented 8 months ago

And in the source code:

"vmovaps -0x30(%%rbx),%%zmm5 \n\t vmovaps -0x70(%%rbx),%%zmm21 \n\t"/* wtB[j-1]; load doubles from rcx+[-0x30,-0x28,-0x20,-0x18,-0x10,-0x08, 0, +0x08] - It may not look like it but this is in fact an aligned load */\

We now need to understand why this ends up being unaligned.

And the comment or code is buggy: the code uses rbx and the comment mentions rcx.

ldesnogu commented 8 months ago

(I'm likely misusing this to dump my thoughts, but as we don't have other means of communication...)

I tried on an Intel AVX-512 machine and I could reach the offending code but not from the same point. Dumping the pointers add1/add2/add3, I can confirm the access on my machine is aligned:

diff --git a/src/radix1024_main_carry_loop.h b/src/radix1024_main_carry_loop.h
index 761bd6e..81a0f73 100755
--- a/src/radix1024_main_carry_loop.h
+++ b/src/radix1024_main_carry_loop.h
@@ -266,6 +266,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee
                        add1 = &wt1[col  +ii];  /* Don't use add0 here, to avoid need to reload main-array address */
                        add2 = &wt1[co2-1-ii];
                        add3 = &wt1[co3-1-ii];
+                        printf("%p %p %p\n", add1, add2, add3);

                        // Since use wt1-array in the wtsinit macro, need to fiddle this here:
                        co2 = co3;      // For all data but the first set in each j-block, co2=co3. Thus, after the first block of data is done

Hermann-SW commented 8 months ago

3: x/i $pc
=> 0x5555555f1d8d <cy1024_process_chunk+27501>: vmovaps -0x30(%rbx),%zmm5
(gdb) x/64x 0x555d354b77f8
0x555d354b77f8: Cannot access memory at address 0x555d354b77f8
(gdb) x/64x 0x555d354b77b8
0x555d354b77b8: Cannot access memory at address 0x555d354b77b8
(gdb) 

ldesnogu commented 7 months ago

@Hermann-SW could you please try the patch above with the printf? Also you should not blindly use the address as you did for dumping. First check %rbx contents.

Hermann-SW commented 7 months ago

The core dump cut off the last line of output, so I added fflush():

                        add3 = &wt1[co3-1-ii];
+                       printf("%p %p %p\n", add1, add2, add3);fflush(stdout);

                        // Since use wt1-array in the wtsinit macro, need to fiddle this here:

errout2.zip

hermann@7950x:~/Mlucas/obj$ rm -f err out
hermann@7950x:~/Mlucas/obj$ ./Mlucas -s medium 2>err >out
Segmentation fault (core dumped)
hermann@7950x:~/Mlucas/obj$ zip -c errout2 err out </dev/zero
  adding: err (deflated 60%)
  adding: out (deflated 99%)
Enter comment for err:
Enter comment for out:
hermann@7950x:~/Mlucas/obj$ wc --lines err out
     42 err
 206412 out
 206454 total
hermann@7950x:~/Mlucas/obj$ tail -5 out
0x5bc0b40c6b80 0x5bc0b40cdf70 0x5bc0b40cbf70
0x5bc0b40c6d80 0x5bc0b40cdd70 0x5bc0b40cbd70
0x5bc0b40c6f80 0x5bc0b40cdb70 0x5bc0b40cbb70
0x5bc0b40c7180 0x5bc0b40cd970 0x5bc0b40cb970
0x5bc0b40c7380 0x5bc0b40cd770 0x5bc0b40cb770
hermann@7950x:~/Mlucas/obj$
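
The pointer triples printed by the patched build can be checked for alignment mechanically. The following is a rough sketch under stated assumptions: it uses only the five lines shown by tail above as sample data, and it assumes, purely for illustration, that the -0x30 and -0x70 displacements seen in the disassembly are applied to add2 and add3. The thread does not establish which pointer ends up in rbx. The helper names are hypothetical.

```python
VECTOR_BYTES = 64  # alignment required by a 64-byte vmovaps

# Last five lines of the 'out' file, as shown by 'tail -5 out' in this thread.
SAMPLE = """\
0x5bc0b40c6b80 0x5bc0b40cdf70 0x5bc0b40cbf70
0x5bc0b40c6d80 0x5bc0b40cdd70 0x5bc0b40cbd70
0x5bc0b40c6f80 0x5bc0b40cdb70 0x5bc0b40cbb70
0x5bc0b40c7180 0x5bc0b40cd970 0x5bc0b40cb970
0x5bc0b40c7380 0x5bc0b40cd770 0x5bc0b40cb770
"""


def parse_pointers(text: str) -> list:
    """Return (add1, add2, add3) tuples, skipping lines that are not three hex values."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        try:
            rows.append(tuple(int(field, 16) for field in fields))
        except ValueError:
            continue  # e.g. ordinary program output or a truncated last line
    return rows


def load_is_aligned(base: int, displacement: int) -> bool:
    """True if displacement(base) falls on a 64-byte boundary."""
    return (base + displacement) % VECTOR_BYTES == 0


rows = parse_pointers(SAMPLE)
for add1, add2, add3 in rows:
    print(
        hex(add1), add1 % VECTOR_BYTES == 0,
        load_is_aligned(add2, -0x30), load_is_aligned(add2, -0x70),
        load_is_aligned(add3, -0x30), load_is_aligned(add3, -0x70),
    )
```

To scan the whole file instead of the sample, the contents of out could be passed to parse_pointers. Note that addresses from this run differ from those in the gdb sessions, so they cannot be compared directly with the rbx value in the register dump.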