Open Hermann-SW opened 7 months ago
Thanks for the bug report! I recall Ken also had issues with a newer AMD CPU, but we were not able to fully debug the problem. It looks like you are using the AVX512 build mode with GCC 11.4. Do the "teensy", "tiny", "large" and "huge" self-tests work as expected?
Would you mind running Mlucas with GDB so we can see exactly where it is seg faulting? If you have GDB installed, for the "small" self-test just run:
gdb -args ./Mlucas -s small
Then type r (run) to start Mlucas and bt (backtrace) to show the stack trace after it crashes, and repeat this for the "medium" self-test.
In addition, could you try building it with Clang so we can rule out any compiler differences? If you have Clang installed, just remove the existing obj directory, run:
export CC=clang
and then rerun the makemake.sh script as you did before:
bash makemake.sh use_hwloc
How do I start the 5 self-tests?
hermann@7950x:~/Mlucas/obj$ gdb -args ./Mlucas -s small
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
--Type <RET> for more, q to quit, c to continue without paging--
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./Mlucas...
(gdb) r
Starting program: /home/hermann/Mlucas/obj/Mlucas -s small
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Mlucas 20.1.1
http://www.mersenneforum.org/mayer/README.html
INFO: testing qfloat routines...
System total RAM = 31190, free RAM = 29625
INFO: 29625 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
INFO: Build uses AVX512 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
[Detaching after vfork from child process 2394]
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...
Mlucas selftest running.....
/****************************************************************************/
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
worktodo.txt file not found...using user-supplied command-line exponent p = 5152643
INFO: Maximum recommended exponent for FFT length (256 Kdbl) = 5178863; p[ = 5152643]/pmax_rec = 0.9949371126.
Initial DWT-multipliers chain length = [hiacc] in carry step.
M5152643: using FFT length 256K = 262144 8-byte floats, initial residue shift count = 1159220
This gives an average 19.655773162841797 bits per digit
Using complex FFT radices 256 16 32
[New Thread 0x7ffff2b7a640 (LWP 2395)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7ffff1cb9640 (LWP 2396)]
Using 1 threads in carry step
M5152643 Roundoff warning on iteration 2, maxerr = 0.420149805045
M5152643 Roundoff warning on iteration 6, maxerr = 0.414123535156
M5152643 Roundoff warning on iteration 10, maxerr = 0.406272888184
M5152643 Roundoff warning on iteration 11, maxerr = 0.411499023438
Thread 3 "Mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff1cb9640 (LWP 2396)]
0x000055555572efa7 in cy256_process_chunk (targ=<optimized out>) at ../src/radix256_main_carry_loop.h:472
472 AVX_cmplx_carry_fast_pow2_wtsinit_X16(add1,add2,add3, itmp, half_arr,sign_mask, n_minus_sil,n_minus_silp1,sinwt,sinwtm1, sse_bw,sse_nm1)
(gdb) bt
#0 0x000055555572efa7 in cy256_process_chunk (targ=<optimized out>)
at ../src/radix256_main_carry_loop.h:472
#1 0x00005555558e3062 in worker_thr_routine (data=0x555555a5ea80)
at ../src/threadpool.c:452
#2 0x00007ffff7c94ac3 in start_thread (arg=<optimized out>)
at ./nptl/pthread_create.c:442
#3 0x00007ffff7d26850 in clone3 ()
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)
Thanks for the additional information! That is very helpful.
How do I start the 5 self-tests?
To start the self-tests, just pass each self-test value to the -s option:
./Mlucas -s teensy
./Mlucas -s tiny
./Mlucas -s small
./Mlucas -s medium
./Mlucas -s large
./Mlucas -s huge
If you want to run them in GDB instead, just prefix each command with gdb -args.
I ran all 6. teensy/tiny/large/huge complete (huge took 15 min at around 100% CPU). small dumps core; the bt is in my previous comment.
hermann@7950x:~/Mlucas/obj$ gdb -args ./Mlucas -s medium
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
--Type <RET> for more, q to quit, c to continue without paging--
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./Mlucas...
(gdb) r
Starting program: /home/hermann/Mlucas/obj/Mlucas -s medium
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Mlucas 20.1.1
http://www.mersenneforum.org/mayer/README.html
INFO: testing qfloat routines...
System total RAM = 31190, free RAM = 29308
INFO: 29308 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
INFO: Build uses AVX512 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
[Detaching after vfork from child process 3784]
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...
Mlucas selftest running.....
/****************************************************************************/
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
worktodo.txt file not found...using user-supplied command-line exponent p = 39003229
INFO: Maximum recommended exponent for FFT length (2048 Kdbl) = 39606917; p[ = 39003229]/pmax_rec = 0.9847580159.
Initial DWT-multipliers chain length = [short] in carry step.
M39003229: using FFT length 2048K = 2097152 8-byte floats, initial residue shift count = 31137256
This gives an average 18.598188877105713 bits per digit
Using complex FFT radices 1024 32 32
[New Thread 0x7fffda2f8640 (LWP 3785)]
mers_mod_square: Init threadpool of 1 threads
[New Thread 0x7fffd7eeb640 (LWP 3786)]
Using 1 threads in carry step
Thread 3 "Mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7eeb640 (LWP 3786)]
0x00005555555f1d8d in cy1024_process_chunk (targ=<optimized out>) at ../src/radix1024_main_carry_loop.h:275
275 AVX_cmplx_carry_fast_pow2_wtsinit_X16(add1,add2,add3, itmp, half_arr,sign_mask, n_minus_sil,n_minus_silp1,sinwt,sinwtm1, sse_bw,sse_nm1)
(gdb) bt
#0 0x00005555555f1d8d in cy1024_process_chunk (targ=<optimized out>) at ../src/radix1024_main_carry_loop.h:275
#1 0x00005555558e3062 in worker_thr_routine (data=0x555555a338f0) at ../src/threadpool.c:452
#2 0x00007ffff7c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#3 0x00007ffff7d26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)
Thanks for testing them all. Yes, the larger self-tests do take progressively longer, but 15 minutes is actually quite fast for the huge self-test when run single threaded.
It looks like both the small and medium self-tests are seg faulting in the same inline assembly. @ldesnogu - Since you are our resident inline assembly expert, do you have any insights as to why this is seg faulting on AMD CPUs?
I alas have no access to an AMD machine. One can identify the faulty instruction by running with gdb and doing "disassemble $pc"
hermann@7950x:~/Mlucas/obj$ gdb -args ./Mlucas -s medium
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Thread 3 "Mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7eeb640 (LWP 2248)]
0x00005555555f1d8d in cy1024_process_chunk (targ=<optimized out>) at ../src/radix1024_main_carry_loop.h:275
275 AVX_cmplx_carry_fast_pow2_wtsinit_X16(add1,add2,add3, itmp, half_arr,sign_mask, n_minus_sil,n_minus_silp1,sinwt,sinwtm1, sse_bw,sse_nm1)
(gdb) disassemble $pc
Dump of assembler code for function cy1024_process_chunk:
0x00005555555eb220 <+0>: endbr64
0x00005555555eb224 <+4>: lea 0x8(%rsp),%r10
0x00005555555eb229 <+9>: and $0xffffffffffffffc0,%rsp
0x00005555555eb22d <+13>: push -0x8(%r10)
0x00005555555eb231 <+17>: push %rbp
0x00005555555eb232 <+18>: mov %rsp,%rbp
0x00005555555eb235 <+21>: push %r15
0x00005555555eb237 <+23>: push %r14
0x00005555555eb239 <+25>: push %r13
0x00005555555eb23b <+27>: push %r12
0x00005555555eb23d <+29>: push %r10
0x00005555555eb23f <+31>: push %rbx
0x00005555555eb240 <+32>: sub $0xbc0,%rsp
0x00005555555eb247 <+39>: mov %rdi,-0xb98(%rbp)
0x00005555555eb24e <+46>: mov %fs:0x28,%rax
0x00005555555eb257 <+55>: mov %rax,-0x38(%rbp)
0x00005555555eb25b <+59>: mov 0x3a1eeb(%rip),%eax # 0x55555598d14c <USE_SHORT_CY_CHAIN>
--Type <RET> for more, q to quit, c to continue without paging--
0x00005555555eb261 <+65>: movl $0x10,-0xa98(%rbp)
0x00005555555eb26b <+75>: test %eax,%eax
0x00005555555eb26d <+77>: je 0x5555555eb285 <cy1024_process_chunk+101>
0x00005555555eb26f <+79>: cmp $0x1,%eax
0x00005555555eb272 <+82>: sete %al
0x00005555555eb275 <+85>: movzbl %al,%eax
0x00005555555eb278 <+88>: lea 0x4(,%rax,4),%eax
0x00005555555eb27f <+95>: mov %eax,-0xa98(%rbp)
0x00005555555eb285 <+101>: mov -0xb98(%rbp),%r12
0x00005555555eb28c <+108>: lea 0x353fd5(%rip),%r13 # 0x55555593f268
0x00005555555eb293 <+115>: xor %edx,%edx
0x00005555555eb295 <+117>: mov $0xc91,%edi
0x00005555555eb29a <+122>: mov %r13,%rcx
0x00005555555eb29d <+125>: mov 0x4(%r12),%ebx
0x00005555555eb2a2 <+130>: vmovsd 0x18(%r12),%xmm5
0x00005555555eb2a9 <+137>: vmovsd 0x58(%r12),%xmm0
0x00005555555eb2b0 <+144>: mov 0x24(%r12),%eax
0x00005555555eb2b5 <+149>: mov %ebx,-0xacc(%rbp)
--Type <RET> for more, q to quit, c to continue without paging--
0x00005555555eb2bb <+155>: mov 0x8(%r12),%ebx
0x00005555555eb2c0 <+160>: mov %eax,-0xa04(%rbp)
0x00005555555eb2c6 <+166>: mov %ebx,%esi
0x00005555555eb2c8 <+168>: vmovsd %xmm5,-0xbb0(%rbp)
0x00005555555eb2d0 <+176>: vmovsd 0x50(%r12),%xmm5
0x00005555555eb2d7 <+183>: shl $0xa,%esi
0x00005555555eb2da <+186>: mov %esi,-0xb58(%rbp)
0x00005555555eb2e0 <+192>: mov 0xc(%r12),%esi
0x00005555555eb2e5 <+197>: mov %esi,-0xb64(%rbp)
0x00005555555eb2eb <+203>: mov 0x10(%r12),%esi
0x00005555555eb2f0 <+208>: vmovsd %xmm5,-0xb48(%rbp)
0x00005555555eb2f8 <+216>: mov %esi,-0xa30(%rbp)
0x00005555555eb2fe <+222>: mov 0x20(%r12),%esi
0x00005555555eb303 <+227>: mov %esi,-0xb90(%rbp)
0x00005555555eb309 <+233>: mov 0x28(%r12),%esi
0x00005555555eb30e <+238>: mov %esi,-0xb8c(%rbp)
0x00005555555eb314 <+244>: mov 0x2c(%r12),%esi
0x00005555555eb319 <+249>: mov %esi,-0xae0(%rbp)
--Type <RET> for more, q to quit, c to continue without paging--
0x00005555555eb31f <+255>: mov 0x30(%r12),%esi
0x00005555555eb324 <+260>: mov %esi,-0xb68(%rbp)
0x00005555555eb32a <+266>: mov 0x34(%r12),%esi
0x00005555555eb32f <+271>: mov %esi,-0xb54(%rbp)
0x00005555555eb335 <+277>: mov 0x38(%r12),%esi
0x00005555555eb33a <+282>: mov %esi,-0xb28(%rbp)
0x00005555555eb340 <+288>: mov 0x3c(%r12),%esi
0x00005555555eb345 <+293>: mov %esi,-0xb74(%rbp)
0x00005555555eb34b <+299>: mov 0x40(%r12),%esi
0x00005555555eb350 <+304>: mov %esi,-0xaa0(%rbp)
0x00005555555eb356 <+310>: vmovsd %xmm0,-0x900(%rbp)
0x00005555555eb35e <+318>: mov 0x78(%r12),%r15
0x00005555555eb363 <+323>: mov 0x80(%r12),%r14
0x00005555555eb36b <+331>: vmovsd 0x347db5(%rip),%xmm5 # 0x555555933128
0x00005555555eb373 <+339>: mov 0x60(%r12),%rsi
0x00005555555eb378 <+344>: vmovq 0x343b80(%rip),%xmm1 # 0x55555592ef00
0x00005555555eb380 <+352>: vmovsd (%r15),%xmm0
0x00005555555eb385 <+357>: vfmadd132sd (%r14),%xmm5,%xmm0
--Type <RET> for more, q to quit, c to continue without paging--
0x00005555555eb38a <+362>: vmovsd 0x343ade(%rip),%xmm5 # 0x55555592ee70
0x00005555555eb392 <+370>: mov %rsi,-0xa28(%rbp)
0x00005555555eb399 <+377>: mov 0x68(%r12),%rsi
0x00005555555eb39e <+382>: mov %rsi,-0xb80(%rbp)
0x00005555555eb3a5 <+389>: mov 0x70(%r12),%rsi
0x00005555555eb3aa <+394>: mov %rsi,-0xac0(%rbp)
0x00005555555eb3b1 <+401>: lea 0x357f08(%rip),%rsi # 0x5555559432c0
0x00005555555eb3b8 <+408>: vandpd %xmm1,%xmm0,%xmm0
0x00005555555eb3bc <+412>: vcomisd %xmm0,%xmm5
0x00005555555eb3c0 <+416>: seta %dl
0x00005555555eb3c3 <+419>: call 0x5555559228a0 <ASSERT>
0x00005555555eb3c8 <+424>: vmovsd 0x8(%r15),%xmm0
0x00005555555eb3ce <+430>: xor %edx,%edx
0x00005555555eb3d0 <+432>: mov %r13,%rcx
0x00005555555eb3d3 <+435>: vmovsd 0x347d4d(%rip),%xmm5 # 0x555555933128
0x00005555555eb3db <+443>: vmovq 0x343b1d(%rip),%xmm1 # 0x55555592ef00
0x00005555555eb3e3 <+451>: lea 0x357ed6(%rip),%rsi # 0x5555559432c0
0x00005555555eb3ea <+458>: mov $0xc92,%edi
--Type <RET> for more, q to quit, c to continue without paging--
0x00005555555eb3ef <+463>: vfmadd132sd 0x8(%r14),%xmm5,%xmm0
0x00005555555eb3f5 <+469>: vmovsd 0x343a73(%rip),%xmm5 # 0x55555592ee70
0x00005555555eb3fd <+477>: vandpd %xmm1,%xmm0,%xmm0
0x00005555555eb401 <+481>: vcomisd %xmm0,%xmm5
0x00005555555eb405 <+485>: seta %dl
0x00005555555eb408 <+488>: call 0x5555559228a0 <ASSERT>
0x00005555555eb40d <+493>: mov 0x88(%r12),%rsi
0x00005555555eb415 <+501>: lea (%rbx,%rbx,1),%edx
0x00005555555eb418 <+504>: mov %r12,-0xb98(%rbp)
0x00005555555eb41f <+511>: mov 0x3ff12f(%rip),%eax # 0x5555559ea554 <DAT_BITS>
0x00005555555eb425 <+517>: mov 0x3ff125(%rip),%ecx # 0x5555559ea550 <PAD_BITS>
0x00005555555eb42b <+523>: mov %rsi,-0xb88(%rbp)
0x00005555555eb432 <+530>: mov 0x90(%r12),%rsi
0x00005555555eb43a <+538>: shrx %eax,%edx,%edi
0x00005555555eb43f <+543>: shlx %ecx,%edi,%edi
0x00005555555eb444 <+548>: add %edx,%edi
0x00005555555eb446 <+550>: mov %rsi,-0xaa8(%rbp)
0x00005555555eb44d <+557>: mov 0x98(%r12),%rsi
--Type <RET> for more, q to quit, c to continue without paging--
I'm afraid you didn't scroll enough. The place where the segfault occurs would start with '=>'
For instance:
0x000055555563dbcd <+381>: mov %ebx,-0x40(%rsp)
0x000055555563dbd1 <+385>: lea (%r11,%rdx,8),%rdi
=> 0x000055555563dbd5 <+389>: movsd (%r8),%xmm0
0x000055555563dbda <+394>: movslq %ebx,%rdx
0x000055555563dbdd <+397>: mov -0x74(%rsp),%ebx
0x000055555563dbe1 <+401>: movsd (%rdi),%xmm1
And when typing that comment, I noticed the '=>' disappeared from the quote... Anyway you need to dump until 0x00005555555f1d8d.
I found no "=>", but learned how to query the (big) offset:
Thread 3 "Mlucas" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7eeb640 (LWP 2444)]
0x00005555555f1d8d in cy1024_process_chunk (targ=<optimized out>) at ../src/radix1024_main_carry_loop.h:275
275 AVX_cmplx_carry_fast_pow2_wtsinit_X16(add1,add2,add3, itmp, half_arr,sign_mask, n_minus_sil,n_minus_silp1,sinwt,sinwtm1, sse_bw,sse_nm1)
(gdb) display/i $ps
1: x/i $ps
0x10286: <error: Cannot access memory at address 0x10286>
(gdb) display/i $pc
2: x/i $pc
=> 0x5555555f1d8d <cy1024_process_chunk+27501>: vmovaps -0x30(%rbx),%zmm5
(gdb)
...
0x00005555555f1d79 <+27481>: mov -0x9c8(%rbp),%rbx
0x00005555555f1d80 <+27488>: vmovaps (%rax),%zmm4
0x00005555555f1d86 <+27494>: vmovaps 0x40(%rax),%zmm20
=> 0x00005555555f1d8d <+27501>: vmovaps -0x30(%rbx),%zmm5
0x00005555555f1d97 <+27511>: vmovaps -0x70(%rbx),%zmm21
0x00005555555f1da1 <+27521>: vpermq %zmm5,%zmm3,%zmm5
0x00005555555f1da7 <+27527>: vpermq %zmm21,%zmm3,%zmm21
0x00005555555f1dad <+27533>: kshiftrw $0x8,%k1,%k3
0x00005555555f1db3 <+27539>: kshiftrw $0x8,%k2,%k4
0x00005555555f1db9 <+27545>: vbroadcastsd 0x1000(%rdi),%zmm17
0x00005555555f1dc3 <+27555>: vbroadcastsd 0x1008(%rdi),%zmm18
0x00005555555f1dcd <+27565>: vaddpd %zmm30,%zmm30,%zmm8{%k1}
0x00005555555f1dd3 <+27571>: vaddpd %zmm30,%zmm30,%zmm24{%k3}
0x00005555555f1dd9 <+27577>: vaddpd %zmm31,%zmm31,%zmm9{%k2}
0x00005555555f1ddf <+27583>: vaddpd %zmm31,%zmm31,%zmm25{%k4}
0x00005555555f1de5 <+27589>: vmulpd %zmm4,%zmm17,%zmm1
0x00005555555f1deb <+27595>: vmulpd %zmm20,%zmm17,%zmm17
0x00005555555f1df1 <+27601>: vmulpd %zmm5,%zmm18,%zmm2
0x00005555555f1df7 <+27607>: vmulpd %zmm21,%zmm18,%zmm18
0x00005555555f1dfd <+27613>: vmulpd %zmm8,%zmm1,%zmm1
0x00005555555f1e03 <+27619>: vmulpd %zmm24,%zmm17,%zmm17
0x00005555555f1e09 <+27625>: vmulpd %zmm9,%zmm2,%zmm2
0x00005555555f1e0f <+27631>: vmulpd %zmm25,%zmm18,%zmm18
0x00005555555f1e15 <+27637>: vmovaps %zmm1,(%rdi)
0x00005555555f1e1b <+27643>: vmovaps %zmm17,0x40(%rdi)
0x00005555555f1e22 <+27650>: vmovaps %zmm2,0x80(%rdi)
0x00005555555f1e29 <+27657>: vmovaps %zmm18,0xc0(%rdi)
0x00005555555f1e30 <+27664>: mov -0x918(%rbp),%rax
0x00005555555f1e37 <+27671>: mov -0x908(%rbp),%rbx
0x00005555555f1e3e <+27678>: vmovaps (%rax),%zmm6
0x00005555555f1e44 <+27684>: vmovaps (%rbx),%zmm7
0x00005555555f1e4a <+27690>: vpaddd %zmm6,%zmm0,%zmm0
0x00005555555f1e50 <+27696>: vpandd %zmm7,%zmm0,%zmm0
0x00005555555f1e56 <+27702>: vmovaps %zmm30,%zmm8
0x00005555555f1e5c <+27708>: vmovaps %zmm30,%zmm24
0x00005555555f1e62 <+27714>: vmovaps %zmm31,%zmm9
0x00005555555f1e68 <+27720>: vmovaps %zmm31,%zmm25
0x00005555555f1e6e <+27726>: mov -0x9f0(%rbp),%rcx
--Type <RET> for more, q to quit, c to continue without paging--
That's indeed much easier :-)
Now I'd like to see the register contents: 'i r'. I find it odd that a non-multiple of 64 (the vector length in bytes) is used as the offset. I'm not familiar enough with x86, but it's possible this is an unaligned access causing a fault.
3: x/i $pc
=> 0x5555555f1d8d <cy1024_process_chunk+27501>: vmovaps -0x30(%rbx),%zmm5
(gdb) i r
rax 0x5555559eef80 93824997060480
rbx 0x555d354b77f8 93858814457848
rcx 0x7fffd76e6380 140736807723904
rdx 0x7fffd76e6400 140736807724032
rsi 0x1020304050607 283686952306183
rdi 0x7fffd76961c0 140736807395776
rbp 0x7fffd7eead30 0x7fffd7eead30
rsp 0x7fffd7eea140 0x7fffd7eea140
r8 0xc 12
r9 0x80 128
r10 0xf3e 3902
r11 0xc0 192
r12 0xc 12
r13 0x7fffdc08f080 140736884961408
r14 0x7fffdc1f2280 140736886416000
r15 0x555555a00d80 93824997133696
rip 0x5555555f1d8d 0x5555555f1d8d <cy1024_process_chunk+27501>
eflags 0x10286 [ PF SF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
k0 0x4101d475e22355b8 4684258690212255160
k1 0x0 0
k2 0x0 0
k3 0x0 0
k4 0x41038e61a6ad0108 4684744587454775560
k5 0x0 0
k6 0x0 0
k7 0x0 0
(gdb)
The address is definitely not aligned then. But according to Intel documentation this should cause no issue.
Can you please try to dump memory around 0x555d354b77f8? You can play with the 'x/100x address' command. For instance 'x/64x 0x555d354b77f8' and 'x/64x 0x555d354b77b8'
Hmm, I might have looked in the wrong place regarding alignment enforcement: it looks like movaps needs aligned data. https://stackoverflow.com/questions/62176908/unaligned-vector-pointers-oddities-avx512
And in the source code:
"vmovaps -0x30(%%rbx),%%zmm5 \n\t vmovaps -0x70(%%rbx),%%zmm21 \n\t"/* wtB[j-1]; load doubles from rcx+[-0x30,-0x28,-0x20,-0x18,-0x10,-0x08, 0, +0x08] - It may not look like it but this is in fact an aligned load */\
We now need to understand why this ends up being unaligned.
And the comment or code is buggy: the code uses rbx and the comment mentions rcx.
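To make the alignment question concrete, here is a minimal C sketch (not Mlucas code; the helper names are made up for illustration). It computes the effective address of a base-plus-displacement operand such as -0x30(%rbx) and how far that address lies past the previous 64-byte boundary. A 64-byte vmovaps needs that distance to be 0, so for both the -0x30 and the -0x70 loads to be aligned, the pointer in rbx has to sit exactly 0x30 bytes past a 64-byte boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of a base+displacement operand, e.g. -0x30(%rbx).
 * Unsigned wrap-around makes a negative displacement a subtraction. */
static uint64_t effective_address(uint64_t base, int64_t disp)
{
    return base + (uint64_t)disp;
}

/* Bytes by which addr lies past the previous multiple of `alignment`.
 * 0 means the address is aligned. `alignment` must be a power of two. */
static uint64_t misalignment(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & (alignment - 1);
}
```

Plugging in rbx = 0x555d354b77f8 from the register dump, effective_address(rbx, -0x30) is 0x555d354b77c8, and misalignment(..., 64) is 8. The -0x70 load is off by the same 8 bytes, so neither load satisfies the 64-byte requirement with this rbx value.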
(I'm likely misusing this to dump my thoughts, but as we don't have other means of communication...)
I tried on an Intel AVX-512 machine and I could reach the offending code, but not from the same point. Dumping the pointers add1/add2/add3, I can confirm the access on my machine is aligned.
diff --git a/src/radix1024_main_carry_loop.h b/src/radix1024_main_carry_loop.h
index 761bd6e..81a0f73 100755
--- a/src/radix1024_main_carry_loop.h
+++ b/src/radix1024_main_carry_loop.h
@@ -266,6 +266,7 @@ normally be getting dispatched to [radix] separate blocks of the A-array, we nee
add1 = &wt1[col +ii]; /* Don't use add0 here, to avoid need to reload main-array address */
add2 = &wt1[co2-1-ii];
add3 = &wt1[co3-1-ii];
+ printf("%p %p %p\n", add1, add2, add3);
// Since use wt1-array in the wtsinit macro, need to fiddle this here:
co2 = co3; // For all data but the first set in each j-block, co2=co3. Thus, after the first block of data is done
3: x/i $pc
=> 0x5555555f1d8d <cy1024_process_chunk+27501>: vmovaps -0x30(%rbx),%zmm5
(gdb) x/64x 0x555d354b77f8
0x555d354b77f8: Cannot access memory at address 0x555d354b77f8
(gdb) x/64x 0x555d354b77b8
0x555d354b77b8: Cannot access memory at address 0x555d354b77b8
(gdb)
@Hermann-SW could you please try the patch above with the printf? Also you should not blindly use the address as you did for dumping. First check %rbx contents.
The core dump broke the last line of output, so I added fflush():
add3 = &wt1[co3-1-ii];
+ printf("%p %p %p\n", add1, add2, add3);fflush(stdout);
// Since use wt1-array in the wtsinit macro, need to fiddle this here:
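The output was cut off because stdout is normally fully buffered when it is redirected to a file, and a SIGSEGV ends the process without flushing that buffer. As a rough sketch of two ways to avoid losing debug output (the helper names are hypothetical and not part of Mlucas): either switch stdout to unbuffered mode once at startup, or flush after every debug line as done in the patch above.

```c
#include <assert.h>
#include <stdio.h>

/* Option 1: make stdout unbuffered so each printf is written immediately.
 * Should be called before any other output on stdout. Returns 0 on success. */
static int make_stdout_unbuffered(void)
{
    return setvbuf(stdout, NULL, _IONBF, 0);
}

/* Option 2: print the three pointers and flush right away.
 * Taking void pointers also gives %p the argument type it expects.
 * Returns the number of characters written, or a negative value on error. */
static int print_ptrs(FILE *out, const void *add1, const void *add2, const void *add3)
{
    assert(out != NULL);
    int n = fprintf(out, "%p %p %p\n", add1, add2, add3);
    fflush(out);
    return n;
}
```

Writing the debug line to stderr instead would have a similar effect, since stderr is not fully buffered by default.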
hermann@7950x:~/Mlucas/obj$ rm -f err out
hermann@7950x:~/Mlucas/obj$ ./Mlucas -s medium 2>err >out
Segmentation fault (core dumped)
hermann@7950x:~/Mlucas/obj$ zip -c errout2 err out </dev/zero
adding: err (deflated 60%)
adding: out (deflated 99%)
Enter comment for err:
Enter comment for out:
hermann@7950x:~/Mlucas/obj$ wc --lines err out
42 err
206412 out
206454 total
hermann@7950x:~/Mlucas/obj$ tail -5 out
0x5bc0b40c6b80 0x5bc0b40cdf70 0x5bc0b40cbf70
0x5bc0b40c6d80 0x5bc0b40cdd70 0x5bc0b40cbd70
0x5bc0b40c6f80 0x5bc0b40cdb70 0x5bc0b40cbb70
0x5bc0b40c7180 0x5bc0b40cd970 0x5bc0b40cb970
0x5bc0b40c7380 0x5bc0b40cd770 0x5bc0b40cb770
hermann@7950x:~/Mlucas/obj$
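With more than 200,000 lines in out, checking the pointers by eye is impractical. The sketch below is one possible way to scan such a file for the first triple that would make the load unaligned. It assumes that both add2 and add3 are read at offset -0x30, as in the vmovaps quoted earlier; that mapping is an assumption, and the function names are invented for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Parse one "%p %p %p" line as printed by the debug patch.
 * Returns 1 if three hexadecimal values were read, 0 otherwise. */
static int parse_triple(const char *line, uint64_t addr[3])
{
    assert(line != NULL && addr != NULL);
    return sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNx64,
                  &addr[0], &addr[1], &addr[2]) == 3;
}

/* Returns 1 if a 64-byte load from (ptr - back_offset) is 64-byte aligned. */
static int load_is_aligned(uint64_t ptr, uint64_t back_offset)
{
    return ((ptr - back_offset) & 63u) == 0;
}

/* Returns the 1-based number of the first line whose add2 or add3 would
 * give an unaligned -0x30 load, or 0 if no such line is found.
 * Lines that do not parse as three pointers are skipped. */
static long first_unaligned_line(FILE *in)
{
    char line[256];
    uint64_t a[3];
    long n = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        n++;
        if (!parse_triple(line, a))
            continue;
        if (!load_is_aligned(a[1], 0x30) || !load_is_aligned(a[2], 0x30))
            return n;
    }
    return 0;
}
```

The five lines shown by tail above all pass this check (each add2 and add3 ends in 0x70, and 0x70 - 0x30 = 0x40), whereas the rbx value from the earlier gdb session, 0x555d354b77f8, would not. Calling first_unaligned_line(stdin) from a small main and feeding it the out file would report the first suspicious line, if there is one.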
I was asked to run "./Mlucas -s tiny" here: https://github.com/primesearch/Mlucas/issues/15#issuecomment-2027020472
Just wanted to create this issue that small and medium dump core.
and