Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (for cpp512z with FPTYPE=f only: fix it with 'volatile')

valassi commented 1 month ago

While rerunning tests in PR #841 I came across a new FPE "Floating-point exception - erroneous arithmetic operation" in gqttq tmad tests.

This is very surprising because I think that there is actually no change in the code (just some makefile changes leading to file name changes). I will try to rerun the test.

Anyway, for reference the issue is here in tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt

...
*** (2-512z) EXECUTE MADEVENT_CPP x10 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
81920 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp'

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f2a1a623860 in ???
#1  0x7f2a1a622a05 in ???
#2  0x7f2a1a254def in ???
#3  0x7f2a1ae20acc in ???
#4  0x7f2a1acc4575 in ???
#5  0x7f2a1ae1d4c9 in ???
#6  0x7f2a1ae2570d in ???
#7  0x7f2a1ae2afa1 in ???
#8  0x43008b in ???
#9  0x431c10 in ???
#10  0x432d47 in ???
#11  0x433b1e in ???
#12  0x44a921 in ???
#13  0x42ebbf in ???
#14  0x40371e in ???
#15  0x7f2a1a23feaf in ???
#16  0x7f2a1a23ff5f in ???
#17  0x403844 in ???
#18  0xffffffffffffffff in ???
./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed
 PDF set = nn23lo1
 alpha_s(Mz)= 0.1300 running at 2 loops.
 alpha_s(Mz)= 0.1300 running at 2 loops.
 Renormalization scale set on event-by-event basis
 Factorization   scale set on event-by-event basis

 getting user params
Enter number of events and max and min iterations: 
 Number of events and iterations        81920           1           1

valassi commented 1 month ago

Very strange. I have rerun the test and the FPE has disappeared. Closing as not reproducible.

valassi commented 1 month ago

Note: similar issues have resurfaced in susy_gg_t1t1, being debugged in #826

valassi commented 1 month ago

I have found this SIGFPE in gqttq for FPTYPE=f again, while running the code with Olivier's patch #850 for the susy xsec mismatch #825.

My impression (or hope) is that this is the same issue as #855, i.e. a SIGFPE in fortran aloha_functions.f that can be fixed with volatile (see PR #857). I will test that too.

valassi commented 1 month ago

I have debugged this further.

First, this is intermittent: when rerunning the same executable multiple times, it sometimes succeeds and sometimes fails (maybe half of the time, maybe less).

Second, this is NOT RELATED to the other SIGFPE #855 in rotxxx. So it will not be fixed by #857.

This crash happens deep inside cudacpp, within the random color selection in sigmaKin. I managed to get a gdb trace after rebuilding everything with -g. This is at commit 19a2e0cb85e2306affdb143945026623e421b588 from PR #860

cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z 
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp

and

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu> gdb ./madevent_cpp 
...
(gdb) run < /tmp/avalassi/input_gqttq_x1_cudacpp
...
Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98db1 in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, 
    allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, 
    allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1193
1193                if( okcol )
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) p okcol
$1 = <optimized out>
(gdb) p allrndcol
$3 = (const mgOnGpu::fptype *) 0x6300d00
(gdb) p ievt
$4 = <optimized out>
(gdb) p ieppV
$5 = <optimized out>
(gdb) p targetamp
$6 = {{3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
    3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
    3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
    3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}}
(gdb) p neppV
$7 = 16
(gdb) p icolC
$8 = <optimized out>
(gdb) p ncolor
$9 = 4
(gdb) w
Missing arguments.
(gdb) l
1188    #if defined MGONGPU_CPPSIMD
1189                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1190    #else
1191                const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1192    #endif
1193                if( okcol )
1194                {
1195                  allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
1196                  break;
1197                }

I suspect that this is instead related to the iconfig-channel mapping issues that @oliviermattelaer investigated in #852?

Anyway, keep this open.

To use the debugger, I added these patches

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu> git diff  --no-ext-diff
diff --git a/epochX/cudacpp/gq_ttq.mad/Source/make_opts b/epochX/cudacpp/gq_ttq.mad/Source/make_opts
index e4b87ee6a..6ccc273c1 100644
--- a/epochX/cudacpp/gq_ttq.mad/Source/make_opts
+++ b/epochX/cudacpp/gq_ttq.mad/Source/make_opts
@@ -1,7 +1,7 @@
 DEFAULT_CPP_COMPILER=g++
 DEFAULT_F2PY_COMPILER=f2py3
 DEFAULT_F_COMPILER=gfortran
-GLOBAL_FLAG=-O3 -ffast-math -fbounds-check
+GLOBAL_FLAG=-g -O3 -ffast-math -fbounds-check
 MACFLAG=
 MG5AMC_VERSION=SpecifiedByMG5aMCAtRunTime
 PYTHIA8_PATH=NotInstalled
diff --git a/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk b/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
index 89da34009..b8fa4e131 100644
--- a/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
+++ b/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
@@ -387,6 +387,10 @@ else
   ###override OMPFLAGS = # disable OpenMP MT on all other platforms (default before #575)
 endif

+# Debug SIGFPE crash #845
+override OMPFLAGS=
+override OPTFLAGS=-g -O3
+
 #-------------------------------------------------------------------------------

 #=== Configure defaults and check if user-defined choices exist for RNDGEN (legacy!), HASCURAND, HASHIPRAND

valassi commented 1 month ago

I retried exactly the same recipe on https://github.com/madgraph5/madgraph4gpu/commit/19a2e0cb85e2306affdb143945026623e421b588

It now seems to crash at line 1189 instead of 1193, but it looks like the same issue

cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z 
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp
...
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f7a7cc23860 in ???
#1  0x7f7a7cc22a05 in ???
#2  0x7f7a7c854def in ???
#3  0x7f7a7d2f0d6f in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1189
#4  0x7f7a7d2f7a3d in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
#5  0x7f7a7d2fd2d1 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
#6  0x7f7a7d2fd2d1 in fbridgesequence_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
#7  0x43008b in smatrix1_multi_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
#8  0x431c10 in dsig1_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
#9  0x432d47 in dsigproc_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
#10  0x433b1e in dsig_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
#11  0x44a921 in sample_full_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
#12  0x42ebbf in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
#13  0x40371e in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
Floating point exception (core dumped)

and through gdb

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98d6f in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, 
    allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, 
    allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1189
1189                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  0x00007ffff7f98d6f in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, 
    allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, 
    allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1189
#1  0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1)
    at MatrixElementKernels.cc:115
#2  0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cb50, selhel=0x7fffffc2cb50, 
    mes=0x7fffffc3cb50, channelId=1, rndcol=0x7fffffc9ceb0, rndhel=0x7fffffcbceb0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, 
    this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173
#3  fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbceb0, 
    rndcol=0x7fffffc9ceb0, pchannelId=<optimized out>, mes=0x7fffffc3cb50, selhel=0x7fffffc2cb50, selcol=0x7fffffc1cb50) at fbridge.cc:106
#4  0x000000000043008c in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, 
    hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, 
    out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., 
    vecsize_used=16384) at auto_dsig1.f:618
#5  0x0000000000431c11 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445
#6  0x0000000000432d48 in dsigproc_vec (all_p=..., 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, 
    symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034
#7  0x0000000000433b1f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, 
    all_out=..., vecsize_used=16384) at auto_dsig.f:327
#8  0x000000000044a922 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d10 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384)
    at dsample.f:208
#9  0x000000000042ebc0 in driver () at driver.f:256
#10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:301
#11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#13 0x0000000000403845 in _start ()
(gdb) l
1184              const int ievt = ievt00 + ieppV;
1185              //printf( "sigmaKin: ievt=%4d rndcol=%f\n", ievt, allrndcol[ievt] );
1186              for( int icolC = 0; icolC < ncolor; icolC++ )
1187              {
1188    #if defined MGONGPU_CPPSIMD
1189                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1190    #else
1191                const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1192    #endif
1193                if( okcol )
(gdb) p okcol
$1 = <optimized out>
(gdb) p ievt
$2 = <optimized out>
(gdb) p ieppV
$3 = <optimized out>
(gdb) p neppV
$4 = 16
(gdb) p icolC
$5 = <optimized out>
(gdb) p ncolor
$6 = 4
(gdb) p allrndcol
$7 = (const mgOnGpu::fptype *) 0x6300d00
(gdb) p targetamp
$8 = {{3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05, 
    0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485, 
    5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05, 0.00301690097, 
    9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485, 5.8289319e-05, 
    0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05, 0.00301690097, 9.26938374e-05, 
    0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485, 5.8289319e-05, 0.00402065413, 
    0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}}
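
Every targetamp entry printed above is non-zero, so it is not obvious which operand would make the division on line 1189 fault. A minimal sketch of one way to check which IEEE exception flag a given scalar division raises, using only the standard <cfenv> calls. The helper name raisedFlags is hypothetical and not part of cudacpp, and the sketch assumes that floating-point traps are NOT enabled in the process running it (otherwise the division itself would raise SIGFPE).

```cpp
#include <cassert>
#include <cfenv>
#include <string>

// Hypothetical helper (not part of cudacpp): perform num/den at run time
// and report which IEEE exception flags the division raised.
// Assumption: floating-point traps are disabled in the calling process.
std::string raisedFlags( float num, float den )
{
  // volatile operands force a real run-time division (no constant folding)
  volatile float vnum = num;
  volatile float vden = den;
  // Clear all sticky flags so that only this division is observed
  const int rc = std::feclearexcept( FE_ALL_EXCEPT );
  assert( rc == 0 ); // feclearexcept returns 0 on success
  (void)rc;
  // Storing into a volatile keeps the division between the two cfenv calls
  volatile float quot = vnum / vden;
  (void)quot;
  std::string out;
  if( std::fetestexcept( FE_DIVBYZERO ) ) out += "DIVBYZERO ";
  if( std::fetestexcept( FE_INVALID ) ) out += "INVALID ";
  if( std::fetestexcept( FE_OVERFLOW ) ) out += "OVERFLOW ";
  return out;
}
```

For example, raisedFlags( 1.f, 0.f ) reports DIVBYZERO and raisedFlags( 0.f, 0.f ) reports INVALID, while raisedFlags( 1.f, 2.f ) reports nothing.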

valassi commented 1 week ago

I changed the name to indicate that this crash is most likely related to iconfig-channel mapping issues.

I will instead remove "iconfig-channel mapping issues" from the name of #855, which is ONLY about the rotxxx crash, most likely unrelated to iconfig-channel mapping issues.

valassi commented 1 week ago

Note:

I can still reproduce this sort of intermittent crash in sigmaKin even after fixing rotxxx
It seems too erratic and random to be put in the CI: it is not always the second execution that fails, it really is very random (and rare). The error may randomly show up in the CI too, but I would say there is no way to force it.
I investigated this code through valgrind. See https://github.com/madgraph5/madgraph4gpu/issues/868#issuecomment-2195066002. Initially I thought I had invalid reads/writes, but these disappear when using a max stack trace. Eventually I got NO ERRORS FROM VALGRIND. So valgrind does not seem useful for investigating this specific issue.

valassi commented 1 week ago

I have almost completed MR #873 which fixes the channelid-iconfig mapping and icolamp issues in #856.

Unfortunately, however, this does NOT fix this intermittent crash #845.

I have reproduced it again

cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z 
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp

Using '-g' in make_opts and cudacpp.mk, this sometimes crashes as follows

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f04f8a23860 in ???
#1  0x7f04f8a22a05 in ???
#2  0x7f04f8654def in ???
#3  0x7f04f91f200c in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1193
#4  0x7f04f9096575 in ???
#5  0x7f04f91eec89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1095
#6  0x7f04f91f8bfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
#7  0x7f04f91fe491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
#8  0x7f04f91fe491 in fbridgesequence_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
#9  0x4300eb in smatrix1_multi_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
#10  0x431c70 in dsig1_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
#11  0x432da7 in dsigproc_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
#12  0x433b7e in dsig_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
#13  0x44a9c1 in sample_full_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
#14  0x42ebdf in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:257
#15  0x40371e in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:302
Floating point exception (core dumped)

valassi commented 1 week ago

I have changed the name (previously "Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (in sigmakin random color selection - iconfig-channel mapping issues?)") because I no longer see a connection to color selection...

valassi commented 1 week ago

I have renamed this issue to mention "(for cpp512z with FPTYPE=f only: fix it with 'volatile')"

Indeed, I checked that this only happens for cpp512z with FPTYPE=f. So it clearly looks like a SIMD-specific optimization issue, like those that I fixed with 'volatile' in many other parts of the code. And indeed I just created a patch that fixes the issue

--- a/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc
+++ b/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc
@@ -1190,7 +1190,8 @@ namespace mg5amcCpu
           for( int icolC = 0; icolC < ncolor; icolC++ )
           {
 #if defined MGONGPU_CPPSIMD
-            const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
+            // Add volatile here to avoid SIGFPE crashes in FPTYPE=f cpp512z builds (#845)
+            volatile const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
 #else
             const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
 #endif

By the way, note this interesting post on SIMD and float: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993

This is now fixed in CODEGEN in #874. I think this can be closed when that PR is merged.
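
For reference, the shape of the patched loop can be sketched in a self-contained scalar form. This is only an illustration of the pattern, not the cudacpp code: the function name selectColor and its signature are hypothetical, targetamp is assumed to hold cumulative weights (so that dividing by the last entry normalises them), and a plain scalar build like this one is not expected to reproduce the cpp512z crash. Why 'volatile' avoids the SIGFPE is not established here; the sketch only shows where the qualifier goes.

```cpp
#include <cassert>

// Hypothetical scalar stand-in for the color selection loop in sigmaKin.
// Assumption: targetamp holds ncolor cumulative weights (last entry = total).
// rnd is a random number in [0,1).
// Returns the selected color in Fortran numbering [1,ncolor], or 0 if none.
int selectColor( const float* targetamp, int ncolor, float rnd )
{
  assert( ncolor > 0 );
  for( int icolC = 0; icolC < ncolor; icolC++ )
  {
    // 'volatile' makes the store to okcol and its re-read observable accesses,
    // which limits how the compiler may transform the comparison and the branch
    volatile const bool okcol = rnd < ( targetamp[icolC] / targetamp[ncolor - 1] );
    if( okcol ) return icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
  }
  return 0; // no color selected (rnd not below the normalised total)
}
```

With cumulative weights { 0.1, 0.3, 0.6, 1.0 }, a random number of 0.2 selects color 2 and a random number of 0.95 selects color 4.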
