Closed valassi closed 6 days ago
Very strange. I have rerun the test and the FPE has disappeared. Closing as not reproducible.
Note: similar issues have resurfaced in susy_gg_t1t1, being debugged in #826
I have found this SIGFPE again in gqttq for FPTYPE=f, while running the code with Olivier's patch #850 for the susy xsec mismatch #825.
My impression (or hope) is that this is the same issue as #855, i.e. a SIGFPE in fortran aloha_functions.f that can be fixed with volatile (see PR #857). I will test that too.
I have debugged this further.
First, this is intermittent. Rerunning the same executable multiple times, the code sometimes succeeds and sometimes fails. It fails maybe half of the time, maybe less.
Second, this is NOT RELATED to the other SIGFPE #855 in rotxxx. So it will not be fixed by #857.
This crash happens deep inside cudacpp, within the color selection. I managed to create a gdb trace after rebuilding everything with -g. This is at commit 19a2e0cb85e2306affdb143945026623e421b588 from PR #860
cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp
and
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu> gdb ./madevent_cpp
...
(gdb) run < /tmp/avalassi/input_gqttq_x1_cudacpp
...
Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98db1 in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>,
allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080,
allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1193
1193 if( okcol )
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) p okcol
$1 = <optimized out>
(gdb) p allrndcol
$3 = (const mgOnGpu::fptype *) 0x6300d00
(gdb) p ievt
$4 = <optimized out>
(gdb) p ieppV
$5 = <optimized out>
(gdb) p targetamp
$6 = {{3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475,
0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475,
0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475,
0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475,
0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}}
(gdb) p neppV
$7 = 16
(gdb) p icolC
$8 = <optimized out>
(gdb) p ncolor
$9 = 4
(gdb) w
Missing arguments.
(gdb) l
1188 #if defined MGONGPU_CPPSIMD
1189 const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1190 #else
1191 const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1192 #endif
1193 if( okcol )
1194 {
1195 allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
1196 break;
1197 }
I suspect that this is related instead to the iconfig-channel mapping issues that @oliviermattelaer investigated in #852 ?
Anyway, keep this open.
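A side note on the values printed by 'p targetamp' above: all 16 lanes are nonzero and the four color rows are identical, so the division on line 1189, as written, does not divide by zero for these values. The sketch below is one way to check this outside the debugger. It assumes nothing about cudacpp itself: `divisionRaisesFpe` and `anyLaneRaisesFpe` are hypothetical helpers, and they only inspect the sticky exception flags from `<cfenv>` rather than enabling traps (a SIGFPE from floating-point arithmetic is only delivered when trapping has been enabled).

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper (not part of cudacpp): returns true if the scalar float
// division num/den raises FE_DIVBYZERO, FE_INVALID or FE_OVERFLOW.
// It only reads the sticky exception flags; it does not enable traps.
inline bool divisionRaisesFpe( float num, float den )
{
  // volatile forces the division to happen at run time (no constant folding)
  volatile float vnum = num;
  volatile float vden = den;
  std::feclearexcept( FE_ALL_EXCEPT );
  volatile float ratio = vnum / vden;
  (void)ratio;
  return std::fetestexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW ) != 0;
}

// Lane values copied from the 'p targetamp' gdb output
// (the four color rows are identical, so numerator == denominator in each lane)
inline bool anyLaneRaisesFpe()
{
  const float lanes[16] = {
    3.6287187e-05f, 0.00301690097f, 9.26938374e-05f, 0.000283826666f,
    0.000752002583f, 0.000340251776f, 1.23037262e-06f, 0.00068957475f,
    0.000313476485f, 5.8289319e-05f, 0.00402065413f, 0.00385404564f,
    3.3489403e-05f, 0.0149808694f, 0.000554771861f, 0.00240404671f };
  for( float v : lanes )
    if( divisionRaisesFpe( v, v ) ) return true;
  return false;
}
```

If `anyLaneRaisesFpe()` returns false, the exception is unlikely to come from these operands in a plain scalar division, which would point at what the optimizer does with the SIMD code rather than at the data.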
To use the debugger, I added these patches
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu> git diff --no-ext-diff
diff --git a/epochX/cudacpp/gq_ttq.mad/Source/make_opts b/epochX/cudacpp/gq_ttq.mad/Source/make_opts
index e4b87ee6a..6ccc273c1 100644
--- a/epochX/cudacpp/gq_ttq.mad/Source/make_opts
+++ b/epochX/cudacpp/gq_ttq.mad/Source/make_opts
@@ -1,7 +1,7 @@
DEFAULT_CPP_COMPILER=g++
DEFAULT_F2PY_COMPILER=f2py3
DEFAULT_F_COMPILER=gfortran
-GLOBAL_FLAG=-O3 -ffast-math -fbounds-check
+GLOBAL_FLAG=-g -O3 -ffast-math -fbounds-check
MACFLAG=
MG5AMC_VERSION=SpecifiedByMG5aMCAtRunTime
PYTHIA8_PATH=NotInstalled
diff --git a/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk b/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
index 89da34009..b8fa4e131 100644
--- a/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
+++ b/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
@@ -387,6 +387,10 @@ else
###override OMPFLAGS = # disable OpenMP MT on all other platforms (default before #575)
endif
+# Debug SIGFPE crash #845
+override OMPFLAGS=
+override OPTFLAGS=-g -O3
+
#-------------------------------------------------------------------------------
#=== Configure defaults and check if user-defined choices exist for RNDGEN (legacy!), HASCURAND, HASHIPRAND
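I retried exactly the same recipe on https://github.com/madgraph5/madgraph4gpu/commit/19a2e0cb85e2306affdb143945026623e421b588
It seems to crash at line 1189 instead of 1193? But it looks like the same crash (line 1189 computes the okcol that line 1193 tests, so in an -O3 build the debugger may attribute the same instruction to either line)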
cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp
...
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f7a7cc23860 in ???
#1 0x7f7a7cc22a05 in ???
#2 0x7f7a7c854def in ???
#3 0x7f7a7d2f0d6f in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1189
#4 0x7f7a7d2f7a3d in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
#5 0x7f7a7d2fd2d1 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
#6 0x7f7a7d2fd2d1 in fbridgesequence_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
#7 0x43008b in smatrix1_multi_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
#8 0x431c10 in dsig1_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
#9 0x432d47 in dsigproc_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
#10 0x433b1e in dsig_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
#11 0x44a921 in sample_full_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
#12 0x42ebbf in driver
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
#13 0x40371e in main
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
Floating point exception (core dumped)
and through gdb
Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98d6f in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>,
allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080,
allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1189
1189 const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0 0x00007ffff7f98d6f in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>,
allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080,
allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1189
#1 0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1)
at MatrixElementKernels.cc:115
#2 0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cb50, selhel=0x7fffffc2cb50,
mes=0x7fffffc3cb50, channelId=1, rndcol=0x7fffffc9ceb0, rndhel=0x7fffffcbceb0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>,
this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173
#3 fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbceb0,
rndcol=0x7fffffc9ceb0, pchannelId=<optimized out>, mes=0x7fffffc3cb50, selhel=0x7fffffc2cb50, selcol=0x7fffffc1cb50) at fbridge.cc:106
#4 0x000000000043008c in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>,
hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>,
col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1,
out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=...,
vecsize_used=16384) at auto_dsig1.f:618
#5 0x0000000000431c11 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>,
all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>,
all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>,
all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>,
all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0,
all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445
#6 0x0000000000432d48 in dsigproc_vec (all_p=...,
all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>,
all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>,
all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1,
symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0,
all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034
#7 0x0000000000433b1f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1,
all_out=..., vecsize_used=16384) at auto_dsig.f:327
#8 0x000000000044a922 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d10 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384)
at dsample.f:208
#9 0x000000000042ebc0 in driver () at driver.f:256
#10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:301
#11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#13 0x0000000000403845 in _start ()
(gdb) l
1184 const int ievt = ievt00 + ieppV;
1185 //printf( "sigmaKin: ievt=%4d rndcol=%f\n", ievt, allrndcol[ievt] );
1186 for( int icolC = 0; icolC < ncolor; icolC++ )
1187 {
1188 #if defined MGONGPU_CPPSIMD
1189 const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1190 #else
1191 const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1192 #endif
1193 if( okcol )
(gdb) p okcol
$1 = <optimized out>
(gdb) p ievt
$2 = <optimized out>
(gdb) p ieppV
$3 = <optimized out>
(gdb) p neppV
$4 = 16
(gdb) p icolC
$5 = <optimized out>
(gdb) p ncolor
$6 = 4
(gdb) p allrndcol
$7 = (const mgOnGpu::fptype *) 0x6300d00
(gdb) p targetamp
$8 = {{3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475,
0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05,
0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485,
5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05, 0.00301690097,
9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485, 5.8289319e-05,
0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05, 0.00301690097, 9.26938374e-05,
0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485, 5.8289319e-05, 0.00402065413,
0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}}
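The frame names in the "Backtrace for this error" output are mangled C++ symbols (for example `_ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i` in frame #3). A minimal sketch for decoding them is shown below. It assumes a GCC or Clang toolchain, since `abi::__cxa_demangle` comes from `<cxxabi.h>`, and `demangle` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: demangle an Itanium-ABI symbol name (GCC/Clang only).
// Returns the input unchanged if demangling fails.
inline std::string demangle( const std::string& mangled )
{
  int status = 0;
  char* out = abi::__cxa_demangle( mangled.c_str(), nullptr, nullptr, &status );
  if( status != 0 || out == nullptr ) return mangled;
  std::string result( out );
  std::free( out ); // the buffer is allocated with malloc by __cxa_demangle
  return result;
}
```

For frame #3 this should yield a `mg5amcCpu::sigmaKin(...)` signature with `float` pointer arguments, consistent with the FPTYPE=f build. The same can be done on the command line by piping the backtrace through `c++filt`.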
I changed the name to indicate that this crash is most likely related to iconfig-channel mapping issues.
I will instead remove "iconfig-channel mapping issues" from the name of #855, which is ONLY about the rotxxx crash, most likely unrelated to iconfig-channel mapping issues.
Note:
I have almost completed MR #873, which fixes the channelid-iconfig mapping and icolamp issues in #856.
Unfortunately, however, this does NOT fix this intermittent crash #845.
I have reproduced it again
cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp
Using '-g' in make_opts and cudacpp.mk, this sometimes crashes as follows
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f04f8a23860 in ???
#1 0x7f04f8a22a05 in ???
#2 0x7f04f8654def in ???
#3 0x7f04f91f200c in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1193
#4 0x7f04f9096575 in ???
#5 0x7f04f91eec89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1095
#6 0x7f04f91f8bfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
#7 0x7f04f91fe491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
#8 0x7f04f91fe491 in fbridgesequence_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
#9 0x4300eb in smatrix1_multi_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
#10 0x431c70 in dsig1_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
#11 0x432da7 in dsigproc_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
#12 0x433b7e in dsig_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
#13 0x44a9c1 in sample_full_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
#14 0x42ebdf in driver
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:257
#15 0x40371e in main
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:302
Floating point exception (core dumped)
I have changed the name (previously "Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (in sigmakin random color selection - iconfig-channel mapping issues?)") because I no longer see a connection to color selection...
I have renamed this issue to mention "(for cpp512z with FPTYPE=f only: fix it with 'volatile')"
Indeed, I checked that this only happens for cpp512z with FPTYPE=f. So it clearly looks like a SIMD-specific optimization issue, like those that I fixed with 'volatile' in many other parts of the code. And indeed I just created a patch that fixes the issue
--- a/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc
+++ b/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc
@@ -1190,7 +1190,8 @@ namespace mg5amcCpu
for( int icolC = 0; icolC < ncolor; icolC++ )
{
#if defined MGONGPU_CPPSIMD
- const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
+ // Add volatile here to avoid SIGFPE crashes in FPTYPE=f cpp512z builds (#845)
+ volatile const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
#else
const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
#endif
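For readers without the full CPPProcess.cc at hand, the pattern in the patch can be sketched as a standalone scalar function. This is only an illustration under stated assumptions: `selectColor` and `cumAmp` are hypothetical names, and `cumAmp` is assumed to hold cumulative (non-decreasing) weights, which is what the normalization by the last element suggests. It does not reproduce the SIMD build or the crash; it only shows where 'volatile' sits in the loop.

```cpp
#include <cassert>

// Hypothetical scalar stand-in for the color selection loop in sigmaKin.
// cumAmp[i] is assumed to be a cumulative weight up to color i, so that
// cumAmp[ncolor-1] is the normalization. Returns a Fortran-style index in
// [1,ncolor], or 0 if rnd is not below any normalized threshold.
inline int selectColor( const float* cumAmp, int ncolor, float rnd )
{
  assert( ncolor > 0 );
  for( int icol = 0; icol < ncolor; icol++ )
  {
    // Same pattern as the patch: 'volatile' forces okcol to be stored and re-read
    volatile const bool okcol = rnd < ( cumAmp[icol] / cumAmp[ncolor - 1] );
    if( okcol ) return icol + 1; // NB Fortran [1,ncolor], C++ [0,ncolor-1]
  }
  return 0;
}
```

With weights {0.1, 0.3, 0.6, 1.0}, a random number of 0.2 selects color 2. The 'volatile' qualifier does not change the result; presumably it only constrains how the optimizer may transform the comparison and the division.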
By the way, note this interesting post on SIMD and floats: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993
This is now fixed in CODEGEN in #874. I think this can be closed when that PR is merged.
While rerunning tests in PR #841 I came across a new FPE "Floating-point exception - erroneous arithmetic operation" in gqttq tmad tests.
This is very surprising because I think that there is actually no change in the code (just some makefile changes leading to file name changes). I will try to rerun the test.
Anyway, for reference the issue is here in tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt