Closed: valassi closed this issue 5 days ago
Hi @oliviermattelaer thanks for the patch in PR https://github.com/madgraph5/madgraph4gpu/pull/850! This seems to fix the other issue #825 on the cross section mismatch for susy_gg_tt (pending investigations in other processes).
After PR #850, the code does provide a cross-section (even if the statement might be machine specific). But in any case the cross-section does not match (1.098 for fortran, where LTS is at 1.101, versus 1.357 for C++). So this still needs investigation
However I am puzzled by your statement. As mentioned in https://github.com/madgraph5/madgraph4gpu/pull/850#issuecomment-2139208110 the susy_gg_t1t1 test still gives no cross section in my test. Can you confirm you see a cross section?
And/or can you try to run this script (from epochX/cudacpp) and see what it gives? (Do a git diff afterwards)
./tmad/teeMadX.sh -susyggt1t1 +10x
I would be curious to see if in your environment this succeeds...
Thanks! Andrea
Actually none of the ./tmad/teeMadX.sh scripts are working on my laptop... (not even eemumu); they all crash due to the google test/cmake issue
So yes this is not working, but that is no real information to take away. But I do confirm that running "as a user" provides a non-zero (but wrong) cross-section. So I will start by investigating that mismatch and then hopefully this will fix your issue too (or we will need to iterate)
I have checked a subset of diagrams (the physical meaning not being the point) --a cross means agreement--:
So it seems that the issue is quite subtle here, since each diagram "alone" works but not when combined. I am looking for a wrong phase for the moment, but I do not have any clear indication of the issue behind the above point.
@roiser Looks like this could be something for you to investigate on the ordering of the couplings. Here are the values that I got for the couplings in fortran:
c4 = (0, 1.27)
c_3v = (-1.12, 0)
c_3s = (0, -1.12)
(the exact values are not important since they are all running couplings) and here are the values that I have for the CPP code:
c4= (-1.14, 0)
c_3v = (0, 1.12)
c_3s = (0, -1.14)
So in fortran I do have
c_3s = i c_3v
while in CPP
c_3s = i c_4
Which sounds like an ordering issue. A second issue is that the couplings seem to have a phase difference between fortran and cpp (which is not really problematic since a global phase disappears), but can you check that you do not have a swap of the real/imaginary components at the same time?
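For what it's worth, the two relations can be checked mechanically with complex arithmetic. A minimal Python sketch using the printed values (the dictionary keys are just labels for this check, not identifiers from the code):

```python
# Coupling values as printed above (fortran vs cpp); only relative phases matter.
fortran = {"c4": 1.27j, "c_3v": -1.12 + 0j, "c_3s": -1.12j}
cpp = {"c4": -1.14 + 0j, "c_3v": 1.12j, "c_3s": -1.14j}

def equals_i_times(a, b, tol=1e-9):
    """True if a == i*b within tolerance."""
    return abs(a - 1j * b) < tol

# Fortran: c_3s = i * c_3v, as stated
fortran_ok = equals_i_times(fortran["c_3s"], fortran["c_3v"])
# C++: c_3s = i * c4 instead, i.e. the couplings look reordered
cpp_swapped = equals_i_times(cpp["c_3s"], cpp["c4"])
```

Both relations hold with these numbers, which supports the reordering hypothesis rather than a real/imaginary swap.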
Thanks,
Olivier
Actually none of the ./tmad/teeMadX.sh scripts are working on my laptop... (not even eemumu); they all crash due to the google test/cmake issue
Hi Olivier, thanks. Two points
@oliviermattelaer I tried to 'run as a user'. I still get a crash. From our gpucpp branch (HEAD detached at f9f957918), inside ./bin/mg5_aMC
I do
set stdout_level DEBUG
set zerowidth_tchannel F
import model MSSM_SLHA2
generate g g > t1 t1~
output madevent_simd susy_gg_t1t1.mad --hel_recycling=False --vector_size=32
launch
Except for launch, this is what is used to generate the code in the repo (see file https://github.com/madgraph5/madgraph4gpu/blob/master/epochX/cudacpp/susy_gg_t1t1.mad/mg5.in).
With default parameters for launch, this eventually gives me
...
Using random number seed offset = 21
INFO: Running Survey
Creating Jobs
Working on SubProcesses
[Errno 2] No such file or directory: '/data/avalassi/GPU2023/madgraph4gpuX/MG5aMC/mg5amcnlo/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/Hel/selection'
INFO: P1_gg_t1t1x
Building madevent in madevent_interface.py with 'cpp' matrix elements
INFO: Idle: 1, Running: 1, Completed: 0 [ current time: 13h22 ]
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f335fa23860 in ???
#1 0x7f335fa22a05 in ???
#2 0x7f335f654def in ???
#3 0x7f33601722d5 in ???
#4 0x7f3360044575 in ???
#5 0x7f336016fef1 in ???
#6 0x7f3360173d5d in ???
#7 0x7f3360179363 in ???
#8 0x42044f in ???
#9 0x42158d in ???
#10 0x421de9 in ???
#11 0x4224c0 in ???
#12 0x432c88 in ???
#13 0x41fba4 in ???
#14 0x41ffe5 in ???
#15 0x7f335f63feaf in ???
#16 0x7f335f63ff5f in ???
#17 0x4036b4 in ???
#18 0xffffffffffffffff in ???
rm: cannot remove 'results.dat': No such file or directory
ERROR DETECTED
INFO: Idle: 0, Running: 1, Completed: 1 [ 0.19s ]
INFO: Idle: 0, Running: 0, Completed: 2 [ 0.21s ]
INFO: End survey
refine 10000
Creating Jobs
INFO: Refine results to 10000
INFO: Generating 10000.0 unweighted events.
Error when reading /data/avalassi/GPU2023/madgraph4gpuX/MG5aMC/mg5amcnlo/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/G3/results.dat
Command "generate_events run_01" interrupted with error:
Exception : Reported error: End code 136.0
...
Note: SIGFPE: Floating-point exception - erroneous arithmetic operation.
is what I had also seen once in gqttq; then it had disappeared, see #845. Most likely here it is here to stay. I guess this is what should be debugged.
I am very puzzled that you do not see it. Maybe it is because you are on a Mac? Can you try on a Linux box please?
Thanks Andrea
So yes, on our haswell node of the cluster (since this may be hardware specific) it does crash:
Backtrace for this error:
#0 0x7ff7388beb4f in ???
#1 0x7ff739888bc9 in _ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/CPPProcess.cc:1155
#2 0x7ff7398326e5 in GOMP_parallel
at ../../../libgomp/parallel.c:178
#3 0x7ff739888797 in _ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/CPPProcess.cc:1059
#4 0x7ff739891d0f in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/MatrixElementKernels.cc:115
#5 0x7ff7398944b9 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/Bridge.h:390
#6 0x7ff73989614f in fbridgesequence_
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/fbridge.cc:106
#7 0x42e62e in smatrix1_multi_
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/auto_dsig1.f:574
#8 0x42fd15 in dsig1_vec_
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/auto_dsig1.f:401
#9 0x430aec in dsigproc_vec_
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/auto_dsig.f:1031
#10 0x431510 in dsig_vec_
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/auto_dsig.f:327
#11 0x4449c5 in sample_full_
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/Source/dsample.f:208
#12 0x42dd22 in driver
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/driver.f:256
#13 0x42e16d in main
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/driver.f:301
Floating point exception (core dumped)
I am even more confused, but the issue I see in tmad tests is NOT a SIGFPE.
Essentially, to reproduce outside tmad tests:
make -j BACKEND=cppnone
./madevent_cpp < input_susyggt1t1_x1_cudacpp
In practice:
make -j BACKEND=cppnone
cat > input_susyggt1t1_x1_cudacpp << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
./madevent_cpp < input_susyggt1t1_x1_cudacpp
What I get is an error which ends like this
...
Running Configuration Number: 1
Not subdividing B.W.
Attempting mappinvarients 1 4
Determine nb_t
T-channel found: 0
Completed mapping 4
about to integrate 4 8192 1 1 4 1
Keeping grid fixed.
4 dimensions 8192 events 4 invarients 1 iterations 1 config(s), (0.99)
Using h-tuple random number sequence.
Error opening grid
Using Uniform Grid! 16
Using uniform alpha 1.0000000000000000
Grid defined OK
Masses: 0.000E+00 0.000E+00 0.400E+03 0.400E+03
Set CM energy to 13000.00
Mapping Graph 1 to config 1
Determine nb_t
T-channel found: 0
Transforming s_hat 1/s 3 3.7807079264437879E-003 638939.63956899999 168999999.99999997
Error opening symfact.dat. No permutations used.
Using random seed offsets 1 : 1
with seed 21
Ranmar initialization seeds 27402 9395
Particle 3 4
Et > 0.0 0.0
E > 0.0 0.0
Eta < -1.0 -1.0
xqcut: 0.0 0.0
d R # 3 > -0.0 0.0
s min # 3> 0.0 0.0
xqcutij # 3> 0.0 0.0
RESET CUMULATIVE VARIABLE
NGOODHEL = 4
NCOMB = 4
MULTI_CHANNEL = TRUE
CHANNEL_ID = 2
RESET CUMULATIVE VARIABLE
4096 points passed the cut but all returned zero
therefore considering this contribution as zero
Deleting file events.lhe
So yes, on our haswell node of the cluster (since this may be hardware specific) it does crash:
Thanks @oliviermattelaer ! We were writing at the same time.
Weird, so MANY issues.
If I do a launch, it crashes with some SIGFPE, but apparently only on Linux... (I will not do much more before I leave on holiday in a few days; hopefully you'll find out more in the meantime! thanks)
In practice, go to epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x and then
make -j BACKEND=cppnone
cat > input_susyggt1t1_x1_cudacpp << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
./madevent_cpp < input_susyggt1t1_x1_cudacpp
@oliviermattelaer can you try also this one for curiosity please? both from linux and mac... thanks
Backtrace for this error:
0 0x7ff7388beb4f in ???
1 0x7ff739888bc9 in _ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0
at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/CPPProcess.cc:1155
2 0x7ff7398326e5 in GOMP_parallel
at ../../../libgomp/parallel.c:178
And @oliviermattelaer, another question: what is at this line? CPPProcess.cc:1155
Thanks Andrea
PS In my case lines 1154-1156 are
#if defined MGONGPU_CPPSIMD
const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
#else
Anyway, I confirm that I also reproduce the "SIGFPE: erroneous arithmetic operation" from the repo with
./bin/generate_events run01
I have no idea why I get a SIGFPE crash in this mode, but an empty cross section and no crash if I run madevent manually
Hi,
yes the line is indeed:
const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
and no surprise, targetamp[ncolor-1] is here zero...
ievt 0 , ieppV, 0 , ncolor 2, max 0.000000
ievt 1 , ieppV, 1 , ncolor 2, max 0.000000
ievt 2 , ieppV, 2 , ncolor 2, max 0.000000
ievt 3 , ieppV, 3 , ncolor 2, max 0.000000
ievt 4 , ieppV, 0 , ncolor 2, max 0.000000
ievt 5 , ieppV, 1 , ncolor 2, max 0.000000
ievt 6 , ieppV, 2 , ncolor 2, max 0.000000
ievt 7 , ieppV, 3 , ncolor 2, max 0.000000
ievt 8 , ieppV, 0 , ncolor 2, max 0.000000
ievt 9 , ieppV, 1 , ncolor 2, max 0.000000
ievt 10 , ieppV, 2 , ncolor 2, max 0.000000
ievt 11 , ieppV, 3 , ncolor 2, max 0.000000
ievt 12 , ieppV, 0 , ncolor 2, max 0.000000
ievt 13 , ieppV, 1 , ncolor 2, max 0.000000
ievt 14 , ieppV, 2 , ncolor 2, max 0.000000
ievt 15 , ieppV, 3 , ncolor 2, max 0.000000
ievt 0 , ieppV, 0 , ncolor 2, max 3.101819
ievt 1 , ieppV, 1 , ncolor 2, max 3.336748
ievt 2 , ieppV, 2 , ncolor 2, max 2.629749
ievt 3 , ieppV, 3 , ncolor 2, max 3.350298
ievt 4 , ieppV, 0 , ncolor 2, max 4.813758
ievt 5 , ieppV, 1 , ncolor 2, max 2.866929
ievt 6 , ieppV, 2 , ncolor 2, max 2.657539
ievt 7 , ieppV, 3 , ncolor 2, max 4.112587
ievt 8 , ieppV, 0 , ncolor 2, max 8.225509
ievt 9 , ieppV, 1 , ncolor 2, max 3.152658
ievt 10 , ieppV, 2 , ncolor 2, max 2.698144
ievt 11 , ieppV, 3 , ncolor 2, max 2.709947
ievt 12 , ieppV, 0 , ncolor 2, max 2.629970
ievt 13 , ieppV, 1 , ncolor 2, max 2.769736
ievt 14 , ieppV, 2 , ncolor 2, max 2.622860
ievt 15 , ieppV, 3 , ncolor 2, max 6.583436
ievt 0 , ieppV, 0 , ncolor 2, max 0.000000
ievt 1 , ieppV, 1 , ncolor 2, max 0.000000
ievt 2 , ieppV, 2 , ncolor 2, max 0.000000
ievt 3 , ieppV, 3 , ncolor 2, max 0.000000
ievt 4 , ieppV, 0 , ncolor 2, max 0.000000
ievt 5 , ieppV, 1 , ncolor 2, max 0.000000
ievt 6 , ieppV, 2 , ncolor 2, max 0.000000
ievt 7 , ieppV, 3 , ncolor 2, max 0.000000
ievt 8 , ieppV, 0 , ncolor 2, max 0.000000
ievt 9 , ieppV, 1 , ncolor 2, max 0.000000
ievt 10 , ieppV, 2 , ncolor 2, max 0.000000
ievt 11 , ieppV, 3 , ncolor 2, max 0.000000
ievt 12 , ieppV, 0 , ncolor 2, max 0.000000
ievt 13 , ieppV, 1 , ncolor 2, max 0.000000
ievt 14 , ieppV, 2 , ncolor 2, max 0.000000
ievt 15 , ieppV, 3 , ncolor 2, max 0.000000
What surprises/interests me is that it is 0 (or not) for a full block of 16 events... which might be a symfact related issue... And "YES", if I remove the symmetric channel this never happens...
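To make the failure mode at that line concrete: if every colour amplitude in a block is zero, the cumulative-amplitude normalisation divides zero by zero. A minimal Python sketch of the logic (a hypothetical simplification, not the actual cudacpp code):

```python
def choose_color(rndcol, targetamp):
    """Pick a colour from cumulative |amp|^2 values; mirrors the okcol test above."""
    for icol in range(len(targetamp)):
        # equivalent of: allrndcol[ievt] < targetamp[icolC] / targetamp[ncolor-1]
        if rndcol < targetamp[icol] / targetamp[-1]:
            return icol
    return len(targetamp) - 1

# A healthy event block works fine...
assert choose_color(0.4, [3.1, 6.2]) == 0
# ...but a block where all amplitudes are zero hits 0.0/0.0: Python raises,
# while in C++ this is a silent NaN, or a SIGFPE if FP traps are enabled.
try:
    choose_color(0.4, [0.0, 0.0])
    all_zero_ok = True
except ZeroDivisionError:
    all_zero_ok = False
```

Whether the real code traps as SIGFPE or silently produces NaN would depend on whether floating-point exceptions are enabled by the Fortran driver.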
What surprises/interests me is that it is 0 (or not) for a full block of 16 events... which might be a symfact related issue...
Thanks Olivier. No, I have rerun with cppnone in the runcards and then reran ./bin/generate_events run01
(note: I need the patch in PR #851). I still get the same SIGFPE crash. So I assume that this is NOT a SIMD issue. But I will do more tests.
I have no idea why I get a SIGFPE crash in this mode, but an empty cross section and no crash if I run madevent manually
Ok, interesting, I got this one.
Note: my 'tmad' tests always use channel=1. When doing a launch, it launches several processes, including channel 3, which gives the crash.
And gdb on channel 3 tells me that the crash is in the fortran code for phase space sampling?
Program received signal SIGFPE, Arithmetic exception.
0x000000000043809f in rotxxx_ ()
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0 0x000000000043809f in rotxxx_ ()
#1 0x0000000000405820 in gentcms_ ()
#2 0x00000000004067b2 in one_tree_ ()
#3 0x0000000000408c72 in gen_mom_ ()
#4 0x000000000040a0aa in x_to_f_arg_ ()
#5 0x0000000000444fe0 in sample_full_ ()
#6 0x000000000042bb39 in MAIN__ ()
#7 0x000000000040371f in main ()
And even more strange, if I use specifically the fortran MEs, i.e. madevent_fortran, then there is no crash.
More on the SIGFPE crash in madevent_cudacpp.
If I build the fortran part with -g:
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check # crashes
GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -g # crashes with debug info
Then (the `txt3` file is the one whose name ends with 3):
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x> gdb ./madevent_cpp
...
(gdb) run < input_app.txt3
...
Setting grid 1 0.94518E-03 1
Transforming s_hat 1/s 3 3.7807079264437879E-003 638939.63956899999 168999999.99999997
Error opening symfact.dat. No permutations used.
Using random seed offsets 3 : 1
with seed 57
Ranmar initialization seeds 11126 9433
Program received signal SIGFPE, Arithmetic exception.
rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
1247 prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0 rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
#1 0x0000000000405820 in gentcms (pa=..., pb=..., t=-41214.684204869853, phi=4.7054556515034918, ma2=0, m1=399.66849999999999,
m2=399.66849999999999, p1=..., pr=..., jac=1061858298.7999998) at genps.f:1480
#2 0x00000000004067b2 in one_tree (itree=..., tstrategy=<optimized out>, iconfig=3, nbranch=2, p=..., m=..., s=..., x=...,
jac=1061858298.7999998, pswgt=1) at genps.f:1167
#3 0x0000000000408c72 in gen_mom (iconfig=3, mincfig=3, maxcfig=3, invar=4, wgt=0.00020000000000000001, x=..., p1=...) at genps.f:68
#4 0x000000000040a0aa in x_to_f_arg (ndim=4, iconfig=3, mincfig=3, maxcfig=3, invar=4, wgt=0.00020000000000000001, x=..., p=...)
at genps.f:60
#5 0x0000000000444fe0 in sample_full (ndim=4, ncall=1000, itmax=5, itmin=3, dsig=0x430440 <dsig>, ninvar=4, nconfigs=1,
vecsize_used=16384) at dsample.f:172
#6 0x000000000042bb39 in driver () at driver.f:256
#7 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:301
#8 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#9 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x0000000000403845 in _start ()
(gdb) print q
$1 = (505.75540315099767, 0, 0, 505.75540315099767)
(gdb) print q(1)
$2 = 0
(gdb) print q
$3 = (505.75540315099767, 0, 0, 505.75540315099767)
(gdb) print qq
$4 = <optimized out>
(gdb) print qt
$5 = <optimized out>
(gdb) print p1
$6 = <optimized out>
(gdb) print p
$7 = (505.75540315099767, -0.28985446571100887, -41.805289511944629, 307.09257834957424)
(gdb)
$8 = (505.75540315099767, -0.28985446571100887, -41.805289511944629, 307.09257834957424)
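Reading off the gdb printout: q is purely longitudinal, so its transverse momentum qt vanishes and the division at aloha_functions.f:1247 is by zero. A small Python sketch of the kinematics (the real rotxxx does guard qt against zero, so presumably the trap under -O3 comes from the guarded division being evaluated speculatively by the vectorizer):

```python
import math

# q = (E, px, py, pz) as printed by gdb: purely longitudinal.
q = (505.75540315099767, 0.0, 0.0, 505.75540315099767)

qq = math.sqrt(q[1]**2 + q[2]**2 + q[3]**2)  # |q|: fine, ~505.755
qt = math.sqrt(q[1]**2 + q[2]**2)            # transverse part: exactly 0.0

# aloha_functions.f:1247 computes ... q(1)*q(3)/qq/qt*p1 ..., i.e. divides by qt.
divides_by_zero = (qt == 0.0)
```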
And not surprisingly having -O2 in make_opts means that the SIGFPE disappears. And here we go again, but this one is deep in the Fortran.
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check # crashes
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -g # crashes with debug info
GLOBAL_FLAG=-O2 -ffast-math -fbounds-check -g # does not crash
In this case
./madevent_cpp < input_app.txt3
completes without a crash, and even gives a nice cross section.
(Note1 for myself: I tried to add gdb in survey.sh, which is where the madevent executable is called. But this is very complex, also because you need an input redirection, and I managed to get my keyboard/screen frozen. Instead: build with 'make -j' normally and then call madevent_cpp; it is easy to reproduce the issues.)
(Note2 for myself: the input_app.txt in SubProcesses/P1* does not have the channel line, and hence fails with a different error. You must add the channel line manually, or just take it from the G1, G2, G3 subdirectories.)
Summary: here we go again, this is a SIGFPE that appears only in some situations; in particular it only appears in optimized code, where it is very difficult to debug. What is new is that this is a SIGFPE in Fortran, not in cudacpp: but strangely enough it only appears if the MEs (another part of the code!) use cudacpp and not fortran...
Anyway, possible solution? Disable vectorisation in fortran!
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check # crashes
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -g # crashes with debug info
#GLOBAL_FLAG=-O2 -ffast-math -fbounds-check -g # does not crash
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -mno-sse3 # crashes
GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -fno-tree-vectorize # does not crash
#GLOBAL_FLAG=-O0 -g -fbounds-check # no crash
I tried and -fno-tree-vectorize
seems to remove the SIGFPE. (So @oliviermattelaer you were kind of right that the issue is SIMD, but actually in fortran).
NB This needs many more checks, because I should check that the cudacpp builds are still ok: they should be, as I think that GLOBAL_FLAG only touches fortran, not cudacpp (which I think is good)
@oliviermattelaer what do you think, is it ok to add -fno-tree-vectorize
to GLOBAL_FLAG if this removes the SIGFPE in Fortran? Thanks
Ouff this is really annoying.
Without -fno-tree-vectorize in GLOBAL FLAGS
With -fno-tree-vectorize (and also -g) in GLOBAL FLAGS
Without gdb
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f9f01a23860 in ???
#1 0x7f9f01a22a05 in ???
#2 0x7f9f01654def in ???
#3 0x7f9f02193169 in ???
#4 0x7f9f02065575 in ???
#5 0x7f9f02190eaf in ???
#6 0x7f9f02194d4d in ???
#7 0x7f9f0219a3d4 in ???
#8 0x42e238 in smatrix1_multi_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/auto_dsig1.f:574
#9 0x42f844 in dsig1_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/auto_dsig1.f:401
#10 0x4308a7 in dsigproc_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/auto_dsig.f:1031
#11 0x4315e9 in dsig_vec_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/auto_dsig.f:327
#12 0x44763a in sample_full_
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/dsample.f:208
#13 0x42cd30 in driver
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:256
#14 0x40370e in main
at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:301
With gdb (strange, I had to try with and without, and did not always get the same result? Ah of course, maybe different random seeds?)
Program received signal SIGFPE, Arithmetic exception.
0x000000000040f3fb in unwgt (px=..., wgt=5.1435160241293966e-05, numproc=1, ihel=2, icol=1, ivec=4) at unwgt.f:257
257 if (local_twgt .gt. 0) uwgt=uwgt/twgt/fudge
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0 0x000000000040f3fb in unwgt (px=..., wgt=5.1435160241293966e-05, numproc=1, ihel=2, icol=1, ivec=4) at unwgt.f:257
#1 0x000000000042f97a in dsig1_vec (all_pp=<error reading variable: value requires 2097152 bytes, which is more than max-value-size>,
all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>,
all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>,
all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>,
all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0,
all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384)
at auto_dsig1.f:447
#2 0x00000000004308a8 in dsigproc_vec (all_p=...,
all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>,
all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>,
all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1,
symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>,
imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384)
at auto_dsig.f:1031
#3 0x00000000004315ea in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1,
all_out=..., vecsize_used=16384) at auto_dsig.f:327
#4 0x000000000044763b in sample_full (ndim=4, ncall=1000, itmax=5, itmin=3, dsig=0x4317e0 <dsig>, ninvar=4, nconfigs=1,
vecsize_used=16384) at dsample.f:208
#5 0x000000000042cd31 in driver () at driver.f:256
#6 0x000000000040370f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:301
#7 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#8 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#9 0x0000000000403835 in _start ()
...
(gdb) l
252 yran = xran1(idum)
253 if (xwgt .gt. local_twgt*fudge*yran) then
254 uwgt = max(xwgt,local_twgt*fudge)
255 c Set sign of uwgt to sign of wgt
256 uwgt = dsign(uwgt,wgt)
257 if (local_twgt .gt. 0) uwgt=uwgt/twgt/fudge
258 c call write_event(p,uwgt)
259 c write(29,'(2e15.5)') matrix,wgt
260 c $B$ S-COMMENT_C $B$
261 call write_leshouche(p,uwgt,numproc,.True., ihel, icol, ivec)
(gdb) p twgt
$1 = 0
(gdb) p local_twgt
$2 = <optimized out>
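One thing the listing and the gdb printout together suggest (an observation, not a confirmed diagnosis): line 257 guards on local_twgt but divides by twgt, and gdb shows twgt == 0. If local_twgt and twgt can ever disagree, the guard does not protect the division. A Python sketch of that mismatch:

```python
def rescale(uwgt, local_twgt, twgt, fudge=1.0):
    """Mirror of unwgt.f:257: the test uses local_twgt, the divisor is twgt."""
    if local_twgt > 0:
        uwgt = uwgt / twgt / fudge  # divides by twgt, not by local_twgt
    return uwgt

# Consistent values are fine...
assert rescale(2.0, local_twgt=4.0, twgt=4.0) == 0.5
# ...but local_twgt > 0 with twgt == 0 (the value gdb printed) divides by zero:
try:
    rescale(2.0, local_twgt=4.0, twgt=0.0)
    mismatch_trapped = False
except ZeroDivisionError:
    mismatch_trapped = True
```

Whether local_twgt is just a copy of twgt (making the mismatch harmless in practice) is something only the MG5aMC source can answer.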
And so, again and again, variables that are optimized out make debugging really complex.
Hi Andrea,
Thanks for looking at this. I would say that we should try to avoid "-fno-tree-vectorize", since this is likely hiding the issue more than solving it.
I would say that this process seems to have multiple problems (likely independent):
1. one issue seems to be the coupling ordering in cudacpp, which is not consistent with the fortran ordering
2. a second issue seems to be on the fortran side (or at least related to the fortran side), where something goes wrong in the handling of symmetric channels (G3 also handles G5, and the issue is likely in G5): this makes some weights always zero (not initialised?), which likely triggers a lot of bad stuff in reaction...
The second one is clearly "for me", since it is clearly related to fortran (even if it seems to work when using fortran matrix-elements); I will try to reproduce the issue in a pure fortran code (but likely tomorrow)
Hi Olivier, thanks to you!
I would say that this process seems to have multiple problem (likely independant):
Totally agree
1. one issue seems to be the coupling ordering in cudacpp which is not consistent with fortran ordering
This I have not seen, but you and Stefan seem to know more about it. I am happy if you look at that!
2. a second issue seems to be on the fortran side (or at least related to the fortran side), where something goes wrong in the handling of symmetric channels (G3 also handles G5, and the issue is likely in G5): this makes some weights always zero (not initialised?), which likely triggers a lot of bad stuff in reaction...
This I also do not understand (maybe it is your correct explanation for what I see, but I am not sure). Again, I am happy if you continue looking at that!
I think I see two issues which are independent from one another, and may or may not be independent from those above.
For channel 3 in this susy_gg_t1t1, I sometimes get SIGFPE crashes in Fortran. Actually I made more tests than those described above, and the only way I can get rid of them is a global flag -O1! Which is a lot. I do not understand why, but adding -fno-tree-vectorize only fixes the rotxxx crash; I then also get crashes in unwgt.f (and sometimes smatrix_multi), and the only way I can get rid of them is -O1. This is very difficult to debug because with -O2 and -O3 some variables are optimized out, so it is just a guess what happens. ALSO, it is weird that this happens only if you do things in a certain order; it may depend on the random seeds (which change with every run of generate_events), or maybe on the vegas grid (at least, I saw that if I run channel 3 alone I get different things than running channel 1 then 3). Summary: a complete mess to debug. My suggestion for this is: let's sort out the other issues; maybe the SIGFPE is caused by them and will disappear.
The other issue, independent from the SIGFPE, is the original problem: I was seeing no cross section. Once I use -O1 and see no SIGFPE crashes, I still get no cross section, but only in channel 1. So here is my question for you: could it be that channel 1 in susy_gg_t1t1 is suppressed? Should I just use another channel? Or is this related to your points 1 and 2 above, some ordering that is wrong? I am asking because, for ALL other processes, I was always able to use channel 1 for some basic tests. So I am surprised that here it does not work (I thought channel 1 was 'special' and was guaranteed to always be non-suppressed in a way). Any suggestion there?
Anyway, this time I really leave it here, probably will resume end of June. Thanks Andrea
Ok found the issue (thanks @roiser for the help).
The issue here is that we have 6 amplitudes (so the channel can go from 1 to 6), but one amplitude does not have a channel associated to it (so only 5 channels). The information about which colours can be considered or not therefore has length 5, but the code tries to read the sixth entry... which means that we get random behaviour.
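In other words (a minimal Python sketch with hypothetical names; the real arrays live in the generated C++): the per-channel colour information has 5 entries while the amplitudes are indexed up to 6, so looking up the sixth entry reads past the end of the array:

```python
n_amplitudes = 6
# Only 5 amplitudes have an associated channel, so per-channel data has length 5.
color_info_per_channel = [0.20, 0.10, 0.30, 0.25, 0.15]

def lookup(amplitude_index):
    """Index per-channel data by amplitude: wrong for the unmapped 6th amplitude."""
    return color_info_per_channel[amplitude_index]

assert lookup(0) == 0.20
# The sixth amplitude (index 5) is out of bounds: Python raises IndexError,
# but C++ silently reads whatever memory follows the array -> random behaviour.
try:
    lookup(5)
    oob_caught = False
except IndexError:
    oob_caught = True
```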
Cheers,
Olivier
I have moved the SIGFPE crash analysis to #855.
I would keep this #826 only for the original issue: an empty cross section in iconfig=1 for susy_gg_t1t1
Hi,
yes the line is indeed:
const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
Hi @oliviermattelaer I am going through all your posts again. I think that what you see here looks exactly like #845. From my gdb
Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98d6f in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>,
allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080,
allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1189
1189 const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
I suggest that we refer to the zero cross section as 826, and the crash as 845, ok? Maybe they are related, maybe they are not...
And this crash 845 is most likely related to the color mismatch #856
About
Summary: here we go again, this is a SIGFPE that appears only in some situations, in particular it only appears in optimized code where it is very difficult to debug it. What is new is that this is a SIGFPE in Fortran, not in cudacpp: but strangely enough only appears if the MEs (another part of the code!) uses cudacpp and not fortran...
Anyway, possible solution? Disable vectorisation in fortran!
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check # crashes
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -g # crashes with debug info
#GLOBAL_FLAG=-O2 -ffast-math -fbounds-check -g # does not crash
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -mno-sse3 # crashes
GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -fno-tree-vectorize # does not crash
#GLOBAL_FLAG=-O0 -g -fbounds-check # no crash
and
I think I see two issues which are independent from one another, and may or may not be independent from those above.
3. For channel 3 in this susy_gg_t1t1, I sometimes get SIGFPE crashes in Fortran. Actually I made more tests than what described above, the only way I can get rid of them is by a global flag -O1!
This is a DIFFERENT CRASH (#855, in fortran rotxxx) than that in #845 (in cudacpp sigmaKin). And it is what I suggest to fix by adding volatile in #857.
I would say that this process seems to have multiple problem (likely independant):
About this point and the four mentioned above, again I agree, trying to summarise
- one issue seems to be the coupling ordering in cudacpp which is not consistent with fortran ordering
I do not see this in my tests, can you give me a reproducer please?
- a second issue seems to be on the fortran side (or at least related to the fortran side), where something goes wrong in the handling of symmetric channels (G3 also handles G5, and the issue is likely in G5): this makes some weights always zero (not initialised?), which likely triggers a lot of bad stuff in reaction...
Same thing, I do not see this in my tests, can you give me a reproducer please?
About the "not initialised": note that valgrind in one of my tests reported an uninitialised variable. Try to use valgrind maybe?
- For channel 3 in this susy_gg_t1t1, I sometimes get SIGFPE crashes in Fortran.
This is confirmed as #855 and can be fixed with volatile in #857
- The other issue that is independent from SIGFPE is the original problem, that I was seeing no cross section.
This is still this present #826.
In addition, related to iconfig-channel mapping, as discussed in #853 and #852
"5a." There is a crash in sigmakin color choice. I moved this to #845.
"5b". There is a color mismatch in LHE tests #856. But it is most likely related to #845 above in "5a".
Hi Andrea,
I think that the code behaves strangely due to an out-of-bounds issue (in the CPP part). That out-of-bounds access can corrupt some memory, leading the code (even the fortran part) to random crashes (crashing in a compiler-flag and machine specific way -> not reproducible). Since I/we have identified one such issue, I think the first thing we should do is fix that one and then re-investigate the other issues (for me #852 fixes all crashing issues).
So for me, the priority is to merge #852. I will work on it now to include more comments on it as you want. After that, my priority would be to understand which variable is reported as uninitialised by valgrind, since this will be a good hint of where the next issue is.
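To illustrate why an out-of-bounds write produces such non-reproducible symptoms (a deliberately contrived toy, not the real bug; the names `weights`/`results` are hypothetical): the write lands in a neighbouring object, so the failure surfaces far from the buggy line and depends on memory layout and compiler flags. The two slices here share one buffer so the demo itself stays well-defined:

```cpp
// One allocation carved into two logical arrays: an index that runs past
// the first slice silently clobbers the second, mimicking how an OOB write
// in one part of the code corrupts data used elsewhere.
double corrupted_neighbour( int bad_index )
{
  double buf[8] = {};
  double* weights = buf;      // logical slice A: weights[0..3]
  double* results = buf + 4;  // logical slice B: results[0..3]
  results[0] = 1.0;           // B initialised correctly...
  weights[bad_index] = -99.0; // ...then clobbered if bad_index >= 4
  return results[0];
}
```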
Hi Olivier, thanks for the message and sorry for not replying on this before.
Just replying here to try and summarise the various directions of work that this issue 826 triggered somehow. For me this was the work of this last (very busy) week.
About 'for me 852 fixes all issues': as discussed at/after the meeting this week, this was not the case for me. In particular, one specific crash was in rotxxx (#855): you had said that 852 fixed it for you, but I showed (#870) that it does not. Instead, I fixed this rotxxx crash by adding 'volatile' in fortran code (#857 and mg5amcnlo/mg5amcnlo#113, both merged)
(En passant, during this week I added tmad tests to the CI, so many of these issues - not all - can now be seen there more easily, #794, merged)
About your 852: this is certainly necessary to fix some iconfig-channel mappings that are relevant to fix #856 (LHE color mismatch). But by itself it is not enough, so I included it in #873, which also depends on mg5amcnlo/mg5amcnlo#115. I agree that merging these is now the highest priority. I would merge mg5amcnlo 115 and then 873, which automatically merges your 852 into master. If this also helps to fix some segfaults, so much the better (I do not see that).
There is another crash, #845, in sigmakin, only for SIMD 512z with FPTYPE=f. Initially I had the same impression as you, namely that it was an out of bounds, but I am quite sure that it is a SIMD optimization issue instead. I fixed this in #874, also to be merged. I would merge this after 873 however, as it contains it internally, so the differences become easier to review.
About the out-of-bounds issues: as I said, I also had the impression we had some out of bounds. But then I ran many tests, and honestly I never found a segmentation fault from an out of bounds: I only saw SIGFPE crashes (which are typically SIMD issues, and are fixed with volatile). I also used valgrind extensively. I found no issue in cudacpp with valgrind, but I fixed two minor issues in mg5amcnlo/mg5amcnlo#112 and mg5amcnlo/mg5amcnlo#110
About this specific 826! While it triggered all the work above, none of that work is relevant to it, I think. This 826 is about a zero cross section in a susy process. IIUC you and Stefan have tracked this down to a coupling issue. I reported that in #862 (maybe a duplicate of this #826, maybe not). In any case Stefan (thanks) is working on it
Note, this week with the work above I found yet another new problem, a cross section mismatch #872. Together with this 826, these are the only issues that will normally show up in the new CI until they are fixed (I have mechanisms to bypass the errors however, if required)
Voila this seems like a good summary of this week, replying to your points above.
Summary of the todo summary
This specific issue 826 about a zero cross section in a susy process is fixed by PR #918 (thanks @oliviermattelaer @roiser). Indeed it was caused by the couplings ordering issue #862.
Closing this as fixed by PR #918 (to be merged soon). Code regenerated in #934.
This is also related to #748 (xsec mismatch in gqttq, which was also due to the order of couplings)
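The coupling ordering failure mode can be sketched with a toy (this is not the real cudacpp ME code; `toy_me2` and the numbers are purely illustrative): if the coupling array is filled in one order (fortran) but consumed in another (cudacpp), each diagram taken alone can still look fine, but the coherent sum over diagrams - and hence |M|^2 and the cross section - changes, which matches the "each diagram alone agrees, the combination does not" observation above:

```cpp
#include <complex>
#include <vector>

using cxd = std::complex<double>;

// Toy |M|^2: two 'diagrams', each a fixed kinematic factor times one
// coupling; the result depends on which coupling sits at which index.
double toy_me2( const std::vector<cxd>& couplings )
{
  const cxd kin[2] = { cxd( 2.0, 0.0 ), cxd( -1.0, 1.0 ) };
  const cxd amp = kin[0] * couplings[0] + kin[1] * couplings[1];
  return std::norm( amp ); // |sum of diagrams|^2
}
```

Swapping the two couplings changes the interference term, so `toy_me2({a, b})` differs from `toy_me2({b, a})` in general, even though each single-diagram contribution is unchanged.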
In PR https://github.com/madgraph5/madgraph4gpu/pull/824 I fixed SUSY codegen, builds and internal cuda/cpp tests. But now I ALSO added the test comparing to fortran, and this fails
In tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt