madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

No cross section in SUSY gg_t1t1 log file #826

Closed valassi closed 5 days ago

valassi commented 4 months ago

In PR https://github.com/madgraph5/madgraph4gpu/pull/824 I fixed SUSY codegen, builds and internal cuda/cpp tests. But now I have ALSO added the test comparing to Fortran, and this one fails.

In tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt

*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_susyggt1t1_x1_cudacpp > /tmp/avalassi/output_susyggt1t1_x1_cudacpp'
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 4/4
 [XSECTION] VECSIZE_USED = 8192
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 1
 [XSECTION] ChannelId = 2
 [XSECTION] ERROR! No cross section in log file:
   /tmp/avalassi/output_susyggt1t1_x1_cudacpp
   ...
xqcutij # 3>     0.0     0.0
 RESET CUMULATIVE VARIABLE
 NGOODHEL =           4
 NCOMB =           4
 MULTI_CHANNEL = TRUE
 CHANNEL_ID =           2
 RESET CUMULATIVE VARIABLE
        4096  points passed the cut but all returned zero
 therefore considering this contribution as zero
 Deleting file events.lhe
oliviermattelaer commented 1 month ago

After PR #850, the code does provide a cross-section (even if this statement might be machine specific). But in any case the cross-section does not match (1.098 for Fortran (LTS is at 1.101) versus 1.357 for C++), so this still needs investigation.

valassi commented 1 month ago

Hi @oliviermattelaer thanks for the patch in PR https://github.com/madgraph5/madgraph4gpu/pull/850! This seems to fix the other issue #825 on the cross section mismatch for susy_gg_tt (pending investigations in other processes).

After PR #850, the code does provide a cross-section (even if this statement might be machine specific). But in any case the cross-section does not match (1.098 for Fortran (LTS is at 1.101) versus 1.357 for C++), so this still needs investigation.

However I am puzzled by your statement. As mentioned in https://github.com/madgraph5/madgraph4gpu/pull/850#issuecomment-2139208110 the susy_gg_t1t1 test still gives no cross section in my test. Can you confirm you see a cross section?

And/or can you try to run this script (from epochX/cudacpp) and see what it gives? (Do a git diff afterwards)

./tmad/teeMadX.sh -susyggt1t1 +10x

I would be curious to see if in your environment this succeeds...

Thanks! Andrea

oliviermattelaer commented 1 month ago

Actually none of the ./tmad/teeMadX.sh scripts are working on my laptop... (not even eemumu); they all crash due to the Google Test/CMake issue.

So yes, this is not working, but that is no real information to take away. I do confirm, however, that running "as a user" provides a non-zero (but wrong) cross-section. So I will start by investigating that mismatch, and then hopefully this will fix your issue too (or we will need to iterate).

I have checked subsets of diagrams (the physical meaning not being the point) -- a cross means agreement --:

So it seems that the issue is quite subtle here, since each diagram "alone" works but not when they are combined. I'm looking for a wrong phase for the moment, but I do not have any clear indication of the issue beyond the observation above.

oliviermattelaer commented 1 month ago

@roiser Looks like this could be something for you to investigate, on the ordering of the couplings. Here are the values that I got for the couplings in Fortran:

c4 = (0, 1.27)
c_3v = (-1.12, 0)
c_3s = (0, -1.12) 

(The exact values are not important since they are all running couplings.) Here are the values that I have for the CPP code:

c4= (-1.14, 0)
c_3v = (0, 1.12)
c_3s = (0, -1.14)

So in fortran I do have

c_3s = i c_3v

while in CPP

c_3s = i c_4

This sounds like an ordering issue. A second issue is that the couplings seem to have a phase difference between Fortran and C++ (which is not really problematic since a global phase disappears), but please check that you do not also have a swap of the real/imaginary components at the same time.

Thanks,

Olivier
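
As a quick numerical cross-check of the relations above, here is a minimal standalone C++ sketch (illustrative only: it uses just the values quoted in this comment, not any code or names from the repository):

#include <complex>
#include <cstdio>

int main()
{
  using cd = std::complex<double>;
  const cd I( 0., 1. );
  // Coupling values quoted above for Fortran
  const cd f_c3v( -1.12, 0. ), f_c3s( 0., -1.12 );
  // Coupling values quoted above for the C++/cudacpp code
  const cd c_c4( -1.14, 0. ), c_c3s( 0., -1.14 );
  // In Fortran the relation is c_3s = i * c_3v
  std::printf( "fortran: c3s=(%g,%g)  i*c3v=(%g,%g)\n",
               f_c3s.real(), f_c3s.imag(), ( I * f_c3v ).real(), ( I * f_c3v ).imag() );
  // In the C++ code the same relation holds for c_4 instead, i.e. c_3s = i * c_4,
  // which is what suggests that two couplings are swapped (an ordering issue)
  std::printf( "cpp:     c3s=(%g,%g)  i*c4 =(%g,%g)\n",
               c_c3s.real(), c_c3s.imag(), ( I * c_c4 ).real(), ( I * c_c4 ).imag() );
  return 0;
}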

valassi commented 1 month ago

Actually none of the ./tmad/teeMadX.sh scripts are working on my laptop... (not even eemumu); they all crash due to the Google Test/CMake issue.

Hi Olivier, thanks. Two points:

valassi commented 1 month ago

@oliviermattelaer I tried to 'run as a user'. I still get a crash. From our gpucpp branch (HEAD detached at f9f957918), inside ./bin/mg5_aMC I do

set stdout_level DEBUG
set zerowidth_tchannel F
import model MSSM_SLHA2
generate g g > t1 t1~
output madevent_simd susy_gg_t1t1.mad --hel_recycling=False --vector_size=32 
launch

Except for launch, this is what is used to generate the code in the repo (see the file https://github.com/madgraph5/madgraph4gpu/blob/master/epochX/cudacpp/susy_gg_t1t1.mad/mg5.in).

With default parameters for launch, this eventually gives me

...
Using random number seed offset = 21
INFO: Running Survey 
Creating Jobs
Working on SubProcesses
[Errno 2] No such file or directory: '/data/avalassi/GPU2023/madgraph4gpuX/MG5aMC/mg5amcnlo/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/Hel/selection'
INFO:     P1_gg_t1t1x  
Building madevent in madevent_interface.py with 'cpp' matrix elements
INFO:  Idle: 1,  Running: 1,  Completed: 0 [ current time: 13h22 ] 

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f335fa23860 in ???
#1  0x7f335fa22a05 in ???
#2  0x7f335f654def in ???
#3  0x7f33601722d5 in ???
#4  0x7f3360044575 in ???
#5  0x7f336016fef1 in ???
#6  0x7f3360173d5d in ???
#7  0x7f3360179363 in ???
#8  0x42044f in ???
#9  0x42158d in ???
#10  0x421de9 in ???
#11  0x4224c0 in ???
#12  0x432c88 in ???
#13  0x41fba4 in ???
#14  0x41ffe5 in ???
#15  0x7f335f63feaf in ???
#16  0x7f335f63ff5f in ???
#17  0x4036b4 in ???
#18  0xffffffffffffffff in ???
rm: cannot remove 'results.dat': No such file or directory
ERROR DETECTED
INFO:  Idle: 0,  Running: 1,  Completed: 1 [  0.19s  ] 
INFO:  Idle: 0,  Running: 0,  Completed: 2 [  0.21s  ] 
INFO: End survey 
refine 10000
Creating Jobs
INFO: Refine results to 10000 
INFO: Generating 10000.0 unweighted events. 
Error when reading /data/avalassi/GPU2023/madgraph4gpuX/MG5aMC/mg5amcnlo/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/G3/results.dat
Command "generate_events run_01" interrupted with error:
Exception : Reported error: End code 136.0 
...

Note: 'SIGFPE: Floating-point exception - erroneous arithmetic operation' is what I had also seen once in gqttq, before it disappeared, see #845. Most likely here it is here to stay. I guess this is what should be debugged.

I am very puzzled that you do not see it. Maybe it is because you are on a Mac? Can you try on a Linux box please?

Thanks Andrea

oliviermattelaer commented 1 month ago

So yes, on our Haswell node of the cluster (since this may be hardware specific) it does crash:

Backtrace for this error:
#0  0x7ff7388beb4f in ???
#1  0x7ff739888bc9 in _ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/CPPProcess.cc:1155
#2  0x7ff7398326e5 in GOMP_parallel
    at ../../../libgomp/parallel.c:178
#3  0x7ff739888797 in _ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/CPPProcess.cc:1059
#4  0x7ff739891d0f in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/MatrixElementKernels.cc:115
#5  0x7ff7398944b9 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/Bridge.h:390
#6  0x7ff73989614f in fbridgesequence_
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/fbridge.cc:106
#7  0x42e62e in smatrix1_multi_
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/auto_dsig1.f:574
#8  0x42fd15 in dsig1_vec_
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/auto_dsig1.f:401
#9  0x430aec in dsigproc_vec_
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/auto_dsig.f:1031
#10  0x431510 in dsig_vec_
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/auto_dsig.f:327
#11  0x4449c5 in sample_full_
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/Source/dsample.f:208
#12  0x42dd22 in driver
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/driver.f:256
#13  0x42e16d in main
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/driver.f:301
Floating point exception (core dumped)
valassi commented 1 month ago

I am even more confused, but the issue I see in tmad tests is NOT a SIGFPE.

Essentially, to reproduce outside tmad tests:

In practice:

make -j BACKEND=cppnone
cat > input_susyggt1t1_x1_cudacpp << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
./madevent_cpp < input_susyggt1t1_x1_cudacpp

What I get is an error which ends like this

...
Running Configuration Number:    1
 Not subdividing B.W.
 Attempting mappinvarients           1           4
 Determine nb_t
 T-channel found:            0
 Completed mapping           4
 about to integrate            4        8192           1           1           4           1
 Keeping grid fixed.
  4 dimensions   8192 events  4 invarients  1 iterations  1 config(s),  (0.99)
 Using h-tuple random number sequence.
 Error opening grid
 Using Uniform Grid!          16
 Using uniform alpha   1.0000000000000000     
 Grid defined OK
 Masses: 0.000E+00 0.000E+00 0.400E+03 0.400E+03
 Set CM energy to      13000.00
 Mapping Graph           1  to config           1
 Determine nb_t
 T-channel found:            0
 Transforming s_hat 1/s            3   3.7807079264437879E-003   638939.63956899999        168999999.99999997     
 Error opening symfact.dat. No permutations used.
Using random seed offsets     1 :      1
  with seed                   21
 Ranmar initialization seeds       27402        9395
  Particle       3       4
      Et >     0.0     0.0
       E >     0.0     0.0
     Eta <    -1.0    -1.0
   xqcut:      0.0     0.0
d R # 3  >    -0.0     0.0
s min # 3>     0.0     0.0
xqcutij # 3>     0.0     0.0
 RESET CUMULATIVE VARIABLE
 NGOODHEL =           4
 NCOMB =           4
 MULTI_CHANNEL = TRUE
 CHANNEL_ID =           2
 RESET CUMULATIVE VARIABLE
        4096  points passed the cut but all returned zero
 therefore considering this contribution as zero
 Deleting file events.lhe
valassi commented 1 month ago

So yes, on our Haswell node of the cluster (since this may be hardware specific) it does crash:

Thanks @oliviermattelaer ! We were writing at the same time.

Weird, so MANY issues

(I will not do much more before I leave on holiday in a few days, hopefully you'll find out more in the meantime! thanks)

valassi commented 1 month ago

In practice, go to epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x and then

make -j BACKEND=cppnone
cat > input_susyggt1t1_x1_cudacpp << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
./madevent_cpp < input_susyggt1t1_x1_cudacpp

@oliviermattelaer can you also try this one out of curiosity please? Both from Linux and Mac... thanks

valassi commented 1 month ago

Backtrace for this error:
#0  0x7ff7388beb4f in ???
#1  0x7ff739888bc9 in _ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0
    at /auto/home/users/o/m/omatt/mg5amcnlo/PROC_MSSM_SLHA2_0/SubProcesses/P1_gg_t1t1x/CPPProcess.cc:1155
#2  0x7ff7398326e5 in GOMP_parallel
    at ../../../libgomp/parallel.c:178

And @oliviermattelaer, another question: what is at this line, CPPProcess.cc:1155?

Thanks Andrea

PS In my case lines 1154-1156 are

#if defined MGONGPU_CPPSIMD
            const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
#else
valassi commented 1 month ago

Anyway, I confirm that I also reproduce "SIGFPE: erroneous arithmetic operation" from the repo

I have no idea why I get a SIGFPE crash in this mode, while I get an empty cross section but no crash if I run madevent manually

oliviermattelaer commented 1 month ago

Hi,

yes the line is indeed:

const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );

and no surprise, targetamp[ncolor-1] is here zero...

 ievt 0 , ieppV, 0 , ncolor 2, max 0.000000
 ievt 1 , ieppV, 1 , ncolor 2, max 0.000000
 ievt 2 , ieppV, 2 , ncolor 2, max 0.000000
 ievt 3 , ieppV, 3 , ncolor 2, max 0.000000
 ievt 4 , ieppV, 0 , ncolor 2, max 0.000000
 ievt 5 , ieppV, 1 , ncolor 2, max 0.000000
 ievt 6 , ieppV, 2 , ncolor 2, max 0.000000
 ievt 7 , ieppV, 3 , ncolor 2, max 0.000000
 ievt 8 , ieppV, 0 , ncolor 2, max 0.000000
 ievt 9 , ieppV, 1 , ncolor 2, max 0.000000
 ievt 10 , ieppV, 2 , ncolor 2, max 0.000000
 ievt 11 , ieppV, 3 , ncolor 2, max 0.000000
 ievt 12 , ieppV, 0 , ncolor 2, max 0.000000
 ievt 13 , ieppV, 1 , ncolor 2, max 0.000000
 ievt 14 , ieppV, 2 , ncolor 2, max 0.000000
 ievt 15 , ieppV, 3 , ncolor 2, max 0.000000
 ievt 0 , ieppV, 0 , ncolor 2, max 3.101819
 ievt 1 , ieppV, 1 , ncolor 2, max 3.336748
 ievt 2 , ieppV, 2 , ncolor 2, max 2.629749
 ievt 3 , ieppV, 3 , ncolor 2, max 3.350298
 ievt 4 , ieppV, 0 , ncolor 2, max 4.813758
 ievt 5 , ieppV, 1 , ncolor 2, max 2.866929
 ievt 6 , ieppV, 2 , ncolor 2, max 2.657539
 ievt 7 , ieppV, 3 , ncolor 2, max 4.112587
 ievt 8 , ieppV, 0 , ncolor 2, max 8.225509
 ievt 9 , ieppV, 1 , ncolor 2, max 3.152658
 ievt 10 , ieppV, 2 , ncolor 2, max 2.698144
 ievt 11 , ieppV, 3 , ncolor 2, max 2.709947
 ievt 12 , ieppV, 0 , ncolor 2, max 2.629970
 ievt 13 , ieppV, 1 , ncolor 2, max 2.769736
 ievt 14 , ieppV, 2 , ncolor 2, max 2.622860
 ievt 15 , ieppV, 3 , ncolor 2, max 6.583436
 ievt 0 , ieppV, 0 , ncolor 2, max 0.000000
 ievt 1 , ieppV, 1 , ncolor 2, max 0.000000
 ievt 2 , ieppV, 2 , ncolor 2, max 0.000000
 ievt 3 , ieppV, 3 , ncolor 2, max 0.000000
 ievt 4 , ieppV, 0 , ncolor 2, max 0.000000
 ievt 5 , ieppV, 1 , ncolor 2, max 0.000000
 ievt 6 , ieppV, 2 , ncolor 2, max 0.000000
 ievt 7 , ieppV, 3 , ncolor 2, max 0.000000
 ievt 8 , ieppV, 0 , ncolor 2, max 0.000000
 ievt 9 , ieppV, 1 , ncolor 2, max 0.000000
 ievt 10 , ieppV, 2 , ncolor 2, max 0.000000
 ievt 11 , ieppV, 3 , ncolor 2, max 0.000000
 ievt 12 , ieppV, 0 , ncolor 2, max 0.000000
 ievt 13 , ieppV, 1 , ncolor 2, max 0.000000
 ievt 14 , ieppV, 2 , ncolor 2, max 0.000000
 ievt 15 , ieppV, 3 , ncolor 2, max 0.000000

What surprises/interests me is that it is 0 (or not 0) for a full block of 16 events... which might be a symfact-related issue... And "YES", if I remove the symmetric channel this never happens...
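
For reference, here is a minimal standalone sketch of that failure mode (illustrative only: this is not the actual CPPProcess.cc code, and the guard shown is not a proposed fix). targetamp appears to be cumulative over colours, so targetamp[ncolor-1] is the total; for a block of events where everything is exactly zero, the colour-choice line then computes 0/0, which raises SIGFPE when floating-point traps are enabled:

#include <cstdio>

int main()
{
  constexpr int ncolor = 2;
  // An "all zero" block of events, as in the printout above (max 0.000000)
  double targetamp[ncolor] = { 0., 0. };
  const double rndcol = 0.5; // stand-in for allrndcol[ievt]
  const int icolC = 0;
  // Unguarded division as in the quoted line: 0./0. is an invalid operation
  // (NaN by default, SIGFPE if FP traps are enabled as in the Fortran-driven madevent)
  const double ratio = targetamp[icolC] / targetamp[ncolor - 1];
  std::printf( "unguarded ratio = %f\n", ratio );
  // Purely illustrative guard: skip the colour choice when the total is zero
  const bool okcol = ( targetamp[ncolor - 1] != 0. ) && ( rndcol < targetamp[icolC] / targetamp[ncolor - 1] );
  std::printf( "guarded okcol = %d\n", okcol );
  return 0;
}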

valassi commented 1 month ago

What surprise/interest me is that it is 0 (or not) for a full block of 16 events... which might be a symfact related issue...

Thanks Olivier. I have now rerun with cppnone in the run cards and then reran /bin/generate_events run01 (note: I need the patch in PR #851). I still get the same SIGFPE crash, so I assume that this is NOT a SIMD issue. But I will do more tests.

valassi commented 1 month ago

I have no idea why I get a SIGFPE crash in this mode, while I get an empty cross section but no crash if I run madevent manually

Ok, interesting, I got this one.

Note: my 'tmad' tests always use channel=1. When doing a launch, it launches several processes, including channel 3, which gives the crash.

valassi commented 1 month ago

And gdb on channel 3 tells me that the crash is in the fortran code for phase space sampling?

Program received signal SIGFPE, Arithmetic exception.
0x000000000043809f in rotxxx_ ()
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  0x000000000043809f in rotxxx_ ()
#1  0x0000000000405820 in gentcms_ ()
#2  0x00000000004067b2 in one_tree_ ()
#3  0x0000000000408c72 in gen_mom_ ()
#4  0x000000000040a0aa in x_to_f_arg_ ()
#5  0x0000000000444fe0 in sample_full_ ()
#6  0x000000000042bb39 in MAIN__ ()
#7  0x000000000040371f in main ()
valassi commented 1 month ago

And even more strangely, if I specifically use the Fortran MEs, i.e. madevent_fortran, then

valassi commented 1 month ago

More on the SIGFPE crash in madevent_cudacpp.

If I build the fortran part with -g:

#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check # crashes
GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -g # crashes with debug info

Then (the 'txt3' file is the one that ends with 3):


[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x> gdb ./madevent_cpp
...
(gdb) run < input_app.txt3
...
Setting grid   1    0.94518E-03   1
 Transforming s_hat 1/s            3   3.7807079264437879E-003   638939.63956899999        168999999.99999997     
 Error opening symfact.dat. No permutations used.
Using random seed offsets     3 :      1
  with seed                   57
 Ranmar initialization seeds       11126        9433

Program received signal SIGFPE, Arithmetic exception.
rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
1247              prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
#1  0x0000000000405820 in gentcms (pa=..., pb=..., t=-41214.684204869853, phi=4.7054556515034918, ma2=0, m1=399.66849999999999, 
    m2=399.66849999999999, p1=..., pr=..., jac=1061858298.7999998) at genps.f:1480
#2  0x00000000004067b2 in one_tree (itree=..., tstrategy=<optimized out>, iconfig=3, nbranch=2, p=..., m=..., s=..., x=..., 
    jac=1061858298.7999998, pswgt=1) at genps.f:1167
#3  0x0000000000408c72 in gen_mom (iconfig=3, mincfig=3, maxcfig=3, invar=4, wgt=0.00020000000000000001, x=..., p1=...) at genps.f:68
#4  0x000000000040a0aa in x_to_f_arg (ndim=4, iconfig=3, mincfig=3, maxcfig=3, invar=4, wgt=0.00020000000000000001, x=..., p=...)
    at genps.f:60
#5  0x0000000000444fe0 in sample_full (ndim=4, ncall=1000, itmax=5, itmin=3, dsig=0x430440 <dsig>, ninvar=4, nconfigs=1, 
    vecsize_used=16384) at dsample.f:172
#6  0x000000000042bb39 in driver () at driver.f:256
#7  0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:301
#8  0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#9  0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x0000000000403845 in _start ()
(gdb) print q
$1 = (505.75540315099767, 0, 0, 505.75540315099767)
(gdb) print q(1)
$2 = 0
(gdb) print q
$3 = (505.75540315099767, 0, 0, 505.75540315099767)
(gdb) print qq
$4 = <optimized out>
(gdb) print qt
$5 = <optimized out>
(gdb) print p1
$6 = <optimized out>
(gdb) print p
$7 = (505.75540315099767, -0.28985446571100887, -41.805289511944629, 307.09257834957424)
(gdb) 
$8 = (505.75540315099767, -0.28985446571100887, -41.805289511944629, 307.09257834957424)
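
(A possible reading of these prints, to be confirmed: the printed q = (505.755, 0, 0, 505.755) has q(1) = q(2) = 0, i.e. zero transverse momentum, so qt = sqrt(q(1)**2 + q(2)**2) = 0 and the generic-branch expression at aloha_functions.f:1247 divides by zero. If rotxxx takes a dedicated qt = 0 branch in that case, as in the standard HELAS routine, the crash would only occur when the optimised code evaluates the division anyway, which would be consistent with the effect of the compiler flags discussed below.)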
valassi commented 1 month ago

And, not surprisingly, having -O2 in make_opts means that the SIGFPE disappears. So here we go again, but this time the problem is deep in the Fortran.

#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check # crashes
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -g # crashes with debug info
GLOBAL_FLAG=-O2 -ffast-math -fbounds-check -g # does not crash

In this case

./madevent_cpp < input_app.txt3

completes without a crash, and even gives a nice cross section.

(Note1 for myself: I tried to add gdb in survey.sh, which is where the madevent executable is called. But this is very complex, also because you need an input redirection, and I managed to get my keyboard/screen frozen. Instead: build with 'make -j' normally and then call madevent_cpp; it is easy to reproduce the issues.)
(Note2 for myself: the input_app.txt in SubProcesses/P1* does not have the channel line, and hence fails with a different error. You must add the channel line manually, or just get them from the G1, G2, G3 subdirectories.)

Summary: here we go again, this is a SIGFPE that appears only in some situations; in particular it only appears in optimized code, where it is very difficult to debug. What is new is that this is a SIGFPE in Fortran, not in cudacpp: but strangely enough it only appears if the MEs (another part of the code!) use cudacpp and not Fortran...

Anyway, possible solution? Disable vectorisation in fortran!

#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check # crashes
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -g # crashes with debug info
#GLOBAL_FLAG=-O2 -ffast-math -fbounds-check -g # does not crash
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -mno-sse3 # crashes
GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -fno-tree-vectorize # does not crash
#GLOBAL_FLAG=-O0 -g -fbounds-check # no crash

I tried it, and -fno-tree-vectorize seems to remove the SIGFPE. (So @oliviermattelaer you were kind of right that the issue is SIMD related, but actually in Fortran.)

NB This needs many more checks, because I should check that the cudacpp builds are still ok: they should be, as I think that GLOBAL_FLAG only touches Fortran, not cudacpp (which I think is good).

@oliviermattelaer what do you think, is it ok to add -fno-tree-vectorize to GLOBAL_FLAG if this removes the SIGFPE in Fortran? Thanks

valassi commented 1 month ago

Ouff this is really annoying.

Without -fno-tree-vectorize in GLOBAL FLAGS

With -fno-tree-vectorize (and also -g) in GLOBAL FLAGS

Without gdb

        Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

        Backtrace for this error:
        #0  0x7f9f01a23860 in ???
        #1  0x7f9f01a22a05 in ???
        #2  0x7f9f01654def in ???
        #3  0x7f9f02193169 in ???
        #4  0x7f9f02065575 in ???
        #5  0x7f9f02190eaf in ???
        #6  0x7f9f02194d4d in ???
        #7  0x7f9f0219a3d4 in ???
        #8  0x42e238 in smatrix1_multi_
                at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/auto_dsig1.f:574
        #9  0x42f844 in dsig1_vec_
                at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/auto_dsig1.f:401
        #10  0x4308a7 in dsigproc_vec_
                at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/auto_dsig.f:1031
        #11  0x4315e9 in dsig_vec_
                at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/auto_dsig.f:327
        #12  0x44763a in sample_full_
                at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/dsample.f:208
        #13  0x42cd30 in driver
                at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:256
        #14  0x40370e in main
                at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:301

With gdb (strange, I had to try with and without, and did not always get the same result? ah of course, maybe different random seeds?)

Program received signal SIGFPE, Arithmetic exception.
0x000000000040f3fb in unwgt (px=..., wgt=5.1435160241293966e-05, numproc=1, ihel=2, icol=1, ivec=4) at unwgt.f:257
257                 if (local_twgt .gt. 0) uwgt=uwgt/twgt/fudge
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  0x000000000040f3fb in unwgt (px=..., wgt=5.1435160241293966e-05, numproc=1, ihel=2, icol=1, ivec=4) at unwgt.f:257
#1  0x000000000042f97a in dsig1_vec (all_pp=<error reading variable: value requires 2097152 bytes, which is more than max-value-size>, 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384)
    at auto_dsig1.f:447
#2  0x00000000004308a8 in dsigproc_vec (all_p=..., 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, 
    symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384)
    at auto_dsig.f:1031
#3  0x00000000004315ea in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, 
    all_out=..., vecsize_used=16384) at auto_dsig.f:327
#4  0x000000000044763b in sample_full (ndim=4, ncall=1000, itmax=5, itmin=3, dsig=0x4317e0 <dsig>, ninvar=4, nconfigs=1, 
    vecsize_used=16384) at dsample.f:208
#5  0x000000000042cd31 in driver () at driver.f:256
#6  0x000000000040370f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:301
#7  0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#8  0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#9  0x0000000000403835 in _start ()
...
(gdb) l
252              yran = xran1(idum)
253              if (xwgt .gt. local_twgt*fudge*yran) then
254                 uwgt = max(xwgt,local_twgt*fudge)
255     c           Set sign of uwgt to sign of wgt
256                 uwgt = dsign(uwgt,wgt)
257                 if (local_twgt .gt. 0) uwgt=uwgt/twgt/fudge
258     c            call write_event(p,uwgt)
259     c            write(29,'(2e15.5)') matrix,wgt
260     c $B$ S-COMMENT_C $B$
261                 call write_leshouche(p,uwgt,numproc,.True., ihel, icol, ivec)
(gdb) p twgt
$1 = 0
(gdb) p local_twgt
$2 = <optimized out>

And so again and again variables that are optimized out make debugging really complex.
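
(A possible reading of this gdb session, to be confirmed: the guard on line 257 tests local_twgt, but the division is by twgt, and gdb shows twgt = 0 at the crash point, so uwgt/twgt/fudge is a division by zero regardless of the value of local_twgt.)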

oliviermattelaer commented 1 month ago

Hi Andrea,

Thanks for looking at this. I would say that we should try to avoid "-fno-tree-vectorize", since this is likely hiding the issue more than solving it.

I would say that this process seems to have multiple problems (likely independent):
1) one issue seems to be the coupling ordering in cudacpp, which is not consistent with the Fortran ordering;
2) a second issue seems to be on the Fortran side (or at least related to the Fortran side), where something goes wrong with the handling of the symmetric channel (G3 is also handling G5, and the issue is likely in G5); this makes some weights always zero (not initialised?), which is likely triggering a lot of bad stuff in reaction...

The second one is clearly "for me", since it is clearly related to Fortran (even if it seems to work when using the Fortran matrix element); I will try to reproduce the issue in pure Fortran code (but likely tomorrow).

valassi commented 1 month ago

Hi Olivier, thanks to you!

I would say that this process seems to have multiple problems (likely independent):

Totally agree

1. one issue seems to be the coupling ordering in cudacpp, which is not consistent with the Fortran ordering

This I have not seen, but you and Stefan seem to know more about it. I am happy if you look at that!

2. a second issue seems to be on the Fortran side (or at least related to the Fortran side), where something goes wrong with the handling of the symmetric channel (G3 is also handling G5, and the issue is likely in G5); this makes some weights always zero (not initialised?), which is likely triggering a lot of bad stuff in reaction...

This I also do not understand (maybe it is your correct explanation for what I see, but I am not sure). Again, I am happy if you continue looking at that!

I think I see two issues which are independent from one another, and may or may not be independent from those above.

  1. For channel 3 in this susy_gg_t1t1, I sometimes get SIGFPE crashes in Fortran. Actually I made more tests than those described above, and the only way I can get rid of them is with a global flag of -O1! Which is a lot. I do not understand why, but adding -fno-tree-vectorize only fixes the rotxxx crash; then I also get crashes in unwgt.f (and sometimes smatrix_multi), and the only way I can get rid of them is -O1. This is very difficult to debug, because with -O2 and -O3 some variables are optimized out, so it is just a guess what happens. ALSO, it is weird that this happens only if you do things in a certain order; it may depend on the random seeds (which change on every run of generate_events), or maybe on the vegas grid (or at least, I saw that if I run channel 3 alone I get different things than running channel 1 and then 3). In summary, a complete mess to debug. My suggestion for this is: let's sort out the other issues, maybe the SIGFPE is caused by them and will disappear.

  2. The other issue, which is independent from the SIGFPE, is the original problem, that I was seeing no cross section. Once I use -O1 and see no SIGFPE crashes, I still get no cross section, but only in channel 1. So here is my question for you: could it be that channel 1 in susy_gg_t1t1 is suppressed? Should I just use another channel? Or is this related to your points 1 and 2 above, some ordering that is wrong? I am asking because, for ALL other processes, I was always able to use channel 1 for some basic tests. So I am surprised that here it does not work (I thought channel 1 was 'special' and was guaranteed to always be non-suppressed in a way). Any suggestion there?

Anyway, this time I really leave it here, probably will resume end of June. Thanks Andrea

oliviermattelaer commented 1 month ago

Ok found the issue (thanks @roiser for the help).

The issue here is that we have 6 amplitudes (so the channel number can go from 1 to 6), but one amplitude does not have a channel associated to it (so there are only 5 channels). The information about which colours can be considered or not is therefore of length 5, but the code tries to read the sixth entry... which means that we get random behaviour.

Cheers,

Olivier
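
Schematically, the pattern described above looks like the following minimal standalone sketch (illustrative only, hypothetical names, not the generated code): a per-channel colour-information array of length 5 is indexed with a channel number that can reach 6, so the last lookup reads past the end of the array, which is undefined behaviour and hence "random behaviour":

#include <array>
#include <cstdio>

int main()
{
  constexpr int namp = 6;      // six amplitudes, so channel numbers run from 1 to 6
  constexpr int nchannel = 5;  // but one amplitude has no channel associated to it
  // Per-channel "can this colour be considered" information (length 5 only)
  const std::array<bool, nchannel> colorOk = { true, true, false, true, true };
  for( int ichan = 1; ichan <= namp; ichan++ ) // 1-based channel numbers as in the text above
  {
    if( ichan > nchannel )
    {
      // Without this guard, colorOk[ichan-1] would read past the end of the array
      std::printf( "channel %d has no entry, skipping\n", ichan );
      continue;
    }
    std::printf( "channel %d -> colorOk = %d\n", ichan, (int)colorOk[ichan - 1] );
  }
  return 0;
}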

valassi commented 1 month ago

I have moved the SIGFPE crash analysis to #855.

I would keep this #826 only for the original issue: an empty cross section in iconfig=1 for susy_gg_t1t1.

valassi commented 1 month ago

Hi,

yes the line is indeed:

const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );

Hi @oliviermattelaer I am going through all your posts again. I think that what you see here looks exactly like #845. From my gdb

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98d6f in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, 
    allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, 
    allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1189
1189                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );

I suggest that we refer to the zero cross section as 826, and the crash as 845, ok? Maybe they are related, maybe they are not...

And this crash 845 is most likely related to the color mismatch #856

valassi commented 1 month ago

About

Summary: here we go again, this is a SIGFPE that appears only in some situations, in particular it only appears in optimized code where it is very difficult to debug it. What is new is that this is a SIGFPE in Fortran, not in cudacpp: but strangely enough only appears if the MEs (another part of the code!) uses cudacpp and not fortran...

Anyway, possible solution? Disable vectorisation in fortran!

#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check # crashes
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -g # crashes with debug info
#GLOBAL_FLAG=-O2 -ffast-math -fbounds-check -g # does not crash
#GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -mno-sse3 # crashes
GLOBAL_FLAG=-O3 -ffast-math -fbounds-check -fno-tree-vectorize # does not crash
#GLOBAL_FLAG=-O0 -g -fbounds-check # no crash

and

I think I see two issues which are independent from one another, and may or may not be independent from those above.

3. For channel 3 in this susy_gg_t1t1, I sometimes get SIGFPE crashes in Fortran. Actually I made more tests than those described above, and the only way I can get rid of them is with a global flag of -O1!

This is a DIFFERENT CRASH (#855, in fortran rotxxx) than that in #845 (in cudacpp sigmaKin). And it is what I suggest to fix adding volatile in #857.

valassi commented 1 month ago

I would say that this process seems to have multiple problems (likely independent):

About this point and the four mentioned above, again I agree, trying to summarise

  1. one issue seems to be the coupling ordering in cudacpp, which is not consistent with the Fortran ordering

I do not see this in my tests, can you give me a reproducer please?

  2. a second issue seems to be on the Fortran side (or at least related to the Fortran side), where something goes wrong with the handling of the symmetric channel (G3 is also handling G5, and the issue is likely in G5); this makes some weights always zero (not initialised?), which is likely triggering a lot of bad stuff in reaction...

Same thing, I do not see this in my tests, can you give me a reproducer please?

About the "not initialised", note that valgrind in one of my tests reported an unitialised variable. Try to use valgrind maybe?

  3. For channel 3 in this susy_gg_t1t1, I sometimes get SIGFPE crashes in Fortran.

This is confirmed as #855 and can be fixed with volatile in #857

  4. The other issue that is independent from SIGFPE is the original problem, that I was seeing no cross section.

This is still the present #826.

In addition, related to iconfig-channel mapping, as discussed in #853 and #852

"5a." There is a crash in sigmakin color choice. I moved this to #845.

"5b". There is a color mismatch in LHE tests #856. But it is most likely related to #845 above in "5a".

oliviermattelaer commented 1 month ago

Hi Andrea,

I think that the code behaves strangely due to an out-of-bounds issue (in the CPP part). That out-of-bounds access can corrupt some memory, making the code (even the Fortran part) crash in some random way (compiler-flag and machine specific ways of crashing -> not reproducible). Since I/we have identified one such issue, I think the first thing that we should do is to fix that one and then re-investigate the other issues (for me #852 fixes all the crashing issues).

So for me, the priority is to merge #852. I will work on it now to include more comments on it, as you want. Then after that my priority would be to understand which variable is reported as uninitialised by valgrind, since this will be a good hint of where the next issue is.

valassi commented 4 weeks ago

Hi Andrea,

I think that the code behaves strangely due to an out-of-bounds issue (in the CPP part). That out-of-bounds access can corrupt some memory, making the code (even the Fortran part) crash in some random way (compiler-flag and machine specific ways of crashing -> not reproducible). Since I/we have identified one such issue, I think the first thing that we should do is to fix that one and then re-investigate the other issues (for me #852 fixes all the crashing issues).

So for me, the priority is to merge #852. I will work on it now to include more comments on it, as you want. Then after that my priority would be to understand which variable is reported as uninitialised by valgrind, since this will be a good hint of where the next issue is.

Hi Olivier, thanks for the message and sorry for not replying on this before.

Just replying here to try and summarise the various directions of work that this issue 826 triggered somehow. For me this was the work of this last (very busy) week.

Voilà, this seems like a good summary of this week, replying to your points above.

Summary of the todo summary

valassi commented 5 days ago

This specific issue #826, about a zero cross section in a SUSY process, is fixed by PR #918 (thanks @oliviermattelaer @roiser). Indeed it is caused by the couplings ordering issue #862.

Closing this as fixed by PR #918 (to be merged soon). Code regenerated in #934.

valassi commented 3 days ago

This is also related to #748 (xsec mismatch in gqttq, which was also due to the order of couplings).