madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

tmad test crashes in rotxxx (SIGFPE erroneous arithmetic operation) #855

Closed: valassi closed this issue 1 week ago

valassi commented 1 month ago

"tmad test crashes for some iconfig (channel/iconfig mapping issues and SIGFPE erroneous arithmetic operation)"

Hi @oliviermattelaer, this is a follow-up to the discussions in #826 and PR #853.

I prefer to open this as a clean issue and investigate it independently of SUSY, and in any case independently of the zero cross section issue #826.

From those discussions and your patch #853, I realised that we risk having a MAJOR problem not only for BSM but also for SM, namely: all of my 'tmad' tests only test iconfig=1. These were ok so far (in some cases maybe by luck), but things may break for a different iconfig, i.e. if we put a number different from 1 in the input_app.txt that is piped to madevent.

Indeed I found a crash on the first test I executed, ggttgg with iconfig=104.

 ./tmad/madX.sh -ggttgg -iconfig 104
...
On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
Working directory (run): /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg

*** (1) EXECUTE MADEVENT_FORTRAN (create results.dat) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 64/64
 [XSECTION] VECSIZE_USED = 8192
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 104
 [XSECTION] ChannelId = 112
 [XSECTION] Cross section = 0.4632 [0.46320556621222242] fbridge_mode=0
 [UNWEIGHT] Wrote 11 events (found 187 events)
 [COUNTERS] PROGRAM TOTAL          :    4.4430s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2478s
 [COUNTERS] Fortran MEs      ( 1 ) :    4.1953s for     8192 events => throughput is 1.95E+03 events/s

*** (1) EXECUTE MADEVENT_FORTRAN x1 (create events.lhe) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 64/64
 [XSECTION] VECSIZE_USED = 8192
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 104
 [XSECTION] ChannelId = 112
 [XSECTION] Cross section = 0.4632 [0.46320556621222242] fbridge_mode=0
 [UNWEIGHT] Wrote 11 events (found 168 events)
 [COUNTERS] PROGRAM TOTAL          :    4.4488s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2487s
 [COUNTERS] Fortran MEs      ( 1 ) :    4.2002s for     8192 events => throughput is 1.95E+03 events/s

*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7effbd423860 in ???
#1  0x7effbd422a05 in ???
#2  0x7effbd054def in ???
#3  0x44b5ff in ???
#4  0x4087df in ???
#5  0x409848 in ???
#6  0x40bb83 in ???
#7  0x40d1a9 in ???
#8  0x45c804 in ???
#9  0x434269 in ???
#10  0x40371e in ???
#11  0x7effbd03feaf in ???
#12  0x7effbd03ff5f in ???
#13  0x403844 in ???
#14  0xffffffffffffffff in ???
./tmad/madX.sh: line 387: 780951 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed

This uses a slightly modified script; I will put it in a PR.

I guess that the solution goes through what you proposed in #852 and the additional modifications you and I discussed there.

(Note: the 'tlau' tests that I proposed in July last year, just before my absence, were supposed to test exactly this (see #711), i.e. test all possible iconfig at the same time, in a user-like environment, for all processes, but in a short manageable time. I continue to think that allowing the possibility to run shorter generate_events tests is necessary for better testing. There was disagreement last year; I hope we can come back and agree on this.)

valassi commented 1 month ago

In https://github.com/madgraph5/madgraph4gpu/pull/852#issuecomment-2143014502 Olivier suggested "you/we should compile with the C equivalent of -fbounds-check which is super usefull to spot segfault who by definition are hardware specific". I had a look but I am not sure there is an equivalent.
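
(For the record, the closest C/C++ analogue I know of is probably the compiler sanitizers rather than a single flag: gcc and clang support -fsanitize=address and -fsanitize=undefined, which add run-time bounds and undefined-behaviour checks. Hypothetically these could be passed through the same GLOBAL_FLAG in make_opts that I modify below, e.g.

GLOBAL_FLAG=-O3 -g -ffast-math -fsanitize=address,undefined

but I have not tried that here.)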

Instead I have run valgrind, and this is interesting. Here is a reproducer which mimics the tmad test above, but without going through the tmad scripts:

cd gg_ttgg.mad/SubProcesses/P1_gg_ttxgg
make cleanall
make -j BACKEND=cppnone -f cudacpp.mk debug
make -j BACKEND=cppnone
cat > input_cudacpp_104 << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
104 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
./madevent_cpp < input_cudacpp_104
valgrind ./madevent_cpp < input_cudacpp_104
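
(A note on the valgrind invocation: adding --track-origins=yes would also show where the uninitialised values reported below were created, at the cost of a much slower run; the output shown here is from a plain valgrind run.)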

The valgrind output includes things like

...
==794089== Conditional jump or move depends on uninitialised value(s)
==794089==    at 0x426F03: setclscales_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x429569: update_scale_coupling_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x438857: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089== 
==794089== Warning: client switching stacks?  SP change: 0x1ffeffeeb8 --> 0x1ffec3eb80
==794089==          to suppress, use: --max-stackframe=3932984 or greater
==794089== Invalid write of size 8
==794089==    at 0x4366D4: dsig1_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x437C97: dsigproc_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x4388A7: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==  Address 0x1ffec3eba8 is on thread 1's stack
==794089==  in frame #0, created by dsig1_vec_ (???:)
==794089== 
==794089== Invalid write of size 8
==794089==    at 0x4366D9: dsig1_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x437C97: dsigproc_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x4388A7: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==  Address 0x1ffec3ebb0 is on thread 1's stack
==794089==  in frame #0, created by dsig1_vec_ (???:)
...
==794089== Invalid read of size 4
==794089==    at 0x436AE5: dsig1_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x437C97: dsigproc_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x4388A7: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==  Address 0x1ffec3ebcc is on thread 1's stack
==794089==  in frame #0, created by dsig1_vec_ (???:)
...
==794089== Invalid read of size 8
==794089==    at 0x6E032EF: memmove (vg_replace_strmem.c:1385)
==794089==    by 0x6E6D811: mg5amcCpu::Bridge<double>::cpu_sequence(double const*, double const*, double const*, double const*, unsigned int, double*, int*, int*, bool) (Bridge.h:376)
==794089==    by 0x6E6F37B: fbridgesequence_ (fbridge.cc:106)
==794089==    by 0x6E6F3F2: fbridgesequence_nomultichannel_ (fbridge.cc:132)
==794089==    by 0x4358D9: smatrix1_multi_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x436C74: dsig1_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x437C97: dsigproc_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x4388A7: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==  Address 0x1ffec7eec8 is on thread 1's stack
==794089==  in frame #5, created by dsig1_vec_ (???:)
...

Also I have rebuilt with -O3 -g in make_opts:

(diff of epochX/cudacpp/gg_ttgg.mad/Source/make_opts)
4c4,5
< GLOBAL_FLAG=-O3 -ffast-math -fbounds-check
---
> ###GLOBAL_FLAG=-O3 -ffast-math -fbounds-check
> GLOBAL_FLAG=-O3 -g -ffast-math -fbounds-check

The crash now prints out where it happens: it is in rotxxx

Setting grid   1    0.17709E-03   1
Setting grid   2    0.17709E-03   1
Setting grid   3    0.22041E-03   1
 Transforming s_hat 1/s            9   8.8163313609467475E-004   119716.00000000000        168999999.99999997     
 Error opening symfact.dat. No permutations used.
Using random seed offsets   104 :      1
  with seed                   21
 Ranmar initialization seeds       27505        9395

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f6471c23860 in ???
#1  0x7f6471c22a05 in ???
#2  0x7f6471854def in ???
#3  0x44b5ff in rotxxx_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f:1247
#4  0x4087df in gentcms_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1480
#5  0x409848 in one_tree_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1167
#6  0x40bb83 in gen_mom_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:68
#7  0x40d1a9 in x_to_f_arg_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:60
#8  0x45c804 in sample_full_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/dsample.f:172
#9  0x434269 in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:256
#10  0x40371e in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:301
Floating point exception (core dumped)

Note, rotxxx is also what I had already found in the susy tests: https://github.com/madgraph5/madgraph4gpu/issues/826#issuecomment-2139578630

valassi commented 1 month ago

As discussed in #826, this is again a weird optimization issue: gdb gives

Program received signal SIGFPE, Arithmetic exception.
rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
1247              prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) p qq qt p1
A syntax error in expression, near `qt p1'.
(gdb) p qq
$1 = <optimized out>
(gdb) p qt
$2 = <optimized out>
(gdb) p p1
$3 = <optimized out>

This was with -O3 -g. If I use lower optimization levels, the issue disappears.
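
To document my reading of the problem (an interpretation, to be confirmed): the division at aloha_functions.f:1247 sits in a branch that should only be reached when the transverse momentum of the reference vector q is nonzero, but -ffast-math tells the compiler that floating-point operations do not trap, so at -O3 it may hoist or speculatively evaluate the divisions and the guard no longer protects them. A minimal sketch of the pattern, simplified and hypothetical (NOT the literal rotxxx source):

c     Hypothetical, simplified sketch of the guarded-division pattern
c     around aloha_functions.f:1247 (NOT the literal rotxxx source).
c     With -O3 -ffast-math the compiler may hoist or speculate the
c     divisions by qt and qq even when the qt2.eq.0 branch is taken,
c     raising SIGFPE in builds where floating-point trapping is on.
      subroutine rotsketch(p, q, prot)
      implicit none
      double precision p(0:3), q(0:3), prot(0:3)
      double precision qt2, qt, qq, p1
c     volatile qt, p1, qq  ! the workaround tried below and in #857
      prot(0) = p(0)
      qt2 = q(1)**2 + q(2)**2
      if ( qt2.eq.0d0 ) then
c     (q along the z axis: no transverse rotation; the sign handling
c     of the real code is omitted in this sketch)
         prot(1) = p(1)
         prot(2) = p(2)
         prot(3) = p(3)
      else
         qq = sqrt(qt2+q(3)**2)
         qt = sqrt(qt2)
         p1 = p(1)
c     (the divisions below are only safe because qt2 > 0 here)
         prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
         prot(2) = q(2)*q(3)/qq/qt*p1 +q(1)/qt*p(2) +q(2)/qq*p(3)
         prot(3) = -qt/qq*p1 +q(3)/qq*p(3)
      endif
      return
      end

Compiled at lower optimization levels the guard works as written, which would explain why the issue disappears there.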

As I have done with many SIGFPEs in cudacpp, I tried adding volatile:

--- a/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f
+++ b/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f
@@ -1201,7 +1201,7 @@ c       real    prot(0:3)      : four-momentum p in the rotated frame
 c
       implicit none
       double precision p(0:3),q(0:3),prot(0:3),qt2,qt,psgn,qq,p1
-
+      volatile qt, p1, qq
       double precision rZero, rOne
       parameter( rZero = 0.0d0, rOne = 1.0d0 )

Strangely enough, this prevents the SIGFPE. But now the code seems stuck in an infinite loop?
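
(My tentative reading of why volatile helps, to be confirmed: volatile forces qt, p1 and qq through memory and forbids the compiler from hoisting or speculating the divisions that use them, so the protection of the qt2.eq.0 guard is effectively restored.)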

valassi commented 1 month ago

I tried cuda to make it faster.

Again something strange: the code crashes without valgrind but does not crash with valgrind... (NB: this is WITHOUT volatile.)

cd gg_ttgg.mad/SubProcesses/P1_gg_ttxgg
make cleanall
make -j BACKEND=cuda -f cudacpp.mk debug
make -j BACKEND=cuda
cat > input_cudacpp_104 << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
104 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
./madevent_cuda < input_cudacpp_104
valgrind ./madevent_cuda < input_cudacpp_104
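
(A side note: a crash disappearing under valgrind is not too surprising, since valgrind runs the program on its own synthetic CPU and does not reproduce floating-point trapping behaviour exactly, so SIGFPE heisenbugs like this one are a known pattern. Also, valgrind only instruments the host code; for the device side, the usual analogue would be NVIDIA's compute-sanitizer, the successor of cuda-memcheck, e.g. compute-sanitizer ./madevent_cuda < input_cudacpp_104, which I have not tried here.)
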
valassi commented 1 month ago

Ok. In the cuda version, adding volatile in the Fortran removes SIGFPE and allows the program to reach the end.

So IS THIS A POSSIBLE FIX?

With cpp maybe I just needed to wait? Or is this going slower? I will try to rerun more tests and leave them running.

(In the meantime I will also try the susy_gg_t1t1 channel which in the past seemed problematic with SIGFPE).

valassi commented 1 month ago

This is fixed by #857 by adding volatile, as I had done for similar SIGFPEs in cudacpp.

valassi commented 1 month ago

I completed my tests in PR #857 and I confirm that it fixes this issue; closing.

valassi commented 2 weeks ago

Reopening until PR #857 is merged - or until this is otherwise clarified

valassi commented 1 week ago

I changed the name of this issue to indicate that it is ONLY about rotxxx crashes. These can be fixed using 'volatile', as in PR #857 and https://github.com/mg5amcnlo/mg5amcnlo/pull/113.

Conversely, I removed "channel/iconfig mapping issues" from the name of this issue: those channel/iconfig mapping issues are behind the LHE mismatch #856 and possibly behind the intermittent sigmakin crash #845.

valassi commented 5 days ago

Note: there is a crash #885 in master_june40 that I thought was related to this, but it is most likely unrelated (and is instead specific to master_june40).