madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

master_june24: Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3 #888

Open valassi opened 5 days ago

valassi commented 5 days ago

Another issue introduced in #830 and being reviewed in #882.

In WIP PR #882 for master_june24, I tried to use NB_WARP=512 and WARP_SIZE=32, i.e. VECSIZE_MEMMAX=16384. This is commit https://github.com/madgraph5/madgraph4gpu/pull/882/commits/bede049e491a2aaedab10e9397ea4253fcd9df8b

In the CI tmad tests (which use VECSIZE_USED=32) I still get the crash of #885, but I also get the following: https://github.com/madgraph5/madgraph4gpu/actions/runs/9806731881/job/27079146521

*** (1) EXECUTE MADEVENT_FORTRAN (create results.dat) ***
At line 412 of file auto_dsig1.f
Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3

Error termination. Backtrace:
#0  0x7f74b5a23960 in ???
#1  0x7f74b5a244d9 in ???
#2  0x55edd8ae6fd9 in dsig1_vec_
#3  0x55edd8ae7de8 in dsigproc_vec_
#4  0x55edd8ae88e3 in dsig_vec_
#5  0x55edd8afec68 in sample_full_
#6  0x55edd8ae4cbd in MAIN__
#7  0x55edd8abc69e in main
ERROR! ' ./madevent_fortran < /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/input_gg_tt_ > /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/output_gg_tt_' failed
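
When triaging several CI logs like the one above, it can help to pull the array name, offending index and bound out of the gfortran bounds-check message automatically. The following is only a minimal sketch: the helper name `parse_bounds_error` is hypothetical, the regular expression is modelled on the single message shown in this issue, and the "below lower bound" variant is an assumption about how gfortran words the opposite case.

```python
import re

# Pattern modelled on the message seen in this issue:
#   Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3
# The "below lower" alternative is an assumption, not taken from the log.
BOUNDS_RE = re.compile(
    r"Index '(?P<index>-?\d+)' of dimension (?P<dim>\d+) of array "
    r"'(?P<array>\w+)' (?P<side>above upper|below lower) bound of (?P<bound>-?\d+)"
)


def parse_bounds_error(line):
    """Return the fields of a bounds-check error line, or None if it does not match."""
    match = BOUNDS_RE.search(line)
    if match is None:
        return None
    return {
        "array": match.group("array"),
        "dim": int(match.group("dim")),
        "index": int(match.group("index")),
        "side": match.group("side").split()[0],  # "above" or "below"
        "bound": int(match.group("bound")),
    }


if __name__ == "__main__":
    log_line = (
        "Fortran runtime error: Index '32765' of dimension 1 "
        "of array 'symconf' above upper bound of 3"
    )
    print(parse_bounds_error(log_line))
```

An index as large as 32765 against an upper bound of 3 is far outside the valid range, so extracting it this way only helps to compare runs; it does not by itself identify where the bad value comes from.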

For reference, with the previous values NB_WARP=1, WARP_SIZE=16384, VECSIZE_MEMMAX=16384 (and always VECSIZE_USED=32), this was commit https://github.com/madgraph5/madgraph4gpu/pull/882/commits/64a7c0dda556ecdde6c872c43620b863efcd5ccc and I was getting no such 'Fortran runtime error in symconf', only the SIGFPE crash: https://github.com/madgraph5/madgraph4gpu/actions/runs/9797840410/job/27055291574#step:12:77

*** (2-none) EXECUTE MADEVENT_CPP xQUICK (create events.lhe) ***

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7ff0a5423960 in ???
#1  0x7ff0a5422ac5 in ???
#2  0x7ff0a504251f in ???
#3  0x556bad8564aa in dsig1_vec_
#4  0x556bad857509 in dsigproc_vec_
#5  0x556bad8582b2 in dsig_vec_
#6  0x556bad86e5de in sample_full_
#7  0x556bad853d2a in MAIN__
#8  0x556bad82b6de in main
.github/workflows/testsuite_oneprocess.sh: line 289:  3672 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/input_gg_tt_none > /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/output_gg_tt_none' failed

roiser commented 5 days ago

Hi, I just looked at the tests I did at the time. I set, for example,

set vector_size 32
set nb_warp 256

which then gave me a vector width of 8192 (32 x 256). Note that this was when testing with configs passed into bin/mg5_aMC.
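
The numbers quoted in this thread are consistent with the total vector width being the product of the warp size and the number of warps. As a minimal sketch of that arithmetic (the function names are hypothetical, and the product relation is an assumption inferred from the values quoted here, not taken from the generator code):

```python
def vector_width(warp_size, nb_warp):
    """Total vector width, assuming it is simply warp_size * nb_warp."""
    if warp_size <= 0 or nb_warp <= 0:
        raise ValueError("warp_size and nb_warp must be positive")
    return warp_size * nb_warp


def implied_warp_size(vecsize_memmax, nb_warp):
    """Warp size implied by a total width and a warp count, under the same assumption."""
    if nb_warp <= 0 or vecsize_memmax % nb_warp != 0:
        raise ValueError("vecsize_memmax must be a positive multiple of nb_warp")
    return vecsize_memmax // nb_warp


if __name__ == "__main__":
    # vector_size 32 and nb_warp 256, as in the settings above
    print(vector_width(32, 256))
    # NB_WARP=1 and WARP_SIZE=16384, the earlier configuration in PR #882
    print(vector_width(16384, 1))
    # Warp size implied by NB_WARP=512 and VECSIZE_MEMMAX=16384
    print(implied_warp_size(16384, 512))
```

Under this assumption, the settings above give 32 x 256 = 8192, and a total width of 16384 can be reached either as 1 x 16384 or as 512 x 32.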