madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

valgrind issues #868

Closed valassi closed 1 week ago

valassi commented 2 weeks ago

Since we are not yet converging on some issues like the rotxxx segfault #855 and the possible fix with volatile in #857, in parallel I am also running some checks with valgrind.

On the current upstream/master 286280fa4e4b32c12dd35fbb4bcf75d9930d4582, using the same code I used for the #855 reproducer described in https://github.com/madgraph5/madgraph4gpu/pull/857#issuecomment-2187007215 , I am running valgrind, to start with on the Fortran executable ONLY.

cd madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg
make cleanall
make -f cudacpp.mk gtestlibs
make -j BACKEND=cppnone -f cudacpp.mk debug
make -j BACKEND=cppnone
cat > input_cudacpp_104 << EOF
32 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
104 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
gdb -ex 'run < input_cudacpp_104' -ex where -ex 'set confirm off' -ex quit ./madevent_cpp

Note that 32 events are enough: the cpp executable still crashes.

I then run valgrind as follows:

valgrind --leak-check=full --gen-suppressions=all --log-file=memcheck.log ./madevent_fortran < input_cudacpp_104

This gives the following report:

more memcheck.log 
==3678768== Memcheck, a memory error detector
==3678768== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3678768== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3678768== Command: ./madevent_fortran
==3678768== Parent PID: 3638452
==3678768== 
==3678768== Conditional jump or move depends on uninitialised value(s)
==3678768==    at 0x425E73: setclscales_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x4284D9: update_scale_coupling_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x436B87: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x45AFAA: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x4331AD: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x40268E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:setclscales_
   fun:update_scale_coupling_vec_
   fun:dsig_vec_
   fun:sample_full_
   fun:MAIN__
   fun:main
}
==3678768== 
==3678768== HEAP SUMMARY:
==3678768==     in use at exit: 552 bytes in 3 blocks
==3678768==   total heap usage: 137,537 allocs, 137,534 frees, 48,779,094 bytes allocated
==3678768== 
==3678768== 544 (32 direct, 512 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3
==3678768==    at 0x719786F: malloc (vg_replace_malloc.c:381)
==3678768==    by 0x7404CC8: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3678768==    by 0x7647C63: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3678768==    by 0x7635319: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3678768==    by 0x76357FC: _gfortran_st_open (in /usr/lib64/libgfortran.so.5.0.0)
==3678768==    by 0x47AA9F: open_file_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x432C91: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768==    by 0x40268E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_fortran)
==3678768== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   fun:_gfortran_st_open
   fun:open_file_
   fun:MAIN__
   fun:main
}
==3678768== LEAK SUMMARY:
==3678768==    definitely lost: 32 bytes in 1 blocks
==3678768==    indirectly lost: 512 bytes in 1 blocks
==3678768==      possibly lost: 0 bytes in 0 blocks
==3678768==    still reachable: 8 bytes in 1 blocks
==3678768==         suppressed: 0 bytes in 0 blocks
==3678768== Reachable blocks (those to which a pointer was found) are not shown.
==3678768== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3678768== 
==3678768== Use --track-origins=yes to see where uninitialised values come from
==3678768== For lists of detected and suppressed errors, rerun with: -s
==3678768== ERROR SUMMARY: 26058 errors from 2 contexts (suppressed: 0 from 0)

valassi commented 2 weeks ago

Adding -g to GLOBAL_FLAGS gives slightly more detail (source file names and line numbers):

==3682257== Memcheck, a memory error detector
==3682257== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3682257== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3682257== Command: ./madevent_fortran
==3682257== Parent PID: 3638452
==3682257== 
==3682257== Conditional jump or move depends on uninitialised value(s)
==3682257==    at 0x425E73: setclscales_ (reweight.f:1230)
==3682257==    by 0x4284D9: update_scale_coupling_vec_ (reweight.f:1876)
==3682257==    by 0x436B87: dsig_vec_ (auto_dsig.f:316)
==3682257==    by 0x45AFAA: sample_full_ (dsample.f:208)
==3682257==    by 0x4331AD: MAIN__ (driver.f:256)
==3682257==    by 0x40268E: main (driver.f:301)
==3682257== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:setclscales_
   fun:update_scale_coupling_vec_
   fun:dsig_vec_
   fun:sample_full_
   fun:MAIN__
   fun:main
}
==3682257== 
==3682257== HEAP SUMMARY:
==3682257==     in use at exit: 552 bytes in 3 blocks
==3682257==   total heap usage: 137,537 allocs, 137,534 frees, 48,779,094 bytes allocated
==3682257== 
==3682257== 544 (32 direct, 512 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3
==3682257==    at 0x719786F: malloc (vg_replace_malloc.c:381)
==3682257==    by 0x7404CC8: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x7647C63: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x7635319: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x76357FC: _gfortran_st_open (in /usr/lib64/libgfortran.so.5.0.0)
==3682257==    by 0x47AA9F: open_file_ (open_file.f:40)
==3682257==    by 0x432C91: MAIN__ (driver.f:151)
==3682257==    by 0x40268E: main (driver.f:301)
==3682257== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   fun:_gfortran_st_open
   fun:open_file_
   fun:MAIN__
   fun:main
}
==3682257== LEAK SUMMARY:
==3682257==    definitely lost: 32 bytes in 1 blocks
==3682257==    indirectly lost: 512 bytes in 1 blocks
==3682257==      possibly lost: 0 bytes in 0 blocks
==3682257==    still reachable: 8 bytes in 1 blocks
==3682257==         suppressed: 0 bytes in 0 blocks
==3682257== Reachable blocks (those to which a pointer was found) are not shown.
==3682257== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3682257== 
==3682257== Use --track-origins=yes to see where uninitialised values come from
==3682257== For lists of detected and suppressed errors, rerun with: -s
==3682257== ERROR SUMMARY: 26058 errors from 2 contexts (suppressed: 0 from 0)

valassi commented 2 weeks ago

The leak is reported in https://github.com/mg5amcnlo/mg5amcnlo/issues/109
A fix for the leak is in https://github.com/mg5amcnlo/mg5amcnlo/pull/110
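
Where that fix is not yet available in a checkout, the leak report could be hidden with a suppression file built from the block that valgrind generated with --gen-suppressions=all. This is only a minimal sketch: the file name madevent_fortran.supp, the suppression name gfortran_st_open_leak and the log name memcheck_supp.log are hypothetical choices, not something used in the tests above.

```shell
# Write the generated leak suppression to a file (names are illustrative)
cat > madevent_fortran.supp << 'EOF'
{
   gfortran_st_open_leak
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   fun:_gfortran_st_open
   fun:open_file_
   fun:MAIN__
   fun:main
}
EOF

# Rerun with the suppression, only where valgrind and the executable exist
if command -v valgrind > /dev/null && [ -x ./madevent_fortran ]; then
  valgrind --leak-check=full --suppressions=madevent_fortran.supp \
    --log-file=memcheck_supp.log ./madevent_fortran < input_cudacpp_104
fi
```

With the leak suppressed, the remaining contexts in the log would be easier to read, but a suppression only hides the report and does not replace the actual fix.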

valassi commented 2 weeks ago

(Note: this is one example of #207 about testing code through valgrind)

valassi commented 2 weeks ago

The uninitialised value is reported in https://github.com/mg5amcnlo/mg5amcnlo/issues/111
A workaround (not a real fix) is in https://github.com/mg5amcnlo/mg5amcnlo/pull/112

After applying both patches above, valgrind reports no issues in madevent_fortran:

==3735418== Memcheck, a memory error detector
==3735418== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3735418== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3735418== Command: ./madevent_fortran
==3735418== Parent PID: 3638452
==3735418== 
==3735418== 
==3735418== HEAP SUMMARY:
==3735418==     in use at exit: 8 bytes in 1 blocks
==3735418==   total heap usage: 141,695 allocs, 141,694 frees, 50,553,628 bytes allocated
==3735418== 
==3735418== LEAK SUMMARY:
==3735418==    definitely lost: 0 bytes in 0 blocks
==3735418==    indirectly lost: 0 bytes in 0 blocks
==3735418==      possibly lost: 0 bytes in 0 blocks
==3735418==    still reachable: 8 bytes in 1 blocks
==3735418==         suppressed: 0 bytes in 0 blocks
==3735418== Rerun with --leak-check=full to see details of leaked memory
==3735418== 
==3735418== For lists of detected and suppressed errors, rerun with: -s
==3735418== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
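
For repeated runs, the comparison between the 26058 errors seen before the patches and the 0 errors seen after them could be automated by reading the ERROR SUMMARY line of the log. The helper names memcheck_error_count and memcheck_is_clean below are hypothetical and this is only a sketch, assuming the log format shown above.

```shell
# Print the error count from the last "ERROR SUMMARY" line of a memcheck log
memcheck_error_count() {
  sed -n 's/^==[0-9]*== ERROR SUMMARY: \([0-9][0-9]*\) errors.*/\1/p' "$1" | tail -n 1
}

# Succeed only if the log reports exactly zero errors
# (a missing or truncated log counts as not clean)
memcheck_is_clean() {
  [ "$(memcheck_error_count "$1")" = "0" ]
}
```

A possible use is `memcheck_is_clean memcheck.log && echo "no valgrind errors"`.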

Note that the rotxxx crash is still there in madevent_cpp, however:

gdb -ex 'run < input_cudacpp_104' -ex where -ex 'set confirm off' -ex quit ./madevent_cpp
...
Program received signal SIGFPE, Arithmetic exception.
rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
1247              prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
#0  rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
#1  0x00000000004087e0 in gentcms (pa=..., pb=..., t=-181765.47706865534, phi=0.64468537567405615, ma2=0, m1=234.1712866912786, 
    m2=210.15563843880372, p1=..., pr=..., jac=3.0327734872026782e+25) at genps.f:1480
#2  0x0000000000409849 in one_tree (itree=..., tstrategy=<optimized out>, iconfig=104, nbranch=4, p=..., m=..., s=..., x=..., 
    jac=3.0327734872026782e+25, pswgt=1) at genps.f:1167
#3  0x000000000040bb84 in gen_mom (iconfig=104, mincfig=104, maxcfig=104, invar=10, wgt=0.03125, x=..., p1=...) at genps.f:68
#4  0x000000000040d1aa in x_to_f_arg (ndim=10, iconfig=104, mincfig=104, maxcfig=104, invar=10, wgt=0.03125, x=..., p=...)
    at genps.f:60
#5  0x000000000045c865 in sample_full (ndim=10, ncall=32, itmax=1, itmin=1, dsig=0x438b00 <dsig>, ninvar=10, nconfigs=1, 
    vecsize_used=16384) at dsample.f:172
#6  0x000000000043427a in driver () at driver.f:257
#7  0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
#8  0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#9  0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x0000000000403845 in _start ()

I will now run valgrind also on madevent_cpp

valassi commented 2 weeks ago

I have tried to run valgrind on madevent_cpp but this hangs...

valgrind --track-origins=yes --gen-suppressions=all --max-stackframe=3932984 --log-file=memcheckc2.log ./madevent_cpp < input_cudacpp_104

After 200 minutes (more than 3 hours) this was still running, so I interrupted it:

==3737758== Process terminating with default action of signal 2 (SIGINT)
==3737758==    at 0x6E692E5: KernelAccessHelper<mg5amcCpu::MemoryAccessCouplingsBase, false>::kernelAccessRecord(double*) (MemoryAccessHelpers.h:116)
==3737758==    by 0x6E59E11: double& KernelAccessHelper<mg5amcCpu::MemoryAccessCouplingsBase, false>::kernelAccessField<int>(double*, int) (MemoryAccessHelpers.h:139)
==3737758==    by 0x6E6A02D: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessIx2(double*, int) (MemoryAccessCouplings.h:185)
==3737758==    by 0x6E69D99: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessIx2Const(double const*, int) (MemoryAccessCouplings.h:205)
==3737758==    by 0x6E694CB: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessConst(double const*) (MemoryAccessCouplings.h:256)
==3737758==    by 0x6E662B9: void mg5amcCpu::FFV1_0<mg5amcCpu::KernelAccessWavefunctions<false>, mg5amcCpu::KernelAccessAmplitudes<false>, mg5amcCpu::KernelAccessCouplings<false> >(double const*, double const*, double const*, double const*, double, double*) (HelAmps_sm.h:1104)
==3737758==    by 0x6E485A4: mg5amcCpu::calculate_wavefunctions(int, double const*, double const*, double*, unsigned int, double*, double*, double*, int) (CPPProcess.cc:1174)
==3737758==    by 0x6E5914F: mg5amcCpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*, int) [clone ._omp_fn.0] (CPPProcess.cc:3223)
==3737758==    by 0x748F575: GOMP_parallel (in /usr/lib64/libgomp.so.1.0.0)
==3737758==    by 0x6E58BE2: mg5amcCpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*, int) (CPPProcess.cc:3203)
==3737758==    by 0x6E6B371: mg5amcCpu::MatrixElementKernelHost::computeMatrixElements(unsigned int) (MatrixElementKernels.cc:115)
==3737758==    by 0x6E6DAFB: mg5amcCpu::Bridge<double>::cpu_sequence(double const*, double const*, double const*, double const*, unsigned int, double*, int*, int*, bool) (Bridge.h:390)

I will abandon this line of tests for the moment
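
If this test is retried later, one option would be to bound the run time up front instead of interrupting by hand. The sketch below assumes GNU coreutils timeout is available. The wrapper name run_bounded and the 30 minute limit are hypothetical, and it does not address why the run is so slow under valgrind.

```shell
# Run a command under a wall-clock limit (GNU coreutils timeout).
# Returns 124 if the limit was hit, otherwise the command's own exit status.
run_bounded() {
  limit=$1
  shift
  timeout "$limit" "$@"
}

# Hypothetical use for the run above (left commented out, 30m is arbitrary):
# run_bounded 30m valgrind --track-origins=yes --gen-suppressions=all \
#   --max-stackframe=3932984 --log-file=memcheckc2.log ./madevent_cpp < input_cudacpp_104
```

An exit status of 124 then distinguishes a run that hit the limit from one that finished or crashed on its own.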

valassi commented 1 week ago

A couple of valgrind issues (from running valgrind over the Fortran executable) have been addressed. The fixes are merged in PR #869.

The issue remains that running valgrind over the cudacpp madevent instead hangs. This should eventually be tracked in a new issue.

I am closing this for simplicity as the most urgent issues have gone.

valassi commented 1 week ago

PS A couple of comments from extra tests

So I would say that I completed this valgrind investigation for now...