Adding -g to GLOBAL_FLAGS gives slightly more detail (file names and line numbers in the stack traces):
==3682257== Memcheck, a memory error detector
==3682257== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3682257== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3682257== Command: ./madevent_fortran
==3682257== Parent PID: 3638452
==3682257==
==3682257== Conditional jump or move depends on uninitialised value(s)
==3682257== at 0x425E73: setclscales_ (reweight.f:1230)
==3682257== by 0x4284D9: update_scale_coupling_vec_ (reweight.f:1876)
==3682257== by 0x436B87: dsig_vec_ (auto_dsig.f:316)
==3682257== by 0x45AFAA: sample_full_ (dsample.f:208)
==3682257== by 0x4331AD: MAIN__ (driver.f:256)
==3682257== by 0x40268E: main (driver.f:301)
==3682257==
{
<insert_a_suppression_name_here>
Memcheck:Cond
fun:setclscales_
fun:update_scale_coupling_vec_
fun:dsig_vec_
fun:sample_full_
fun:MAIN__
fun:main
}
==3682257==
==3682257== HEAP SUMMARY:
==3682257== in use at exit: 552 bytes in 3 blocks
==3682257== total heap usage: 137,537 allocs, 137,534 frees, 48,779,094 bytes allocated
==3682257==
==3682257== 544 (32 direct, 512 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3
==3682257== at 0x719786F: malloc (vg_replace_malloc.c:381)
==3682257== by 0x7404CC8: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257== by 0x7647C63: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257== by 0x7635319: ??? (in /usr/lib64/libgfortran.so.5.0.0)
==3682257== by 0x76357FC: _gfortran_st_open (in /usr/lib64/libgfortran.so.5.0.0)
==3682257== by 0x47AA9F: open_file_ (open_file.f:40)
==3682257== by 0x432C91: MAIN__ (driver.f:151)
==3682257== by 0x40268E: main (driver.f:301)
==3682257==
{
<insert_a_suppression_name_here>
Memcheck:Leak
match-leak-kinds: definite
fun:malloc
obj:/usr/lib64/libgfortran.so.5.0.0
obj:/usr/lib64/libgfortran.so.5.0.0
obj:/usr/lib64/libgfortran.so.5.0.0
fun:_gfortran_st_open
fun:open_file_
fun:MAIN__
fun:main
}
==3682257== LEAK SUMMARY:
==3682257== definitely lost: 32 bytes in 1 blocks
==3682257== indirectly lost: 512 bytes in 1 blocks
==3682257== possibly lost: 0 bytes in 0 blocks
==3682257== still reachable: 8 bytes in 1 blocks
==3682257== suppressed: 0 bytes in 0 blocks
==3682257== Reachable blocks (those to which a pointer was found) are not shown.
==3682257== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3682257==
==3682257== Use --track-origins=yes to see where uninitialised values come from
==3682257== For lists of detected and suppressed errors, rerun with: -s
==3682257== ERROR SUMMARY: 26058 errors from 2 contexts (suppressed: 0 from 0)
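As a stopgap while a fix is pending, the suppression that valgrind generated for the libgfortran leak could be saved to a file and passed back with --suppressions. A minimal sketch: the suppression name, the function name and the file name are hypothetical, and the frames are copied from the generated block above.

```shell
# Write a suppression file for the libgfortran leak reported above.
# "gfortran_st_open_leak" is a hypothetical name chosen for illustration;
# the frames are copied from the block printed by --gen-suppressions=all.
write_open_file_suppression() {
  cat > "$1" <<'EOF'
{
   gfortran_st_open_leak
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   obj:/usr/lib64/libgfortran.so.5.0.0
   fun:_gfortran_st_open
   fun:open_file_
   fun:MAIN__
   fun:main
}
EOF
}

# Example use (not executed here; needs valgrind and the binary, and the
# same stdin redirection as in the original run):
#   write_open_file_suppression madevent.supp
#   valgrind --suppressions=madevent.supp --log-file=memcheck.log ./madevent_fortran
```

This only hides the report. It does not replace fixing the leak itself.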
The leak is reported in https://github.com/mg5amcnlo/mg5amcnlo/issues/109 and a fix for it is in https://github.com/mg5amcnlo/mg5amcnlo/pull/110
(Note: this is one example of #207 about testing code through valgrind)
The uninitialised value is reported in https://github.com/mg5amcnlo/mg5amcnlo/issues/111 and a workaround (not a real fix) is in https://github.com/mg5amcnlo/mg5amcnlo/pull/112
After applying both patches above, valgrind no longer reports any issues in madevent_fortran:
==3735418== Memcheck, a memory error detector
==3735418== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3735418== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3735418== Command: ./madevent_fortran
==3735418== Parent PID: 3638452
==3735418==
==3735418==
==3735418== HEAP SUMMARY:
==3735418== in use at exit: 8 bytes in 1 blocks
==3735418== total heap usage: 141,695 allocs, 141,694 frees, 50,553,628 bytes allocated
==3735418==
==3735418== LEAK SUMMARY:
==3735418== definitely lost: 0 bytes in 0 blocks
==3735418== indirectly lost: 0 bytes in 0 blocks
==3735418== possibly lost: 0 bytes in 0 blocks
==3735418== still reachable: 8 bytes in 1 blocks
==3735418== suppressed: 0 bytes in 0 blocks
==3735418== Rerun with --leak-check=full to see details of leaked memory
==3735418==
==3735418== For lists of detected and suppressed errors, rerun with: -s
==3735418== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
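To compare the two runs without reading the full logs, the error count can be pulled from the ERROR SUMMARY line. A minimal sketch, assuming the Memcheck log format shown here; the function names are hypothetical.

```shell
# Print the error count from the last "ERROR SUMMARY" line of a Memcheck log.
# Assumes the log format shown above (Valgrind 3.19). Prints nothing and
# returns 1 if no summary line is found (e.g. the log is incomplete).
memcheck_error_count() {
  count=$(sed -n 's/^==[0-9]*== ERROR SUMMARY: \([0-9]*\) errors.*/\1/p' "$1" | tail -n 1)
  [ -n "$count" ] || return 1
  echo "$count"
}

# Succeed only if the log reports zero errors.
memcheck_is_clean() {
  [ "$(memcheck_error_count "$1")" = "0" ]
}

# Example use (not executed here), with a hypothetical log file name:
#   memcheck_is_clean memcheck.log && echo "no valgrind errors"
```

On the first log above this would print 26058, and on the second it would print 0.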
Note, however, that the rotxxx crash is still there in madevent_cpp:
gdb -ex 'run < input_cudacpp_104' -ex where -ex 'set confirm off' -ex quit ./madevent_cpp
...
Program received signal SIGFPE, Arithmetic exception.
rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
1247 prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
#0 rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
#1 0x00000000004087e0 in gentcms (pa=..., pb=..., t=-181765.47706865534, phi=0.64468537567405615, ma2=0, m1=234.1712866912786,
m2=210.15563843880372, p1=..., pr=..., jac=3.0327734872026782e+25) at genps.f:1480
#2 0x0000000000409849 in one_tree (itree=..., tstrategy=<optimized out>, iconfig=104, nbranch=4, p=..., m=..., s=..., x=...,
jac=3.0327734872026782e+25, pswgt=1) at genps.f:1167
#3 0x000000000040bb84 in gen_mom (iconfig=104, mincfig=104, maxcfig=104, invar=10, wgt=0.03125, x=..., p1=...) at genps.f:68
#4 0x000000000040d1aa in x_to_f_arg (ndim=10, iconfig=104, mincfig=104, maxcfig=104, invar=10, wgt=0.03125, x=..., p=...)
at genps.f:60
#5 0x000000000045c865 in sample_full (ndim=10, ncall=32, itmax=1, itmin=1, dsig=0x438b00 <dsig>, ninvar=10, nconfigs=1,
vecsize_used=16384) at dsample.f:172
#6 0x000000000043427a in driver () at driver.f:257
#7 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
#8 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#9 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x0000000000403845 in _start ()
I will now also run valgrind on madevent_cpp
I have tried to run valgrind on madevent_cpp, but it hangs:
valgrind --track-origins=yes --gen-suppressions=all --max-stackframe=3932984 --log-file=memcheckc2.log ./madevent_cpp < input_cudacpp_104
After 200 minutes (more than 3 hours) this was still running, so I interrupted it:
==3737758== Process terminating with default action of signal 2 (SIGINT)
==3737758== at 0x6E692E5: KernelAccessHelper<mg5amcCpu::MemoryAccessCouplingsBase, false>::kernelAccessRecord(double*) (MemoryAccessHelpers.h:116)
==3737758== by 0x6E59E11: double& KernelAccessHelper<mg5amcCpu::MemoryAccessCouplingsBase, false>::kernelAccessField<int>(double*, int) (MemoryAccessHelpers.h:139)
==3737758== by 0x6E6A02D: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessIx2(double*, int) (MemoryAccessCouplings.h:185)
==3737758== by 0x6E69D99: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessIx2Const(double const*, int) (MemoryAccessCouplings.h:205)
==3737758== by 0x6E694CB: mg5amcCpu::KernelAccessCouplings<false>::kernelAccessConst(double const*) (MemoryAccessCouplings.h:256)
==3737758== by 0x6E662B9: void mg5amcCpu::FFV1_0<mg5amcCpu::KernelAccessWavefunctions<false>, mg5amcCpu::KernelAccessAmplitudes<false>, mg5amcCpu::KernelAccessCouplings<false> >(double const*, double const*, double const*, double const*, double, double*) (HelAmps_sm.h:1104)
==3737758== by 0x6E485A4: mg5amcCpu::calculate_wavefunctions(int, double const*, double const*, double*, unsigned int, double*, double*, double*, int) (CPPProcess.cc:1174)
==3737758== by 0x6E5914F: mg5amcCpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*, int) [clone ._omp_fn.0] (CPPProcess.cc:3223)
==3737758== by 0x748F575: GOMP_parallel (in /usr/lib64/libgomp.so.1.0.0)
==3737758== by 0x6E58BE2: mg5amcCpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*, int) (CPPProcess.cc:3203)
==3737758== by 0x6E6B371: mg5amcCpu::MatrixElementKernelHost::computeMatrixElements(unsigned int) (MatrixElementKernels.cc:115)
==3737758== by 0x6E6DAFB: mg5amcCpu::Bridge<double>::cpu_sequence(double const*, double const*, double const*, double const*, unsigned int, double*, int*, int*, bool) (Bridge.h:390)
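One way to bound such a run, rather than interrupting it by hand, is to wrap it in a time limit that sends SIGINT, so that valgrind still prints where the process was. A minimal sketch, assuming GNU coreutils timeout is available; the function name and the 4-hour limit are hypothetical.

```shell
# Run a command under a wall-clock limit. On timeout, send SIGINT so that
# valgrind can still report the stack at the point of interruption, as in
# the trace above. Assumes GNU coreutils `timeout`; it exits with status
# 124 when the limit is reached.
run_with_limit() {
  limit="$1"
  shift
  timeout --signal=INT "$limit" "$@"
}

# Example use (not executed here), with a hypothetical 4-hour limit:
#   run_with_limit 4h valgrind --track-origins=yes --log-file=memcheckc2.log \
#     ./madevent_cpp < input_cudacpp_104
```

An exit status of 124 then distinguishes a run that hit the limit from one that finished or crashed on its own.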
I will abandon this line of tests for the moment
A couple of valgrind issues (found running valgrind over fortran) have been addressed. The fixes are merged in PR #869.
The issue remains that running valgrind over the cudacpp madevent hangs instead. This should eventually be tracked in a new issue.
I am closing this for simplicity, as the most urgent issues are gone.
PS A couple of comments from extra tests
So I would say that I completed this valgrind investigation for now...
Since we are not yet converging on some issues like the rotxxx segfault #855 and the possible fix with volatile in #857, in parallel I am also running some checks with valgrind.
On the current upstream/master 286280fa4e4b32c12dd35fbb4bcf75d9930d4582, using the same code I used for the #855 reproducer described here, https://github.com/madgraph5/madgraph4gpu/pull/857#issuecomment-2187007215 , I run valgrind, to start with on Fortran ONLY.
Note that 32 events are enough: the cpp still crashes.
I do this valgrind
And this tells me