madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Test our code through valgrind #207

Open valassi opened 3 years ago

valassi commented 3 years ago

This is just a placeholder about valgrind.

Presently this is just to report one issue - in case we find this again: valgrind does not work with AVX512.

On a working master (valgrind fails, but the app without valgrind succeeds)

~/GPU2020/madgraph4gpuTer/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum> valgrind ./build.512y/check.exe -p 32 32 1
==27277== Memcheck, a memory error detector
==27277== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27277== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==27277== Command: ./build.512y/check.exe -p 32 32 1
==27277== 
ERROR! The application is built for skylake-avx512 (AVX512VL) but the host does not support it
==27277== 

On a WIP branch where I am debugging a segfault

valgrind ./build.512y/check.exe -p 32 32 1
==28948== Memcheck, a memory error detector
==28948== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28948== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==28948== Command: ./build.512y/check.exe -p 32 32 1
==28948== 
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFE 0x28 0x7F 0x87 0x18 0x0 0x0 0x0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==28948== valgrind: Unrecognised instruction at address 0x4121a4.
==28948==    at 0x4121A4: Proc::CPPProcess::CPPProcess(int, int, int, bool, bool) (CPPProcess.cc:221)
==28948==    by 0x404C35: main (check.cc:287)
==28948== Your program just tried to execute an instruction that Valgrind
==28948== did not recognise.  There are two possible reasons for this.
==28948== 1. Your program has a bug and erroneously jumped to a non-code
==28948==    location.  If you are running Memcheck and you just saw a
==28948==    warning about a bad jump, it's probably your program's fault.
==28948== 2. The instruction is legitimate but Valgrind doesn't handle it,
==28948==    i.e. it's Valgrind's fault.  If you think this is the case or
==28948==    you are not sure, please let us know and we'll try to fix it.
==28948== Either way, Valgrind will now raise a SIGILL signal which will
==28948== probably kill your program.
==28948== 
==28948== Process terminating with default action of signal 4 (SIGILL)
==28948==  Illegal opcode at address 0x4121A4
==28948==    at 0x4121A4: Proc::CPPProcess::CPPProcess(int, int, int, bool, bool) (CPPProcess.cc:221)
==28948==    by 0x404C35: main (check.cc:287)
==28948== 

This is well known

hageboeck commented 3 years ago

I recommend address sanitizer. It's much faster, since it's not a VM, and finds more bugs, because it can also survey stack access. The only downside is that you have to recompile before being able to use it.

See e.g.

valassi commented 3 months ago

I have used valgrind to analyze some specific issues in #868