Closed valassi closed 5 months ago
NB this happens both for AVX=none and for instance AVX=512y... so it is easier to debug it for none
Interesting. This seems to be a proper underflow FPE. The float range stops at E-38. Here there is a multiplication of E-19 by E-20 which would be below that...
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/heft_gg_bb.sa/SubProcesses/P1_Sigma_heft_gg_bbx> CUDACPP_RUNTIME_ENABLEFPE=on gdb --args ./check.exe --common -p 2 64 2
GNU gdb (GDB) Red Hat Enterprise Linux 10.2-10.el9
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./check.exe...
(gdb) run
Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/heft_gg_bb.sa/SubProcesses/P1_Sigma_heft_gg_bbx/check.exe --common -p 2 64 2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
WARNING! CUDACPP_RUNTIME_ENABLEFPE is set: enable Floating Point Exceptions
(ompnumthreadsNotSetMeansOneThread) DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
(ompnumthreadsNotSetMeansOneThread) omp_get_max_threads() = 1
INFO: The application does not require the host to support any AVX feature
INFO: The application does not require the host to support any AVX feature
Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7ed466d in mg5amcCpu::calculate_wavefunctions (ihel=0, allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0, jamp2_sv=0x0, ievt00=2)
at CPPProcess.cc:434
434 fptype2_sv deltaMEs2 = ( jampRi_sv * ztempR_sv + jampIi_sv * ztempI_sv );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0 0x00007ffff7ed466d in mg5amcCpu::calculate_wavefunctions (ihel=0, allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0, jamp2_sv=0x0,
ievt00=2) at CPPProcess.cc:434
#1 0x00007ffff7ed4d7b in mg5amcCpu::sigmaKin_getGoodHel (allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0, isGoodHel=0x458cc0, nevt=128)
at CPPProcess.cc:787
#2 0x00007ffff7ee0391 in mg5amcCpu::MatrixElementKernelHost::computeGoodHelicities (this=0x457e80) at MatrixElementKernels.cc:70
#3 0x0000000000408f48 in main (argc=6, argv=0x7fffffffdae8) at check_sa.cc:678
(gdb) p jampRi_sv
$1 = -1.9187456e-20
(gdb) p ztempR_sv
$2 = -1.15124736e-19
(gdb) p jampIi_sv
$3 = -5.37382571e-14
(gdb) p ztempI_sv
$4 = -3.22429529e-13
(gdb)
NB This is for .sa built in debug mode
make cleanall; make -j FPTYPE=f AVX=none debug
I tried to add some protections before computing deltaMEs (essentially, flushing to zero ztemp if it and/or jamp2 is to low). This is not enough, there are then also other FPEs before
Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7ed5a2b in mg5amcCpu::cxabs2 (c=...) at ../../src/mgOnGpuVectors.h:896
896 return cxreal( c ) * cxreal( c ) + cximag( c ) * cximag( c );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0 0x00007ffff7ed5a2b in mg5amcCpu::cxabs2 (c=...) at ../../src/mgOnGpuVectors.h:896
#1 0x00007ffff7ed448d in mg5amcCpu::calculate_wavefunctions (ihel=0, allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0,
jamp2_sv=0x7fffffffc464, ievt00=2) at CPPProcess.cc:336
#2 0x00007ffff7ed56af in mg5amcCpu::_ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfPiS3_i._omp_fn.0(void) () at CPPProcess.cc:1048
#3 0x00007ffff18f2576 in GOMP_parallel () from /lib64/libgomp.so.1
#4 0x00007ffff7ed53a6 in mg5amcCpu::sigmaKin (allmomenta=0x459040, allcouplings=0x45b0c0, allrndhel=0x457940, allrndcol=0x456580, allMEs=0x4576c0,
allselhel=0x456800, allselcol=0x456a80, nevt=128) at CPPProcess.cc:1026
#5 0x00007ffff7ee08c7 in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x457e80, channelId=0) at MatrixElementKernels.cc:85
#6 0x0000000000408ff6 in main (argc=6, argv=0x7fffffffdae8) at check_sa.cc:689
(gdb) up
#1 0x00007ffff7ed448d in mg5amcCpu::calculate_wavefunctions (ihel=0, allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0,
jamp2_sv=0x7fffffffc464, ievt00=2) at CPPProcess.cc:336
336 jamp2_sv[ncolor * iParity + icolC] += cxabs2( jamp_sv[icolC] );
(gdb) l
331
332 // *** COLOR CHOICE BELOW ***
333 // Store the leading color flows for choice of color
334 if( jamp2_sv ) // disable color choice if nullptr
335 for( int icolC = 0; icolC < ncolor; icolC++ )
336 jamp2_sv[ncolor * iParity + icolC] += cxabs2( jamp_sv[icolC] );
337
338 // *** COLOR MATRIX BELOW ***
339 // (This method used to be called CPPProcess::matrix_1_gg_bbx()?)
340
(gdb) p jamp_sv
$1 = {{m_real = -2.53915787e-05, m_imag = 7.25820684e-08}, {m_real = 3.40938568e-05, m_imag = 3.62874317e-08}, {m_real = -1.9187456e-20,
m_imag = -5.37382571e-14}}
I give an update here and will probably resume next week.
This is a very complex issue with numerical precision, testing the limits of our "mixed" precision approach.
This has appeared in one particular HEFT gg_bb with MIW<=1, where one diagram with suppressed contributions is allowed, but potentially could have appeared elsewhere.
The problem is that when we start manipulating numbers in the E-19 to E-16 range things become very difficult. FLT_MIN is 1.2E-38, so the sqrt of that is around E-19. As soon as you multiply two numbers around E-19, you risk getting an underflow because a float does not manage to handle below E-38. So in particular amplitudes (jamp_sv etc) in that E-19 range are complicated, because we must square them.
I would say that there are two possible approaches:
--
Some details on strategy 1.
After various attempts, I made a patch where a lambda "underflowFTZ" flushes to zero all jamps below E-19. See https://github.com/valassi/madgraph4gpu/commit/a34076b566fe684748c007ef47c2fd3014547bee
For FPTYPE=f, this seems to work, I get no FPEs in any tests (luck?). The FTZ is done immediately after computing Feynman diagrams (which were themselves computed in float).
For FPTYPE=m, I still get a few nasty issues. The FTZ is done a bit later (because jamps are computed in double and must be used in double for some cases), but essentially the color matrix uses jamps which are float, and have also been flushed to zero if below E-19.
The weird corner cases are that:
This is similar to the problems I had seen for all the other FPEs from invalid and divide by zero.
I have two questions here
On one, where I am now is that I am doing some debugging to try and understand why none and avx2 differ, for instance. This is WIP, but it would seem that some jamps (computed in double precision!) differ at the level of E-16?
In my check.exe -p 2 64 2 --common
test, for ievt00=75 ihel=3 I get amp_sv[icol=0]
I imagine that this later explains how come a ztemp E-23 appears only in avx2 and not in none. There must be some very string cancellations here and there (both within the calculations of amp_sv from Feynman diagrams, and later in the calculation of ztemp from jamps?).
Anyway: what I will try next week is:
This might seem an overkill, but I do not think it is. If a jamp E-16 gives ztemp E-23 (i.e. E-7 lower due to cancellations), there are some things we need to better control even if we set a tolerance much higher, with jamp E-12 or E-10. This is especially important if we want to use mixed precision as default.
--
About strategy 2.
The alternative could indeed be to rely on the compiler/processor to do the FTZ, and simply switch off underflow FPEs. We lose some understanding/control, but it may be much easier, if it is possible.
Various notes:
A few interesting links
Hi,
I have checked the situation on a pure fortran side to check the physics reason: so for g g > b b~ in the standard model, the full matrix-element square is
Matrix element = 0.53277385984554848 GeV^ 0
in the HEFT it is
Matrix element = 0.53277205264991745
So the final difference is quite small... (pure higgs contribution is: Matrix element = 7.7547177645227751E-008 GeV^ 0 and interference between higgs and QCD is -1.8847428088393383E-006)
If you look at the jamp level you will see that the heft case has three jamp, while the SM has two. So this "explains" why some jamp are so small (and for some helicity numerically zero).
So yes some way of filtering/ignoring those make sense. So for me strategy 1 or 2 are both valid.
Thanks Olivier. I did a few more tests.
About strategy 2, I wanted to check if it makes sense to enable underflow or not. I went back to try and reproduce the original issues I found in #701. THis is what I did
git reset --hard d1d87b649c661e6ad302e26767baaf7ce70b8058
git cherry-pick da7df0f90c788774646a71962c95a8b3fe544008 10b099ef337289c8b46bacb49d82b5bdebad2760
./CODEGEN/generateAndCompare.sh nobm_pp_ttW
git cherry-pick d4ddce6c26cdf77d1f165deb1985d4ea0963909a d870e34d3ab63e3b5b2ce1a02d065bcf2c88e805
HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW.mad/
This is enough to reproduce the printouts
Using random number seed offset = 21
INFO: Running Survey
Creating Jobs
Working on SubProcesses
INFO: P1_gu_ttxwpd
INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements
INFO: P1_gd_ttxwmu
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
...
Then I even reproduced it here on a subset,
> nobm_gu_ttxwpd)
> cmd="import model sm-no_b_mass
> generate g u > t t~ w+ d"
> ;;
I instrumented the latter code with what I am now using more generally
inline void fpeEnable()
{
static bool first = true; // FIXME: quick and dirty hack to do this only once (can be removed when separate C++/CUDA builds are implemented)
if( ! first ) return;
first = false;
#ifndef __APPLE__ // on MacOS feenableexcept is not defined #730
int fpes = fegetexcept();
std::cerr << "fpeEnable: analyse fegetexcept()=" << fpes << std::endl;
std::cerr << "fpeEnable: FE_DIVBYZERO is" << ( ( fpes & FE_DIVBYZERO ) ? " " : " NOT " ) << "enabled" << std::endl;
std::cerr << "fpeEnable: FE_INEXACT is" << ( ( fpes & FE_INEXACT ) ? " " : " NOT " ) << "enabled" << std::endl;
std::cerr << "fpeEnable: FE_INVALID is" << ( ( fpes & FE_INVALID ) ? " " : " NOT " ) << "enabled" << std::endl;
std::cerr << "fpeEnable: FE_OVERFLOW is" << ( ( fpes & FE_OVERFLOW ) ? " " : " NOT " ) << "enabled" << std::endl;
std::cerr << "fpeEnable: FE_UNDERFLOW is" << ( ( fpes & FE_UNDERFLOW ) ? " " : " NOT " ) << "enabled" << std::endl;
const char* enableFPEc = getenv( "CUDACPP_RUNTIME_ENABLEFPE" );
const bool enableFPE = ( enableFPEc != 0 ) && ( std::string( enableFPEc ) != "" );
if( enableFPE )
{
std::cerr << "fpeEnable: CUDACPP_RUNTIME_ENABLEFPE is set, enable additional FPEs if not already done" << std::endl;
feenableexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW ); // debug #701
fpes = fegetexcept();
std::cerr << "fpeEnable: analyse fegetexcept()=" << fpes << std::endl;
std::cerr << "fpeEnable: FE_DIVBYZERO is" << ( ( fpes & FE_DIVBYZERO ) ? " " : " NOT " ) << "enabled" << std::endl;
std::cerr << "fpeEnable: FE_INEXACT is" << ( ( fpes & FE_INEXACT ) ? " " : " NOT " ) << "enabled" << std::endl;
std::cerr << "fpeEnable: FE_INVALID is" << ( ( fpes & FE_INVALID ) ? " " : " NOT " ) << "enabled" << std::endl;
std::cerr << "fpeEnable: FE_OVERFLOW is" << ( ( fpes & FE_OVERFLOW ) ? " " : " NOT " ) << "enabled" << std::endl;
std::cerr << "fpeEnable: FE_UNDERFLOW is" << ( ( fpes & FE_UNDERFLOW ) ? " " : " NOT " ) << "enabled" << std::endl;
}
else
{
std::cerr << "fpeEnable: CUDACPP_RUNTIME_ENABLEFPE is NOT set, do not enable additional FPEs" << std::endl;
}
#else
std::cerr << "fpeEnable: fegetexcept and feenableexcept are not available on MacOS, keep default FPE settings" << std::endl;
#endif
}
Now, what happens is that these traps are actually all disabled...
HRDCOD=1 tlau/lauX.sh -CPP nobm_gu_ttxwpd.mad
...
Using random number seed offset = 21
INFO: Running Survey
Creating Jobs
Working on SubProcesses
INFO: P1_gu_ttxwpd
INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements
INFO: Idle: 2, Running: 3, Completed: 0 [ current time: 22h03 ]
...
fpeEnable: analyse fegetexcept()=0
fpeEnable: FE_DIVBYZERO is NOT enabled
fpeEnable: FE_INEXACT is NOT enabled
fpeEnable: FE_INVALID is NOT enabled
fpeEnable: FE_OVERFLOW is NOT enabled
fpeEnable: FE_UNDERFLOW is NOT enabled
fpeEnable: CUDACPP_RUNTIME_ENABLEFPE is NOT set, do not enable additional FPEs
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
...
I would say that:
About strategy 1, some tests seem to show that ztemp E-23 comes from adding/subtracting jamps E-16
calculate_wavefunctions... icol=1
calculate_wavefunctions... icol=2
calculate_wavefunctions: ievt00=72
calculate_wavefunctions: ihel=3
DIAGRAM2 ievt00=72 ihel=3
amp_sv: { (6.75968e-17, 1.27204), (2.0279e-16, 0.978039), (-7.43565e-16, 1.14048), (-5.40774e-16, 1.44726) }
DIAGRAM3 ievt00=72 ihel=3
amp_sv: { (-1.27202, 1.35194e-16), (-0.978028, 4.73178e-16), (-1.14047, -1.48713e-15), (-1.44724, -6.75968e-16) }
AFTER_FEYNMAN ievt00=72 ihel=3 icol=0
jamp_sv: { (-1.64981e-05, -6.75968e-17), (-1.09635e-05, -2.70387e-16), (-1.36621e-05, 7.43565e-16), (-2.33358e-05, 1.35>
DIAGRAM2 ievt00=72 ihel=3
amp_sv: { (-1.68992e-16, 1.19207), (-2.0279e-16, 0.922304), (-1.01395e-16, 1.40348), (3.37984e-17, 1.16874) }
DIAGRAM3 ievt00=72 ihel=3
amp_sv: { (-1.19205, -3.37984e-16), (-0.92222, -4.05581e-16), (-1.40346, -2.0279e-16), (-1.16868, 6.75968e-17) }
AFTER_FEYNMAN ievt00=72 ihel=3 icol=0
jamp_sv: { (-1.46773e-05, 1.68992e-16), (-8.34606e-05, 2.0279e-16), (-2.08334e-05, 1.01395e-16), (-5.96745e-05, -3.3798>
calculate_wavefunctions... icol=0
OFF-DIAGONAL ievt00=72 ihel=3 icol=0 jcol=1
ztempR_sv: { -8.79898e-05, -5.84719e-05, -7.28645e-05, -0.000124457, -7.82787e-05, -0.000445123, -0.000111112, -0.00031>
ztempI_sv: { -3.60516e-16, -1.44207e-15, 3.96568e-15, 7.21033e-16, 9.01291e-16, 1.08155e-15, 5.40774e-16, -1.80258e-16 }
OFF-DIAGONAL ievt00=72 ihel=3 icol=0 jcol=2
ztempR_sv: { -0.000156501, -0.000161565, -0.000155596, -0.000172894, -0.000155289, -0.000458665, -0.000165366, -0.00033>
ztempI_sv: { -2.70387e-16, -1.17168e-15, 2.97426e-15, -2.14884e-23, 6.75968e-16, 8.11162e-16, 4.05581e-16, -1.35194e-16>
BEFORE_DELTAMES2 ievt00=72 ihel=3 icol=0
jampRi_sv: { -1.64981e-05, -1.09635e-05, -1.36621e-05, -2.33358e-05, -1.46773e-05, -8.34606e-05, -2.08334e-05, -5.96745>
ztempR_sv: { -0.000156501, -0.000161565, -0.000155596, -0.000172894, -0.000155289, -0.000458665, -0.000165366, -0.00033>
jampIi_sv: { -6.75968e-17, -2.70387e-16, 7.43565e-16, 1.35194e-16, 1.68992e-16, 2.0279e-16, 1.01395e-16, -3.37984e-17 }
ztempI_sv: { -2.70387e-16, -1.17168e-15, 2.97426e-15, -2.14884e-23, 6.75968e-16, 8.11162e-16, 4.05581e-16, -1.35194e-16>
Floating point exception (core dumped)
Specifically, one ztemp was 7.21033e-16 and then becomes -2.14884e-23 after a sum.
I would say that I also need to flush-to-zero ztemp. In principle this should be enough? (The color matrix itself should have values which are O(1) or similar I hope).
Ok I believe this is finally fixed in MR #832.
I implemented my "strategy 1" above: I also flush-to-zero manually the ztemp if it is below sqrt(flt_min). A delicate point is that in mised mode the number of SIMD elements is neppV2=2*neppV (neppV is the number of double elements for Feynman, neppV2 is the number of float elements for color algebra).
I did a few checks on strategy 2 as discussed above, but I think it's better to avoid the underflows in the code rather than handling them or ignoring them.
Closing this.
Reopening this issue after reviewing MR #832 with @oliviermattelaer (thanks Olivier)
A couple of things to do in particular
About strategy 2, finally I found the relevant info https://stackoverflow.com/questions/44308577/ieee-underflow-flag-ieee-denormal-in-fortran-77 https://gcc.gnu.org/onlinedocs/gfortran/Debugging-Options.html I am doing some small scale tests.
I would say that definitely it seems a good idea to
-ffpe-summary=<list>
where the list does not include denormal)
- well maybe I got it wrong all the time: Fortran does not enable FPE traps, it simply prints out a warning/debug message at the end (on STOP? https://stackoverflow.com/a/39368925)... so these were never errors in Fortran, just warnings?
Yes definitely this is the case, as well explained in https://stackoverflow.com/questions/44308577/ieee-underflow-flag-ieee-denormal-in-fortran-77
I have now tried to add -ffpe-trap=invalid,zero,overflow
to the fortran compilation, and the exceptions do send SIGFPE. As a strategy, I will not add these flags to fortran, but rather cal feenableexcept in CPPProcess.cc (so that I get the same behavioir also in c++/cuda only tests)
I have now implemented a new strategy for this #831 in PR #832. See https://github.com/madgraph5/madgraph4gpu/pull/832#issuecomment-2053717726
Essentially:
I think this is now really resolved and can be closed. Will be fixed in #832
By the way, as mentioned in #832: FLT_MIN/10 is a denormal that is still usable, even if not exact. It fires underflow, but it is better than a zero.
After fixing #828, I regenerated heft_gg_bb.
Unfortunately, the tput tests for float fail with a FPE. This boils down to this reproducer: