Underflow Floating Point Exception for float and mixed FPTYPE in HEFT gg to bb

valassi commented 6 months ago

After fixing #828, I regenerated heft_gg_bb.

Unfortunately, the tput tests for float fail with a FPE. This boils down to this reproducer:

make -j FPTYPE=f
CUDACPP_RUNTIME_ENABLEFPE=on ./check.exe --common -p 2 64 2
...
WARNING! CUDACPP_RUNTIME_ENABLEFPE is set: enable Floating Point Exceptions
(ompnumthreadsNotSetMeansOneThread) DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
(ompnumthreadsNotSetMeansOneThread) omp_get_max_threads() = 1
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
Floating Point Exception (CPU)

valassi commented 6 months ago

NB this happens both for AVX=none and for instance AVX=512y... so it is easier to debug it for none

valassi commented 6 months ago

Interesting. This seems to be a proper underflow FPE. The float range stops at E-38. Here there is a multiplication of E-19 by E-20 which would be below that...

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/heft_gg_bb.sa/SubProcesses/P1_Sigma_heft_gg_bbx> CUDACPP_RUNTIME_ENABLEFPE=on gdb --args ./check.exe --common -p 2 64 2
GNU gdb (GDB) Red Hat Enterprise Linux 10.2-10.el9
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./check.exe...
(gdb) run
Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/heft_gg_bb.sa/SubProcesses/P1_Sigma_heft_gg_bbx/check.exe --common -p 2 64 2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
WARNING! CUDACPP_RUNTIME_ENABLEFPE is set: enable Floating Point Exceptions
(ompnumthreadsNotSetMeansOneThread) DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
(ompnumthreadsNotSetMeansOneThread) omp_get_max_threads() = 1
INFO: The application does not require the host to support any AVX feature
INFO: The application does not require the host to support any AVX feature

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7ed466d in mg5amcCpu::calculate_wavefunctions (ihel=0, allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0, jamp2_sv=0x0, ievt00=2)
    at CPPProcess.cc:434
434             fptype2_sv deltaMEs2 = ( jampRi_sv * ztempR_sv + jampIi_sv * ztempI_sv );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  0x00007ffff7ed466d in mg5amcCpu::calculate_wavefunctions (ihel=0, allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0, jamp2_sv=0x0, 
    ievt00=2) at CPPProcess.cc:434
#1  0x00007ffff7ed4d7b in mg5amcCpu::sigmaKin_getGoodHel (allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0, isGoodHel=0x458cc0, nevt=128)
    at CPPProcess.cc:787
#2  0x00007ffff7ee0391 in mg5amcCpu::MatrixElementKernelHost::computeGoodHelicities (this=0x457e80) at MatrixElementKernels.cc:70
#3  0x0000000000408f48 in main (argc=6, argv=0x7fffffffdae8) at check_sa.cc:678
(gdb) p jampRi_sv
$1 = -1.9187456e-20
(gdb) p ztempR_sv 
$2 = -1.15124736e-19
(gdb) p jampIi_sv
$3 = -5.37382571e-14
(gdb) p ztempI_sv
$4 = -3.22429529e-13
(gdb)

NB This is for .sa built in debug mode

make cleanall; make -j FPTYPE=f AVX=none debug

valassi commented 6 months ago

I tried to add some protections before computing deltaMEs (essentially, flushing to zero ztemp if it and/or jamp2 is to low). This is not enough, there are then also other FPEs before

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7ed5a2b in mg5amcCpu::cxabs2 (c=...) at ../../src/mgOnGpuVectors.h:896
896         return cxreal( c ) * cxreal( c ) + cximag( c ) * cximag( c );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  0x00007ffff7ed5a2b in mg5amcCpu::cxabs2 (c=...) at ../../src/mgOnGpuVectors.h:896
#1  0x00007ffff7ed448d in mg5amcCpu::calculate_wavefunctions (ihel=0, allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0, 
    jamp2_sv=0x7fffffffc464, ievt00=2) at CPPProcess.cc:336
#2  0x00007ffff7ed56af in mg5amcCpu::_ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfPiS3_i._omp_fn.0(void) () at CPPProcess.cc:1048
#3  0x00007ffff18f2576 in GOMP_parallel () from /lib64/libgomp.so.1
#4  0x00007ffff7ed53a6 in mg5amcCpu::sigmaKin (allmomenta=0x459040, allcouplings=0x45b0c0, allrndhel=0x457940, allrndcol=0x456580, allMEs=0x4576c0, 
    allselhel=0x456800, allselcol=0x456a80, nevt=128) at CPPProcess.cc:1026
#5  0x00007ffff7ee08c7 in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x457e80, channelId=0) at MatrixElementKernels.cc:85
#6  0x0000000000408ff6 in main (argc=6, argv=0x7fffffffdae8) at check_sa.cc:689
(gdb) up
#1  0x00007ffff7ed448d in mg5amcCpu::calculate_wavefunctions (ihel=0, allmomenta=0x459040, allcouplings=0x45b0c0, allMEs=0x4576c0, 
    jamp2_sv=0x7fffffffc464, ievt00=2) at CPPProcess.cc:336
336               jamp2_sv[ncolor * iParity + icolC] += cxabs2( jamp_sv[icolC] );
(gdb) l
331
332           // *** COLOR CHOICE BELOW ***
333           // Store the leading color flows for choice of color
334           if( jamp2_sv ) // disable color choice if nullptr
335             for( int icolC = 0; icolC < ncolor; icolC++ )
336               jamp2_sv[ncolor * iParity + icolC] += cxabs2( jamp_sv[icolC] );
337
338           // *** COLOR MATRIX BELOW ***
339           // (This method used to be called CPPProcess::matrix_1_gg_bbx()?)
340
(gdb) p jamp_sv
$1 = {{m_real = -2.53915787e-05, m_imag = 7.25820684e-08}, {m_real = 3.40938568e-05, m_imag = 3.62874317e-08}, {m_real = -1.9187456e-20, 
    m_imag = -5.37382571e-14}}

valassi commented 6 months ago

I give an update here and will probably resume next week.

This is a very complex issue with numerical precision, testing the limits of our "mixed" precision approach.

This has appeared in one particular HEFT gg_bb with MIW<=1, where one diagram with suppressed contributions is allowed, but potentially could have appeared elsewhere.

The problem is that when we start manipulating numbers in the E-19 to E-16 range things become very difficult. FLT_MIN is 1.2E-38, so the sqrt of that is around E-19. As soon as you multiply two numbers around E-19, you risk getting an underflow because a float does not manage to handle below E-38. So in particular amplitudes (jamp_sv etc) in that E-19 range are complicated, because we must square them.

I would say that there are two possible approaches:

Manually flush-to-zero amplitudes below some threshold, and check that this is enough to make underflow FPEs go away. Yesterday I worked on this approach. It works most of the time, but some corner cases still give FPEs which are very hard to get rid of. More details below.
Another approach could be to try and disable underflow FPEs. Essentially: rather than manually flush-to-zero values below some threshold, rely on the compiler/processor infrastructure for FTZ/DAZ ("flush to zero, denormals are zero") which treats floats below E-28 as zero without throwing SIGFPE. This I have not tried (because I am under the assumption that linking fortran and c++ will automatically switch on all FPEs, as we have seen for invalid/overflow/dividebyzero FPEs), but remains an option.

--

Some details on strategy 1.

After various attempts, I made a patch where a lambda "underflowFTZ" flushes to zero all jamps below E-19. See https://github.com/valassi/madgraph4gpu/commit/a34076b566fe684748c007ef47c2fd3014547bee

For FPTYPE=f, this seems to work, I get no FPEs in any tests (luck?). The FTZ is done immediately after computing Feynman diagrams (which were themselves computed in float).

For FPTYPE=m, I still get a few nasty issues. The FTZ is done a bit later (because jamps are computed in double and must be used in double for some cases), but essentially the color matrix uses jamps which are float, and have also been flushed to zero if below E-19.

The weird corner cases are that:

I still get FPEs only for avx2 and avx512, in no debug mode
That is: in debug mode, I get no FPEs
That is, in no simd and sse4, I get no FPEs

This is similar to the problems I had seen for all the other FPEs from invalid and divide by zero.

I have two questions here

One, why avx2 differs from none (I see an E-23 ztemp only in avx2)
Two, why avx2 in any case gives an E-23 ztemp (which I naively thought would disappear from my FTZ of jamps at E-19)

On one, where I am now is that I am doing some debugging to try and understand why none and avx2 differ, for instance. This is WIP, but it would seem that some jamps (computed in double precision!) differ at the level of E-16?

In my check.exe -p 2 64 2 --common test, for ievt00=75 ihel=3 I get amp_sv[icol=0]

none: DIAGRAM2 (4.05581e-16,1.44726), DIAGRAM3 (-1.44724,5.40774e-16)
avx2: DIAGRAM2 (-5.40774e-16,1.44726) DIAGRAM3 (-1.44724,-6.75968e-16) }

I imagine that this later explains how come a ztemp E-23 appears only in avx2 and not in none. There must be some very string cancellations here and there (both within the calculations of amp_sv from Feynman diagrams, and later in the calculation of ztemp from jamps?).

Anyway: what I will try next week is:

For problem one (avx2 different from none), increase my tolerance, and manually FTZ at E-15 rather than E-19. And hope that this is ok down the line and I get no more ztemp E-23.
For problem two (why do I get ztemp E-23 from jamps E-16), better debug that it is really a ztemp that goes down to E-23. And if this is confirmed, maybe add a manual FTZ for ztemp (an dthen E-19 may be enough).

This might seem an overkill, but I do not think it is. If a jamp E-16 gives ztemp E-23 (i.e. E-7 lower due to cancellations), there are some things we need to better control even if we set a tolerance much higher, with jamp E-12 or E-10. This is especially important if we want to use mixed precision as default.

--

About strategy 2.

The alternative could indeed be to rely on the compiler/processor to do the FTZ, and simply switch off underflow FPEs. We lose some understanding/control, but it may be much easier, if it is possible.

Various notes:

in principle, we can keep overflow, invalid, division by zero fpes, and only selectively disable underflows (note that the intel compiler has some options for that?
It would be useful to do some tests with a simple fortran plus c++ test executable and check which FPEs actually are switched on and off by default
Note that feenableexcept/fedisableexcept is not supported on APPLE
Various compiler options are compiler dependent (and arm and x86 are probably different too)
There are some peculiar flags that can be passsed to SIMD intrinsics specifically for FPEs, so it is not impossible that this explains some of the differences I see between SIMD modes (or why debug vs non debug makes a difference)
Could instruments the FPEhandler to also use

A few interesting links

oliviermattelaer commented 6 months ago

Hi,

I have checked the situation on a pure fortran side to check the physics reason: so for g g > b b~ in the standard model, the full matrix-element square is

 Matrix element =   0.53277385984554848       GeV^           0

in the HEFT it is

 Matrix element =   0.53277205264991745

So the final difference is quite small... (pure higgs contribution is: Matrix element = 7.7547177645227751E-008 GeV^ 0 and interference between higgs and QCD is -1.8847428088393383E-006)

If you look at the jamp level you will see that the heft case has three jamp, while the SM has two. So this "explains" why some jamp are so small (and for some helicity numerically zero).

So yes some way of filtering/ignoring those make sense. So for me strategy 1 or 2 are both valid.

valassi commented 6 months ago

Thanks Olivier. I did a few more tests.

About strategy 2, I wanted to check if it makes sense to enable underflow or not. I went back to try and reproduce the original issues I found in #701. THis is what I did

git reset --hard d1d87b649c661e6ad302e26767baaf7ce70b8058
git cherry-pick da7df0f90c788774646a71962c95a8b3fe544008 10b099ef337289c8b46bacb49d82b5bdebad2760
./CODEGEN/generateAndCompare.sh nobm_pp_ttW
git cherry-pick d4ddce6c26cdf77d1f165deb1985d4ea0963909a d870e34d3ab63e3b5b2ce1a02d065bcf2c88e805
HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW.mad/

This is enough to reproduce the printouts

Using random number seed offset = 21
INFO: Running Survey 
Creating Jobs
Working on SubProcesses
INFO:     P1_gu_ttxwpd  
INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements 
INFO:     P1_gd_ttxwmu  
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
...

Then I even reproduced it here on a subset,

>     nobm_gu_ttxwpd)
>       cmd="import model sm-no_b_mass
>       generate g u > t t~ w+ d"
>       ;;

I instrumented the latter code with what I am now using more generally

inline void fpeEnable()
{
  static bool first = true; // FIXME: quick and dirty hack to do this only once (can be removed when separate C++/CUDA builds are implemented)
  if( ! first ) return;
  first = false;
#ifndef __APPLE__ // on MacOS feenableexcept is not defined #730
  int fpes = fegetexcept();
  std::cerr << "fpeEnable: analyse fegetexcept()=" << fpes << std::endl;
  std::cerr << "fpeEnable:     FE_DIVBYZERO is" << ( ( fpes & FE_DIVBYZERO ) ? " " : " NOT " ) << "enabled" << std::endl;
  std::cerr << "fpeEnable:     FE_INEXACT is" << ( ( fpes & FE_INEXACT ) ? " " : " NOT " ) << "enabled" << std::endl;
  std::cerr << "fpeEnable:     FE_INVALID is" << ( ( fpes & FE_INVALID ) ? " " : " NOT " ) << "enabled" << std::endl;
  std::cerr << "fpeEnable:     FE_OVERFLOW is" << ( ( fpes & FE_OVERFLOW ) ? " " : " NOT " ) << "enabled" << std::endl;
  std::cerr << "fpeEnable:     FE_UNDERFLOW is" << ( ( fpes & FE_UNDERFLOW ) ? " " : " NOT " ) << "enabled" << std::endl;
  const char* enableFPEc = getenv( "CUDACPP_RUNTIME_ENABLEFPE" );
  const bool enableFPE = ( enableFPEc != 0 ) && ( std::string( enableFPEc ) != "" );
  if( enableFPE )
  {
    std::cerr << "fpeEnable: CUDACPP_RUNTIME_ENABLEFPE is set, enable additional FPEs if not already done" << std::endl;
    feenableexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW ); // debug #701
    fpes = fegetexcept();
    std::cerr << "fpeEnable: analyse fegetexcept()=" << fpes << std::endl;
    std::cerr << "fpeEnable:     FE_DIVBYZERO is" << ( ( fpes & FE_DIVBYZERO ) ? " " : " NOT " ) << "enabled" << std::endl;
    std::cerr << "fpeEnable:     FE_INEXACT is" << ( ( fpes & FE_INEXACT ) ? " " : " NOT " ) << "enabled" << std::endl;
    std::cerr << "fpeEnable:     FE_INVALID is" << ( ( fpes & FE_INVALID ) ? " " : " NOT " ) << "enabled" << std::endl;
    std::cerr << "fpeEnable:     FE_OVERFLOW is" << ( ( fpes & FE_OVERFLOW ) ? " " : " NOT " ) << "enabled" << std::endl;
    std::cerr << "fpeEnable:     FE_UNDERFLOW is" << ( ( fpes & FE_UNDERFLOW ) ? " " : " NOT " ) << "enabled" << std::endl;
  }
  else
  {
    std::cerr << "fpeEnable: CUDACPP_RUNTIME_ENABLEFPE is NOT set, do not enable additional FPEs" << std::endl;
  }
#else
  std::cerr << "fpeEnable: fegetexcept and feenableexcept are not available on MacOS, keep default FPE settings" << std::endl;
#endif
}

Now, what happens is that these traps are actually all disabled...

HRDCOD=1 tlau/lauX.sh -CPP nobm_gu_ttxwpd.mad
...
Using random number seed offset = 21
INFO: Running Survey 
Creating Jobs
Working on SubProcesses
INFO:     P1_gu_ttxwpd  
INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements 
INFO:  Idle: 2,  Running: 3,  Completed: 0 [ current time: 22h03 ] 
...
fpeEnable: analyse fegetexcept()=0
fpeEnable:     FE_DIVBYZERO is NOT enabled
fpeEnable:     FE_INEXACT is NOT enabled
fpeEnable:     FE_INVALID is NOT enabled
fpeEnable:     FE_OVERFLOW is NOT enabled
fpeEnable:     FE_UNDERFLOW is NOT enabled
fpeEnable: CUDACPP_RUNTIME_ENABLEFPE is NOT set, do not enable additional FPEs
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
...

I would say that:

well maybe I got it wrong all the time: Fortran does not enable FPE traps, it simply prints out a warning/debug message at the end (on STOP? https://stackoverflow.com/a/39368925)... so these were never errors in Fortran, just warnings?
in any case, it is good to avoid them, and test them in c++ by enabling errors from SIGFPE signals
the Fortran actually was signalling also underflow, as per the output above, so it does make sense to avoid underflows too?
in any case this exercise is useful to understand the limits of numerical precision in FPTYPE=f and m

valassi commented 6 months ago

About strategy 1, some tests seem to show that ztemp E-23 comes from adding/subtracting jamps E-16

calculate_wavefunctions... icol=1
calculate_wavefunctions... icol=2
calculate_wavefunctions: ievt00=72
calculate_wavefunctions: ihel=3
DIAGRAM2 ievt00=72 ihel=3
amp_sv: { (6.75968e-17, 1.27204), (2.0279e-16, 0.978039), (-7.43565e-16, 1.14048), (-5.40774e-16, 1.44726) }
DIAGRAM3 ievt00=72 ihel=3
amp_sv: { (-1.27202, 1.35194e-16), (-0.978028, 4.73178e-16), (-1.14047, -1.48713e-15), (-1.44724, -6.75968e-16) }
AFTER_FEYNMAN ievt00=72 ihel=3 icol=0
jamp_sv: { (-1.64981e-05, -6.75968e-17), (-1.09635e-05, -2.70387e-16), (-1.36621e-05, 7.43565e-16), (-2.33358e-05, 1.35>
DIAGRAM2 ievt00=72 ihel=3
amp_sv: { (-1.68992e-16, 1.19207), (-2.0279e-16, 0.922304), (-1.01395e-16, 1.40348), (3.37984e-17, 1.16874) }
DIAGRAM3 ievt00=72 ihel=3
amp_sv: { (-1.19205, -3.37984e-16), (-0.92222, -4.05581e-16), (-1.40346, -2.0279e-16), (-1.16868, 6.75968e-17) }
AFTER_FEYNMAN ievt00=72 ihel=3 icol=0
jamp_sv: { (-1.46773e-05, 1.68992e-16), (-8.34606e-05, 2.0279e-16), (-2.08334e-05, 1.01395e-16), (-5.96745e-05, -3.3798>
calculate_wavefunctions... icol=0
OFF-DIAGONAL ievt00=72 ihel=3 icol=0 jcol=1
ztempR_sv: { -8.79898e-05, -5.84719e-05, -7.28645e-05, -0.000124457, -7.82787e-05, -0.000445123, -0.000111112, -0.00031>
ztempI_sv: { -3.60516e-16, -1.44207e-15, 3.96568e-15, 7.21033e-16, 9.01291e-16, 1.08155e-15, 5.40774e-16, -1.80258e-16 }
OFF-DIAGONAL ievt00=72 ihel=3 icol=0 jcol=2
ztempR_sv: { -0.000156501, -0.000161565, -0.000155596, -0.000172894, -0.000155289, -0.000458665, -0.000165366, -0.00033>
ztempI_sv: { -2.70387e-16, -1.17168e-15, 2.97426e-15, -2.14884e-23, 6.75968e-16, 8.11162e-16, 4.05581e-16, -1.35194e-16>
BEFORE_DELTAMES2 ievt00=72 ihel=3 icol=0
jampRi_sv: { -1.64981e-05, -1.09635e-05, -1.36621e-05, -2.33358e-05, -1.46773e-05, -8.34606e-05, -2.08334e-05, -5.96745>
ztempR_sv: { -0.000156501, -0.000161565, -0.000155596, -0.000172894, -0.000155289, -0.000458665, -0.000165366, -0.00033>
jampIi_sv: { -6.75968e-17, -2.70387e-16, 7.43565e-16, 1.35194e-16, 1.68992e-16, 2.0279e-16, 1.01395e-16, -3.37984e-17 }
ztempI_sv: { -2.70387e-16, -1.17168e-15, 2.97426e-15, -2.14884e-23, 6.75968e-16, 8.11162e-16, 4.05581e-16, -1.35194e-16>
Floating point exception (core dumped)

Specifically, one ztemp was 7.21033e-16 and then becomes -2.14884e-23 after a sum.

I would say that I also need to flush-to-zero ztemp. In principle this should be enough? (The color matrix itself should have values which are O(1) or similar I hope).

valassi commented 6 months ago

Ok I believe this is finally fixed in MR #832.

I implemented my "strategy 1" above: I also flush-to-zero manually the ztemp if it is below sqrt(flt_min). A delicate point is that in mised mode the number of SIMD elements is neppV2=2*neppV (neppV is the number of double elements for Feynman, neppV2 is the number of float elements for color algebra).

I did a few checks on strategy 2 as discussed above, but I think it's better to avoid the underflows in the code rather than handling them or ignoring them.

Closing this.

valassi commented 6 months ago

Reopening this issue after reviewing MR #832 with @oliviermattelaer (thanks Olivier)

A couple of things to do in particular

(Strategy 1) Investigate doing the FTZ not on absolute jamps, but on relative jamps to the max jamp in each event. Whether we do a manual undrflowFTZ cut or not, this would make the code more robust. To do in practice, print the max jamp in each event for the problematic cases I had. Also, in that case the cut probably should be around E-15 rather than E-19 (15 is the number of digits of a double, so subtracting X-X can give 0 to E-15 precision), in mixed FPTYPE. This however would lead to even more complex code.
(Strategy 2) To avoid even more complex code, maybe easiest is to let the compiler/hardware manage the underflows, and then print out a warning at the end as it seemed to happen in IEEE FPE warnings from fortran. In practice: in feenableexcept, enable only 3 FPES (divbyzero, overflow, invalid) not 4 (remove underflow). Then have a hook at the end to print out (maybe just use fetestexcept/feclearexcept? for instance in the matrixelementkernel destructor... or otherwise as a hook at the end of our well identified executables). By the way, I guess in this case I would also remove the env variable and simply enable those three FPE traps all the time.

valassi commented 5 months ago

About strategy 2, finally I found the relevant info https://stackoverflow.com/questions/44308577/ieee-underflow-flag-ieee-denormal-in-fortran-77 https://gcc.gnu.org/onlinedocs/gfortran/Debugging-Options.html I am doing some small scale tests.

I would say that definitely it seems a good idea to

fix invalid, overflow, divbyzero
ignore underflow
ignore denormal (which cannot be controlled by feclearexcept in C++, so probably means -ffpe-summary=<list> where the list does not include denormal)

valassi commented 5 months ago

well maybe I got it wrong all the time: Fortran does not enable FPE traps, it simply prints out a warning/debug message at the end (on STOP? https://stackoverflow.com/a/39368925)... so these were never errors in Fortran, just warnings?

Yes definitely this is the case, as well explained in https://stackoverflow.com/questions/44308577/ieee-underflow-flag-ieee-denormal-in-fortran-77

I have now tried to add -ffpe-trap=invalid,zero,overflow to the fortran compilation, and the exceptions do send SIGFPE. As a strategy, I will not add these flags to fortran, but rather cal feenableexcept in CPPProcess.cc (so that I get the same behavioir also in c++/cuda only tests)

valassi commented 5 months ago

I have now implemented a new strategy for this #831 in PR #832. See https://github.com/madgraph5/madgraph4gpu/pull/832#issuecomment-2053717726

Essentially:

do not fix underflows, let them happen and just report them
by default send SIGFPE on invalid, divbyzerlo, overflow
I have not modified any compiler options in fortran or c++

I think this is now really resolved and can be closed. Will be fixed in #832

valassi commented 5 months ago

By the way, as mentioned in #832: FLT_MIN/10 is a denormal that is still usable, even if not exact. It fires underflow, but it is better than a zero.

madgraph5 / madgraph4gpu

Underflow Floating Point Exception for float and mixed FPTYPE in HEFT gg to bb #831