madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Fix SIGFPE crash (#855) in rotxxx by adding volatile in aloha_functions.f #857

Closed valassi closed 3 months ago

valassi commented 4 months ago

Hi @oliviermattelaer I think this is ready for review, can you please have a look?

This fixes the SIGFPE crash in rotxxx (#855, also mentioned in #826 - and maybe this is also related to what you discuss as a crash in #852?). It uses an ugly workaround: just adding 'volatile' to disable optimizations (from SIMD?) deep inside the Fortran code. Ugly, but it works, and I think it has no performance penalty here. And this is exactly the same technique I used last year to fix many SIGFPEs in the cudacpp ixx/oxx functions.

It also slightly modifies my test scripts to enable testing a config which is not 1. I do not understand why the SIGFPE crash only seems to happen with some specific iconfig (which may also have other issues), but it seems quite clear that the two issues are independent. Rephrasing: I would add volatile in any case, and then go on and investigate channel/iconfig mappings as in your #853.

Note, this IMO brings us one step of progress, but other issues remain open

I will still run large scale tests tonight, but the code will almost certainly not change. So this is ready for review. Thanks. Andrea

valassi commented 4 months ago

Hi @oliviermattelaer I completed my tests and I confirm that adding volatile fixes this SIGFPE in rotxxx (the code was crashing without volatile and no longer crashes with it).

There is no performance change, or at most a 1-2% negligible change. See 346c9df7b195fa61ba5850533aef3eb350694358

Can you please review and approve? Thanks Andrea

valassi commented 4 months ago

I have updated this after the merge of #860. I will rerun a few tests tonight.

@oliviermattelaer this is also ready for review and merge in any case. Thanks

In practice, the only functional change is here. This prevents optimizations and avoids a crash in rotxxx.

diff --git b/epochX/cudacpp/gg_tt.mad/Source/DHELAS/aloha_functions.f a/epochX/cudacpp/gg_tt.mad/Source/DHELAS/aloha_functions.f
index 657387a58..d0ec1dbde 100644
--- b/epochX/cudacpp/gg_tt.mad/Source/DHELAS/aloha_functions.f
+++ a/epochX/cudacpp/gg_tt.mad/Source/DHELAS/aloha_functions.f
@@ -1201,7 +1201,7 @@ c       real    prot(0:3)      : four-momentum p in the rotated frame
 c
       implicit none
       double precision p(0:3),q(0:3),prot(0:3),qt2,qt,psgn,qq,p1
-
+      volatile qt, p1, qq ! prevent optimizations with -O3 (workaround for SIGFPE #855)
       double precision rZero, rOne
       parameter( rZero = 0.0d0, rOne = 1.0d0 )

oliviermattelaer commented 4 months ago

Hi,

I have some doubts that this is needed (I do not need such a line on the haswell/M1 architectures). Also, this is a pure Fortran line that does not crash in pure Fortran... So I would wait until we agree on #852 (or any derivative that you want) before agreeing on this one.

Anyway, on which architecture are you compiling (and with which version of GCC), so that I can try to reproduce this?

valassi commented 4 months ago

I have some doubts that this is needed (I do not need such a line on the haswell/M1 architectures).

Hi Olivier, thanks for the reply.

I disagree, I think we need this - or at least: I have clear evidence that on one platform it crashes without this patch and does not crash with the patch.

And: the crash #855 is clearly in rotxxx.

Also: the cudacpp code is filled with these patches. All the SIGFPE patches I developed last year use this. If anyone has a better solution let's discuss it, but having crashes fixed in the meantime is top priority.

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> grep -i volatile gg_tt.mad -r
gg_tt.mad/SubProcesses/testxxx.cc:    volatile fptype m2 = fpmax( p0 * p0 - p1 * p1 - p2 * p2 - p3 * p3, 0 ); // see #736
gg_tt.mad/src/mgOnGpuVectors.h:  fpsqrt( const volatile fptype_v& v ) // volatile fixes #736
gg_tt.mad/src/mgOnGpuVectors.h:      volatile fptype outi = 0; // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:    // Variables xxxDENOM are declared as 'volatile' to make sure they are not optimized away on clang! (#724)
gg_tt.mad/src/HelAmps_sm.h:    // A few additional variables are declared as 'volatile' to avoid sqrt-of-negative-number FPEs (#736)
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv p2 = pvec1 * pvec1 + pvec2 * pvec2 + pvec3 * pvec3; // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. );    // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v pp3DENOM = fpternary( pp3 != 0, pp3, 1. ); // hack: pp3DENOM[ieppV]=1 if pp3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v chi0r2 = pp3 * 0.5 / ppDENOM;              // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv p0p3 = fpmax( pvec0 + pvec3, 0 ); // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. and pvec2 == 0. and pvec3 < 0. ),
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3DENOM = fpternary( sqp0p3 != 0, (fptype_sv)sqp0p3, 1. ); // hack: dummy sqp0p3DENOM[ieppV]=1 if sqp0p3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:    // Variables xxxDENOM are declared as 'volatile' to make sure they are not optimized away on clang! (#724)
gg_tt.mad/src/HelAmps_sm.h:    // A few additional variables are declared as 'volatile' to avoid sqrt-of-negative-number FPEs (#736)
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv pt2 = ( pvec1 * pvec1 ) + ( pvec2 * pvec2 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv p2 = pt2 + ( pvec3 * pvec3 ); // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. ); // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ptDENOM = fpternary( pt != 0, pt, 1. );                                                     // hack: ptDENOM[ieppV]=1 if pt[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv pt2 = pvec1 * pvec1 + pvec2 * pvec2; // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ptDENOM = fpternary( pt != 0, pt, 1. );                             // hack: ptDENOM[ieppV]=1 if pt[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:    // Variables xxxDENOM are declared as 'volatile' to make sure they are not optimized away on clang! (#724)
gg_tt.mad/src/HelAmps_sm.h:    // A few additional variables are declared as 'volatile' to avoid sqrt-of-negative-number FPEs (#736)
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv p2 = pvec1 * pvec1 + pvec2 * pvec2 + pvec3 * pvec3; // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. );    // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v pp3DENOM = fpternary( pp3 != 0, pp3, 1. ); // hack: pp3DENOM[ieppV]=1 if pp3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v chi0r2 = pp3 * 0.5 / ppDENOM;              // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv p0p3 = fpmax( pvec0 + pvec3, 0 ); // volatile fixes #736
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. and pvec2 == 0. and pvec3 < 0. ),
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v sqp0p3DENOM = fpternary( sqp0p3 != 0, (fptype_sv)sqp0p3, 1. ); // hack: sqp0p3DENOM[ieppV]=1 if sqp0p3[ieppV]==0
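
For illustration, the xxxDENOM idiom that appears throughout this grep output works roughly as follows (a minimal standalone C++ sketch of the same pattern, not the actual HelAmps_sm.h code):

// Sketch only: mimics the ppDENOM / chi0r2 pattern from HelAmps_sm.h above.
#include <cstdio>
int main()
{
  double pp = 0.; // degenerate input that would otherwise give a zero denominator
  // Replace a possibly-zero denominator by 1 before dividing; 'volatile' prevents
  // the compiler from folding this protection back into the division (or removing
  // it altogether) when optimizing at -O3.
  volatile double ppDENOM = ( pp != 0. ? pp : 1. ); // hack: ppDENOM=1 if pp==0
  double chi0r2 = pp * 0.5 / ppDENOM;               // safe: never divides by zero
  std::printf( "chi0r2 = %f\n", chi0r2 );
  return 0;
}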

About the platform dependency, haswell vs m1 etc: I am really speculating here, but I think that this (at least, my cudacpp ixxx/oxxx volatile fixes last year were of this sort) depends on optimizations of SIMD instructions, which are by definition processor dependent. So I am not at all surprised that different processors give different behaviours - I would be surprised by the contrary.

For Fortran, in addition, I'd happily disable all SIMD instead, as in any case we do not need it. Or, as I said somewhere else, an alternative to using volatile is just using -O1 or -O0 globally. Essentially volatile just does this locally, which is less intrusive.

Anyway. Let's discuss it in one month if you prefer. But leaving this unpatched is a bad idea in my opinion, when a clearly demonstrated, reproducible, functional workaround exists.

Thanks!

valassi commented 4 months ago

Anyway, on which architecture are you compiling (and with which version of GCC), so that I can try to reproduce this?

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> cat /proc/cpuinfo | grep model | sort -u
model           : 85
model name      : Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> uname -a
Linux itscrd90.cern.ch 5.14.0-284.18.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 29 17:06:27 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> cat /etc/redhat-release 
AlmaLinux release 9.2 (Turquoise Kodkod)
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> gcc --version
gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I can easily give you access if you want...

valassi commented 4 months ago

PS I think this is where I initially started using volatile: https://github.com/madgraph5/madgraph4gpu/issues/727#issuecomment-1643657582

Ouff I finally found another exotic solution: the 'volatile' keyword.

My reasoning was, I wanted to make sure I could outsmart the smart compiler and prevent it from doing stupid things.

* I looked at the gcc compiler options and realised this probably had to do with DCE (dead code elimination), https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, but I could not find an easy way to disable that

* I googled around and found this interesting presentation on DCE (thanks to the author!): https://www.inf.ed.ac.uk/teaching/courses/ct/19-20/slides/llvm-4-deadcode.pdf... this mentions the volatile keyword, which I had never really investigated

* Indeed, 'volatile' seems to be what I wanted: a way to tell the compiler "hey you, you MUST compute this variable even if you think you can optimize it away"...

* and, it seems to work :-)

Olivier, you seemed to like volatile for cuda/cpp in #727. In Fortran it is just the same ;-)
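
For illustration, the #736-style fixes visible in the grep output above follow the same logic (a minimal standalone C++ sketch, not the actual cudacpp code); the Fortran volatile added in this PR plays exactly the same role of telling the compiler "you MUST keep this variable as written":

// Sketch only: mimics the p0p3 / sqp0p3 pattern from HelAmps_sm.h.
#include <algorithm>
#include <cmath>
#include <cstdio>
int main()
{
  double pvec0 = 1., pvec3 = -1.0000000001; // the sum is slightly negative, as rounding could make it
  // Clamp to zero before the sqrt; 'volatile' forces the compiler to keep this
  // intermediate as written, so the clamp cannot be merged or optimized away in a
  // way that would reintroduce a sqrt-of-negative FPE at -O3.
  volatile double p0p3 = std::max( pvec0 + pvec3, 0. );
  double sqp0p3 = std::sqrt( p0p3 );
  std::printf( "sqp0p3 = %f\n", sqp0p3 );
  return 0;
}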

valassi commented 4 months ago

About the platform dependency, haswell vs m1 etc: I am really speculating here, but I think that this (at least, my cudacpp ixxx/oxxx volatile fixes last year were of this sort) depends on optimizations of SIMD instructions, which are by definition processor dependent. So I am not at all surprised that different processors give different behaviours - I would be surprised by the contrary.

Last comment. I tried to look back at what I had read last year that convinced me of this, but I cannot find it.

Anyway, googling 'floating point exception simd' gives a few interesting reads.

And so does googling 'mxcsr simd volatile'.
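
For what it's worth, the mechanism I have in mind can be sketched as follows (a standalone C++ illustration; the assumption, implied by the fact that a SIGFPE is raised at all, is that FP exception trapping is enabled somewhere in the application):

// Sketch only: why a division that is never logically reached can still crash.
#include <fenv.h> // feenableexcept is a glibc extension declared here
#include <cstdio>
int main()
{
  feenableexcept( FE_DIVBYZERO ); // unmask the divide-by-zero trap (FP control word / MXCSR)
  double qt = 0., p1 = 1.;
  // Logically the division is guarded and never executed when qt == 0 ...
  double prot = ( qt != 0. ? p1 / qt : p1 );
  // ... but an optimizer that turns the branch into a branchless or SIMD select may
  // evaluate p1/qt anyway for the qt == 0 case, and with the trap unmasked that
  // speculative division raises SIGFPE. In practice, marking qt 'volatile' (as the
  // Fortran patch in this PR does for rotxxx) discourages exactly this transformation.
  std::printf( "prot = %f\n", prot );
  return 0;
}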

oliviermattelaer commented 4 months ago

Hi Andrea (for after your holiday),

Thanks for giving me access to your machine for the test. I have tested PR #850 on your machine and I do not reproduce your crash in rotxxx. When you are back from holiday, could you reproduce this yourself, since, to me, it seems that this crash is due to a memory corruption related to coloramps.h

valassi commented 3 months ago

Hi Andrea (for after your holiday),

Thanks for giving me access to your machine for the test. I have tested PR #850 on your machine and I do not reproduce your crash in rotxxx. When you are back from holiday, could you reproduce this yourself, since, to me, it seems that this crash is due to a memory corruption related to coloramps.h

Hi Olivier, I am back. Thanks a lot for trying to reproduce this.

I tried again in a supposedly clean environment and I reproduce it all the time.

I created the following file /data/olmattel/bug855cpp.sh in a directory for you

# SOURCE THIS FILE FROM A CLEAN ENVIRONMENT (/bin/bash --norc --noprofile)
echo "DEBUG: print out env..."
env
echo "DEBUG: print out env... DONE"

cd /data/olmattel/BUG855
rm -rf madgraph4gpu
git clone https://github.com/madgraph5/madgraph4gpu.git
cd madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
export FC=/usr/bin/gfortran
make cleanall
make -f cudacpp.mk gtestlibs
make -j BACKEND=cppnone -f cudacpp.mk debug
make -j BACKEND=cppnone
cat > input_cudacpp_104 << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
104 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
gdb -ex 'run < input_cudacpp_104' -ex where -ex 'set confirm off' -ex quit ./madevent_cpp

You should normally just

ssh olmattel@itscrd90
bash --norc --noprofile
. /data/olmattel/bug855cpp.sh

and the crash should appear

I test it with sudo as

sudo bash
su -P olmattel -c 'bash --norc --noprofile'
. /data/olmattel/bug855cpp.sh 

And this is what I get

...
 Ranmar initialization seeds       27505        9395

Program received signal SIGFPE, Arithmetic exception.
0x000000000044b5ff in rotxxx_ ()
#0  0x000000000044b5ff in rotxxx_ ()
#1  0x00000000004087e0 in gentcms_ ()
#2  0x0000000000409849 in one_tree_ ()
#3  0x000000000040bb84 in gen_mom_ ()
#4  0x000000000040d1aa in x_to_f_arg_ ()
#5  0x000000000045c805 in sample_full_ ()
#6  0x000000000043426a in MAIN__ ()
#7  0x000000000040371f in main ()

Can you confirm? Thanks Andrea

valassi commented 3 months ago

By the way:

Program received signal SIGFPE, Arithmetic exception.
rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
1247              prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
#0  rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
#1  0x00000000004087e0 in gentcms (pa=..., pb=..., t=-181765.47706865534, phi=0.64468537567405615, ma2=0, 
    m1=234.1712866912786, m2=210.15563843880372, p1=..., pr=..., jac=3.0327734872026782e+25) at genps.f:1480
#2  0x0000000000409849 in one_tree (itree=..., tstrategy=<optimized out>, iconfig=104, nbranch=4, p=..., m=..., s=..., x=..., 
    jac=3.0327734872026782e+25, pswgt=1) at genps.f:1167
#3  0x000000000040bb84 in gen_mom (iconfig=104, mincfig=104, maxcfig=104, invar=10, wgt=0.03125, x=..., p1=...) at genps.f:68
#4  0x000000000040d1aa in x_to_f_arg (ndim=10, iconfig=104, mincfig=104, maxcfig=104, invar=10, wgt=0.03125, x=..., p=...)
    at genps.f:60
#5  0x000000000045c805 in sample_full (ndim=10, ncall=32, itmax=1, itmin=1, dsig=0x438aa0 <dsig>, ninvar=10, nconfigs=1, 
    vecsize_used=16384) at dsample.f:172
#6  0x000000000043426a in driver () at driver.f:256
#7  0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:301
#8  0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#9  0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x0000000000403845 in _start ()

valassi commented 3 months ago

Note: I finally managed to add my tmad tests to the CI in https://github.com/madgraph5/madgraph4gpu/pull/794. Interestingly, many tests fail with rotxxx crashes (even in situations where my manual tests were not failing?). Example: https://github.com/madgraph5/madgraph4gpu/actions/runs/9686490186/job/26729084480#step:12:182

oliviermattelaer commented 3 months ago

Hi,

As said in private by email, this should be done in the madgraph5 repo and not via a patch here. Otherwise, I have no objection to including this "volatile" so that you can move forward.

But to be honest, I'm worried about why we suddenly have such a SIGFPE here (and not in the pure Fortran code). It does not make any sense and likely means that we are not fixing the real issue but only hiding some of the symptoms. So let's move forward by 1) doing another PR in the mg5amcnlo repo, 2) keeping in mind that we likely have another bug somewhere ...

Cheers,

Olivier

valassi commented 3 months ago

Hi,

As said in private by email, this should be done in the madgraph5 repo and not via a patch here. Otherwise, I have no objection to including this "volatile" so that you can move forward.

But to be honest, I'm worried about why we suddenly have such a SIGFPE here (and not in the pure Fortran code). It does not make any sense and likely means that we are not fixing the real issue but only hiding some of the symptoms. So let's move forward by

1. doing another PR in the mg5amcnlo repo

2. keeping in mind that we likely have another bug somewhere ...

Cheers,

Olivier

Thanks Olivier!

OK I will do the rotxxx patch in the upstream mg5amcnlo. I reopen this one anyway (as WIP). I will include here the upgrade to the new mg5amcnlo when it is done. (And I am curious to see already what my new CI gives once the volatile keywords are included.)

valassi commented 3 months ago

But to be honest, I'm worried about why we suddenly have such a SIGFPE here (and not in the pure Fortran code).

On this specific point: note that there is one difference:

It is far-fetched, but one possible explanation is

valassi commented 3 months ago

I have merged the latest upstream/master and have rerun the CI which now includes tmad tests.

In upstream/master, the CI had 12 failures: 9 from rotxxx crashes, and 3 from no cross section in susy.

In this branch, the 9 rotxxx failures have now disappeared, while the 3 'no cross section' failures have stayed. THIS SHOWS THAT ADDING VOLATILE DOES WORK. In addition, 3 new issues have appeared in pp_tt012j (#872).

(Note also: after removing rotxxx crashes, I was expecting other issues to appear, but they did not apart from #872 above. In particular I thought that #856 would appear. Maybe these only exist in manual tests, I will cross check... or maybe for that one I need to use iconfig 104.)

valassi commented 3 months ago

OK I will do the rotxxx patch in the upstream mg5amcnlo. I reopen this one anyway (in WIP). I will include here the upgrade to the new mg5amcnlo when it is done. (And I am curious to see already what my new CI gives if the volatile are included).

I have now completed the move of 'volatile' from cudacpp patch.common to the upstream mg5amcnlo in the gpucpp branch, pending this PR https://github.com/mg5amcnlo/mg5amcnlo/pull/113

The generated code is the same (well, I just changed a comment), and the results do not change.

As mentioned above, the rotxxx crashes disappear from the CI, while a new xsec discrepancy #872 appears. I will also try to make issue #856 appear in the CI (and possibly another one).

This will soon be ready for review.

valassi commented 3 months ago

I configured the tmad test for gg_ttgg to use iconfig=104. Issue https://github.com/madgraph5/madgraph4gpu/issues/856 (LHE color mismatch in gg_ttgg for iconfig=104) is now appearing in the CI. (NB THIS ONLY APPEARS IF ROTXXX IS FIXED) See https://github.com/madgraph5/madgraph4gpu/issues/856#issuecomment-2194973059

NB: there are now 9 errors in the CI, three fptypes for each of the following three issues

valassi commented 3 months ago

I have added the 3x3 issues mentioned above to the list of known issues to bypass. In that case, the CI tests succeed:

[image: screenshot of the CI tests succeeding]

I will now disable these bypasses (so that tests fail in the CI). Then this is ready to be merged.

valassi commented 3 months ago

I have disabled bypasses, now 9 tests fail as expected. This is ready to be merged.

@oliviermattelaer can you please review/approve? NB: there is no CODEGEN change. Effectively this is ONLY the update to the upstream including volatile (and regenerated code, plus history and logs).

oliviermattelaer commented 3 months ago

Yes let's merge this,

Olivier

valassi commented 3 months ago

Thanks! I will merge this now. Andrea