madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Single-precision average ME is not the same for CUDA and C++ (ggttgg and eemumu) #212

Open valassi opened 3 years ago

valassi commented 3 years ago

As discussed in PR #211, the single-precision average ME is not the same for CUDA and C++ in ggttgg.

See for instance https://github.com/valassi/madgraph4gpu/commit/a75ee3b6ba38d0be49f294c241a5e8b0682c84df#diff-45e40fdc2f6b7c71419c9f5e7e36267d7951e21c32488d6ecf35de3ec28ced57

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.610975e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.059594e+00 +- 2.368052e+00 )  GeV^-4
TOTAL       :     5.920932 sec
    15,536,792,487      cycles                    #    2.654 GHz
    28,689,538,755      instructions              #    1.85  insn per cycle
       6.207201648 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.786471e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.060118e+00 +- 2.367901e+00 )  GeV^-4
TOTAL       :     9.183867 sec
    24,604,689,155      cycles                    #    2.677 GHz
    73,872,471,813      instructions              #    3.00  insn per cycle
       9.193035302 seconds time elapsed
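
Part of a single-precision discrepancy is expected simply because floating-point addition is not associative: the CUDA and C++ builds accumulate the per-event MEs in a different order (and the compilers may contract multiplies and adds into FMAs differently), so the last digits of a float mean can legitimately differ. A minimal standalone illustration (not madgraph4gpu code, just arbitrary values):

#include <cstdio>
#include <vector>

// Sum the same float values in two different orders: the results can differ
// in the last bits, because float addition is not associative.
int main()
{
  std::vector<float> me;
  for( int i = 0; i < 100000; i++ ) me.push_back( 1.f / ( 1.f + i ) ); // arbitrary values
  float sumForward = 0.f, sumBackward = 0.f;
  for( size_t i = 0; i < me.size(); i++ ) sumForward += me[i];
  for( size_t i = me.size(); i > 0; i-- ) sumBackward += me[i - 1];
  printf( "forward  sum = %.8e\n", sumForward );
  printf( "backward sum = %.8e\n", sumBackward );
  printf( "difference   = %.8e\n", sumForward - sumBackward );
  return 0;
}

With float, reordering alone typically affects only the last couple of significant digits, so it may not explain all of the ggttgg difference above, but it sets a baseline below which CUDA and C++ cannot be expected to agree.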

In double precision, the results are similar to those above but not identical, and the CUDA and C++ values agree with each other to more digits: https://github.com/valassi/madgraph4gpu/commit/33e7c04ecddb596ee7eba390f0a55435a31e6287#diff-45e40fdc2f6b7c71419c9f5e7e36267d7951e21c32488d6ecf35de3ec28ced57

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.438062e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     5.929722 sec
    14,377,877,684      cycles                    #    2.653 GHz
    24,406,140,862      instructions              #    1.70  insn per cycle
       6.229614368 seconds time elapsed

FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825369e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.991304 sec
    24,089,227,557      cycles                    #    2.677 GHz
    73,968,938,757      instructions              #    3.07  insn per cycle
       8.999893583 seconds time elapsed

Note that for eemumu, in single precision the same average ME is printed out (if I remember correctly?)

NO, I remembered wrong. For eemumu, on MANY more events, I get a different number of NaNs, and as a consequence also a different average ME: https://github.com/madgraph5/madgraph4gpu/commit/7173757e7575bc946f27ab93ed8a121d387bbfee#diff-6716e7ab4317b4e76c92074d38021be37ad0eda68f248fb16f11e679f26114a6

On lxplus770.cern.ch (T4):
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.3.58]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.304735e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     1.016515 sec
real    0m1.137s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 8.3.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.190025e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.257611 sec
real    0m7.274s
=Symbols in CPPProcess.o= (~sse4:  540) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 8.3.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 9.141683e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.633856 sec
real    0m2.651s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2941) (512y:   89) (512z:    0)
-------------------------------------------------------------------------

So there is clearly some numerical precision issue to investigate also for eemumu.
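
To see how a handful of NaN/abnormal events can shift the average, here is a minimal sketch with hypothetical numbers (not the actual eemumu events), computing the mean ME while skipping abnormal values; the printed MeanMatrixElemValue presumably excludes the abnormal events in a similar way, otherwise the mean itself would be NaN:

#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical example: two runs over the same 10 events, where one run
// produces one extra NaN. Skipping abnormal values when averaging then
// yields a slightly different mean ME, as seen in the eemumu logs above.
double meanSkippingAbnormal( const std::vector<double>& me, int& nAbnormal )
{
  double sum = 0.;
  int nGood = 0;
  nAbnormal = 0;
  for( double x : me )
  {
    if( std::isnan( x ) || std::isinf( x ) ) { nAbnormal++; continue; } // skip NaN/Inf
    sum += x;
    nGood++;
  }
  return nGood > 0 ? sum / nGood : 0.;
}

int main()
{
  std::vector<double> runA = { 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.9, NAN };
  std::vector<double> runB = runA;
  runB[3] = NAN; // runB has one more abnormal event than runA
  int nAbnA, nAbnB;
  printf( "runA: mean=%f (abnormal=%d)\n", meanSkippingAbnormal( runA, nAbnA ), nAbnA );
  printf( "runB: mean=%f (abnormal=%d)\n", meanSkippingAbnormal( runB, nAbnB ), nAbnB );
  return 0;
}

(Note that this sketch relies on std::isnan, which only works reliably when fast math is disabled; see the bit-pattern alternative in the next comment.)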

valassi commented 2 years ago

(This is related to #5 by the way).

A quick update on this after a few months. The issue is still there: in single precision, there are a few NaNs, for example https://github.com/madgraph5/madgraph4gpu/blob/a698c62b25b3c89d0b1e9567de06e97a514b8586/epochX/cudacpp/tput/logs_eemumu_manu/log_eemumu_manu_f_inl0_hrd0.txt#L112

I had even done some minimal debugging at some point (mainly to understand how to detect "NaN" at all when fast math is enabled!). See https://github.com/madgraph5/madgraph4gpu/blob/a698c62b25b3c89d0b1e9567de06e97a514b8586/epochX/cudacpp/ee_mumu/SubProcesses/CrossSectionKernels.cc#L129
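
For reference, the reason plain std::isnan is not enough is that with -ffast-math (and the equivalent fast-math option in nvcc) the compiler is allowed to assume NaNs never occur and may fold the check away. One workaround, shown below as a minimal sketch and not necessarily what CrossSectionKernels.cc actually does, is to inspect the IEEE 754 bit pattern directly:

#include <cstdint>
#include <cstring>

// Bit-pattern NaN check that is not optimised away under -ffast-math
// (std::isnan may be folded to 'false' when the compiler assumes no NaNs).
inline bool fpIsNan( double x )
{
  uint64_t bits;
  std::memcpy( &bits, &x, sizeof( bits ) ); // type-pun safely via memcpy
  const uint64_t exponent = ( bits >> 52 ) & 0x7FF; // 11 exponent bits
  const uint64_t mantissa = bits & 0xFFFFFFFFFFFFFULL; // 52 mantissa bits
  return exponent == 0x7FF && mantissa != 0; // all-ones exponent, nonzero mantissa => NaN
}

// NaN or +-Inf ("abnormal"): all-ones exponent, any mantissa.
inline bool fpIsAbnormal( double x )
{
  uint64_t bits;
  std::memcpy( &bits, &x, sizeof( bits ) );
  return ( ( bits >> 52 ) & 0x7FF ) == 0x7FF;
}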

There is some interesting work to be done here, although it is largely debugging.

This is not an academic exercise. The final goal of this study is to understand whether the matrix element calculations can be moved from double to single precision. This would mean a factor 2 speedup both in vectorized C++ (twice as many elements in SIMD vectors) and in CUDA (typically, twice as many FLOPs on Nvidia data center cards).
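
As a back-of-the-envelope check of the "factor 2" claim on the C++ side, the lane count of a fixed-width SIMD register doubles when fptype goes from double to float. A minimal sketch (the USE_FLOAT macro and the 256-bit width here are illustrative assumptions, not the repo's actual build switches):

#include <cstdio>

// Hypothetical precision switch, only for this illustration.
#ifdef USE_FLOAT
typedef float fptype;
#else
typedef double fptype;
#endif

int main()
{
  const int registerBits = 256; // e.g. AVX2 / "512y" 256-bit ymm registers
  const int lanes = registerBits / ( 8 * (int)sizeof( fptype ) );
  // 4 lanes for double, 8 lanes for float at 256 bits: hence ~2x ME throughput.
  printf( "fptype is %zu bytes => %d SIMD lanes per %d-bit register\n",
          sizeof( fptype ), lanes, registerBits );
  return 0;
}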

valassi commented 2 years ago

(This is also related to #117 where fast math first appeared..)

valassi commented 2 years ago

I have just made a small test in a PR that I am about to merge: https://github.com/madgraph5/madgraph4gpu/pull/379/commits/45b7b3303d8e700b21bbf66eab4ba334b01a39e4

I have disabled fast math in eemumu and ran both double and float; results: