madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Validate clang-style "no cxtype ref" vectorization and use it as default #172

Closed valassi closed 3 years ago

valassi commented 3 years ago

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.

There are presently two slightly different vectorisation implementations.

In both implementations (taking double with AVX2, i.e. 4 doubles per vector, as an example):

A small difference between the two implementations is the following:

However:

So this issue is just about understanding whether there is a bug, and where. Maybe I just read the results the wrong way and there is no issue.

valassi commented 3 years ago

Compare the logs of these two commits:

With cxtype_ref, gcc/double https://github.com/madgraph5/madgraph4gpu/commit/8edae31eae6b10b9bb5269b567a3ebcf5e91d8e4

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.305527e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.191895 sec
real    0m7.202s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.118856e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.908404 sec
real    0m1.201s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.531723e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.845304 sec
real    0m4.855s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

Without cxtype_ref, gcc/double https://github.com/madgraph5/madgraph4gpu/commit/4d6870dd5c678d5aabc65bc5fbe1358c05f75e6f

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306067e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.003754 sec
real    0m7.011s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.188863e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.148541 sec
real    0m1.448s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.489082e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.705370 sec
real    0m4.713s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

The relevant lines are:

MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0

They should be strictly identical, not just statistically compatible.

valassi commented 3 years ago

A small comment: I was so sure that this should not make a difference in the 'none' implementation that I did not print out the tag "[cxtype_ref=YES]" or "[cxtype_ref=NO]" in that case. Maybe better to add it. Well, one of the things to cross-check...

valassi commented 3 years ago

This is peculiar. I cannot reproduce it.

I went back to https://github.com/madgraph5/madgraph4gpu/commit/4d6870dd5c678d5aabc65bc5fbe1358c05f75e6f, which for gcc was giving 1.372113e-02; now I instead get the expected 1.371706e-02... I also checked the same commit with clang, and there I do get 1.372113e-02 (and the printout says clang, so it's not a mismatch in the compiler printout).

Did I mix random numbers from two compilers?...

In any case, with the current latest master https://github.com/madgraph5/madgraph4gpu/commit/da19d3c62424be65184d5cb2d0432b996f38882f, I get the expected 1.371706e-02 on gcc and 1.372113e-02 on clang.

valassi commented 3 years ago

This is completely understood now: there is no bug. The problem is that clang11 and clang12 are not supported by cuda11, so in that case I build with common random numbers, which of course give different physics. Once I checked that the first 32 random numbers were different and that the curand seeds are not used, it was obvious... It is not clear why I saw it with gcc at some point; maybe I built it with my usual "export CUDA_HOME=invalid" hack that I need on clang 11 and 12.

About the second issue, whether the clang version can also be used in production for gcc: this is now validated, so one could use that version. However, it is very slightly slower (a few permille), and I like the original operator[] idea. I will keep things as they are for the moment.

En passant, I have validated the latest cvmfs installs of clang11.1 and clang12.0 in issue #182.

In a PR #187 I have committed a few tests and minor patches.

This can be closed. Not a bug.

valassi commented 1 month ago

See additional comments in #1004. There were issues in the bracket implementation on gcc14.2 (now fixed), so one can ask whether we should use the 'clang' no-bracket version also in gcc. I still prefer to keep the bracket version in gcc for now.