madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Validate clang-style "no cxtype ref" vectorization and use it as default #172

Closed valassi closed 3 years ago

valassi commented 3 years ago

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.

There are presently two slightly different vectorisation implementations.

In both implementations (taking double with AVX2, i.e. 4 doubles per vector, as an example):

A small difference between the two implementations is the following:

However:

So this issue is just about understanding whether there is a bug, and where. Maybe I just read the results the wrong way and there is no issue.

valassi commented 3 years ago

Compare the logs of these two commits:

With cxtype_ref, gcc/double https://github.com/madgraph5/madgraph4gpu/commit/8edae31eae6b10b9bb5269b567a3ebcf5e91d8e4

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.305527e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.191895 sec
real    0m7.202s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.118856e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.908404 sec
real    0m1.201s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.531723e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.845304 sec
real    0m4.855s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

Without cxtype_ref, gcc/double https://github.com/madgraph5/madgraph4gpu/commit/4d6870dd5c678d5aabc65bc5fbe1358c05f75e6f

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306067e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.003754 sec
real    0m7.011s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.188863e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.148541 sec
real    0m1.448s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.489082e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.705370 sec
real    0m4.713s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

The relevant lines are:

MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0

They should be strictly identical, not just statistically compatible.

valassi commented 3 years ago

A small comment: I was so sure that this should not make a difference in the 'none' implementation that I did not print out the tag "[cxtype_ref=YES]" or "[cxtype_ref=NO]" in that case. Maybe better to add it. Well, one of the things to cross-check...

valassi commented 3 years ago

This is peculiar. I cannot reproduce it.

I went back to https://github.com/madgraph5/madgraph4gpu/commit/4d6870dd5c678d5aabc65bc5fbe1358c05f75e6f, which for gcc was giving 1.372113e-02; now I instead get the expected 1.371706e-02... I also checked the same commit with clang, and there I do get 1.372113e-02 (and the printout says clang, so it's not a mismatch in the compiler printout).

Did I mix random numbers from two compilers?...

In any case, with the current latest master https://github.com/madgraph5/madgraph4gpu/commit/da19d3c62424be65184d5cb2d0432b996f38882f, I get the expected 1.371706e-02 on gcc and 1.372113e-02 on clang.

valassi commented 3 years ago

This is completely understood now: there is no bug. The problem is that clang11 and clang12 are not supported by cuda11, so in that case I build with common random numbers, which of course give different physics. Once I checked that the first 32 random numbers were different and that the curand seeds are not used, it was obvious... It is not clear why I saw it with gcc at some point; maybe I built it with my usual "export CUDA_HOME=invalid" hack that I need on clang 11 and 12.

About the second issue, whether the clang version can also be used in production for gcc: this is now validated, so one could use that version. However, it is very slightly slower (a few permille), and I like the original operator[] idea. I will keep things as they are for the moment.

En passant, I have validated the latest cvmfs installs of clang11.1 and clang12.0 in issue #182.

In a PR #187 I have committed a few tests and minor patches.

This can be closed. Not a bug.

valassi commented 1 month ago

See additional comments in #1004. There were issues in the bracket implementation on gcc14.2 (now fixed), so one can ask whether we should use the 'clang' no-bracket version also in gcc. I still prefer to keep the bracket version in gcc for now.