Single precision color algebra

valassi commented 2 years ago

As discussed at the hackathon, following Olivier's and Zenny's ideas and work, we can move the color algebra to single precision, because there we always get a precision of 1E-5 or better.

In the code, we agreed to keep a switch anyway to support both precisions. I propose #ifdef MGONGPU_COLORALGEBRA_FLOAT.

We need a float_sv type, where the float_v vector has twice the neppV of the vectors for standard fptypes. I will take care of that.

Zenny will work on the implementation. We need something like

#ifdef __CUDACC__
      typedef float float_sv;
#else
      typedef fptype_sv float_sv; // AV FIXME!
#endif      
#ifdef MGONGPU_COLORALGEBRA_FLOAT
      typedef float_sv fptype2_sv;
#else
      typedef fptype_sv fptype2_sv;
#endif
      fptype2_sv fjampR_sv = (fptype2_sv)( cxreal( jamp_sv ) );
      fptype2_sv fjampI_sv = (fptype2_sv)( cximag( jamp_sv ) );
      // ... color algebra ...
      // deltaMEs += ...; // unchanged (this adds a float to a double)
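To illustrate the precision argument, here is a minimal scalar sketch (not the actual madgraph4gpu code) that evaluates the same color sum once in double and once in float, and returns the relative difference. The 2x2 color matrix, the denominator and the jamp values are made-up illustrative numbers, and the names colorSum and colorSumRelDiff are hypothetical.

```cpp
#include <cmath>
#include <complex>

// Hypothetical 2x2 color matrix and denominator (illustrative values only)
constexpr int ncolor = 2;
constexpr double cf[ncolor][ncolor] = { { 16, -2 }, { -2, 16 } };
constexpr double denom = 3;

// Color sum |M|^2 = sum_ij conj(jamp_i) * cf_ij * jamp_j / denom, computed in type T.
// For a real symmetric cf this equals ( R^T cf R + I^T cf I ) / denom.
template<typename T>
double colorSum( const std::complex<double> jamp[ncolor] )
{
  T me = 0;
  for( int i = 0; i < ncolor; i++ )
  {
    T ztempR = 0, ztempI = 0;
    for( int j = 0; j < ncolor; j++ )
    {
      ztempR += (T)cf[i][j] * (T)jamp[j].real();
      ztempI += (T)cf[i][j] * (T)jamp[j].imag();
    }
    me += ( ztempR * (T)jamp[i].real() + ztempI * (T)jamp[i].imag() ) / (T)denom;
  }
  return me; // in both cases the result is then added to a double accumulator
}

// Relative difference between the float and double color sums for one made-up event
double colorSumRelDiff()
{
  const std::complex<double> jamp[ncolor] = { { 0.3, -1.2 }, { 0.7, 0.4 } };
  const double meD = colorSum<double>( jamp );
  const double meF = colorSum<float>( jamp );
  return std::fabs( meF - meD ) / std::fabs( meD );
}
```

Only the color sum is templated on the precision here, while the jamps stay in double: this mirrors the cast of cxreal( jamp_sv ) and cximag( jamp_sv ) to fptype2_sv in the snippet above.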

valassi commented 2 years ago

This is complex for vector C++. Suppose you have avx2, so vectors of 4 doubles or 8 floats would be optimal (there is NO INTEREST in floats for color algebra unless you can use wider SIMD vectors). The problem is that the handling of FFVs gives you a "page" of 4 events. I have done some relatively simple gymnastics to save the previous 4 events, and do the color algebra and the ME update only once every two pages. This functionally works... but it is SLOWER than using double all the way! The problem is that at some point I must merge two 4-vectors into one 8-vector, and similarly at some point split one 8-vector into two 4-vectors... I am using explicit loops, as I thought that in any case these operations are a small overhead, but somehow the whole thing is slower.

One avenue to explore is to do these 4/8 conversions using __builtin_shufflevector (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), but in my gcc11.2 this does not work. Apparently it is only supported from gcc12, because it comes from clang, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88601.
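As a rough sketch of what such a 4/8 conversion could look like with compiler vector extensions (an illustration, not the code in the repo): the helper names merge4to8, split8to4lo and split8to4hi are hypothetical, the vector sizes assume avx2, and the fallback branch uses explicit loops for compilers without __builtin_shufflevector.

```cpp
// GCC/clang vector extensions: a 32-byte vector holds 4 doubles or 8 floats (avx2 width)
typedef double double_v4 __attribute__( ( vector_size( 32 ) ) );
typedef float float_v4 __attribute__( ( vector_size( 16 ) ) );
typedef float float_v8 __attribute__( ( vector_size( 32 ) ) );

// Assumption: __builtin_shufflevector is available in clang and in gcc12 or later
#if defined( __clang__ ) || ( defined( __GNUC__ ) && __GNUC__ >= 12 )
#define SKETCH_HAS_SHUFFLEVECTOR 1
#endif

// Merge two 4-double vectors into one 8-float vector (hypothetical helper)
inline float_v8 merge4to8( const double_v4& lo, const double_v4& hi )
{
#ifdef SKETCH_HAS_SHUFFLEVECTOR
  const float_v4 flo = __builtin_convertvector( lo, float_v4 ); // double -> float
  const float_v4 fhi = __builtin_convertvector( hi, float_v4 );
  return __builtin_shufflevector( flo, fhi, 0, 1, 2, 3, 4, 5, 6, 7 ); // concatenate
#else
  float_v8 out;
  for( int i = 0; i < 4; i++ ) // explicit-loop fallback
  {
    out[i] = (float)lo[i];
    out[i + 4] = (float)hi[i];
  }
  return out;
#endif
}

// Extract the lower 4 floats of an 8-float vector as 4 doubles (hypothetical helper)
inline double_v4 split8to4lo( const float_v8& v )
{
#ifdef SKETCH_HAS_SHUFFLEVECTOR
  return __builtin_convertvector( __builtin_shufflevector( v, v, 0, 1, 2, 3 ), double_v4 );
#else
  double_v4 out;
  for( int i = 0; i < 4; i++ ) out[i] = v[i];
  return out;
#endif
}

// Extract the upper 4 floats of an 8-float vector as 4 doubles (hypothetical helper)
inline double_v4 split8to4hi( const float_v8& v )
{
#ifdef SKETCH_HAS_SHUFFLEVECTOR
  return __builtin_convertvector( __builtin_shufflevector( v, v, 4, 5, 6, 7 ), double_v4 );
#else
  double_v4 out;
  for( int i = 0; i < 4; i++ ) out[i] = v[i + 4];
  return out;
#endif
}
```

Whether this is actually faster than the explicit loops would need to be measured; the sketch only shows that the merge and split can be expressed without per-element loops.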

Not sure what I will do. Most likely I will just implement it for CUDA as Zenny had initially done, and reevaluate C++ much later...

valassi commented 2 years ago

Hm, __builtin_shuffle should work... but it will take some time.

oliviermattelaer commented 2 years ago

Hi Andrea,

The splitting of the color algebra from the FFV structure for GPU should help to split the issue here. Also, one solution is to set the page size to the one required for floats and run the FFV structure on two consecutive vectors.

valassi commented 2 years ago

Hi Olivier, thanks :-)

Yes indeed, splitting the color algebra will help, if nothing else because of easier bookkeeping and also faster builds.

The solution you mention with two consecutive vectors is what I already implemented, indeed it is viable. The builtin shuffle is what I need in addition (with two consecutive vectors) to make it go faster, otherwise it is slower as I miss SIMD in many important parts. It is in my private repo and needs cleanup and rebases, will push it to somewhere visible tomorrow.

valassi commented 1 year ago

As discussed in the meeting today, the "mixed precision mode" is fully implemented in the hack3 MR #548. That MR also includes the tables of results that we presented at ACAT in October 2022.

Looking at the code, I would summarize the changes as follows:

- there are now two typedefs: fptype is used to compute the partial amplitudes jamp through helicity amplitudes, while fptype2 is used ONLY for the JMJ color matrix multiplication... and of course there are fptype_v and fptype2_v etc
- the makefiles now recognise FPTYPE values d, f, m: the "m" stands for mixed and uses fptype=double and fptype2=float
- one important detail for C++ vectorization is that the SIMD vectors in mixed mode have different sizes, e.g. for avx2 there are 4 doubles per vector in jamps but 8 floats per vector in the color matrix: this means that the color matrix multiplication is done only once every two event pages, and there is the need for a queueing mechanism
- specifically, two new functions are needed to convert one 8-float vector into two 4-double vectors and vice versa: this is done by the fpvsplit0/fpvsplit1 and fpvmerge functions, respectively
- the implementations of the functions above use low-tech solutions, with different ifdefs for fptype_v sizes of 2,4,8... this avoids the need for any of the __builtin_shuffle calls that I had mentioned above
- there are now "m" logs and correspondingly the summary tables contain "m" results... but this is ONLY in tmad, I have not done it for tput (yet?)
- there are -mix|-mixonly options in various scripts
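As an illustration of the "low-tech" approach in the list above, the following is a hedged sketch of what fpvmerge, fpvsplit0 and fpvsplit1 might look like in mixed mode. The signatures, the SKETCH_NEPPV macro and the choice that fpvsplit0 returns the first half and fpvsplit1 the second half are assumptions made for this example, not taken from MR #548; only sizes 2 and 4 are spelled out, and size 8 would follow the same pattern.

```cpp
// Hypothetical stand-in for the build-time SIMD width: doubles per vector (2 or 4 here)
#ifndef SKETCH_NEPPV
#define SKETCH_NEPPV 4
#endif

// Mixed ("m") mode as described above: fptype=double and fptype2=float
typedef double fptype;
typedef float fptype2;
// GCC/clang vector extensions: fptype2_v holds twice as many elements as fptype_v
typedef fptype fptype_v __attribute__( ( vector_size( SKETCH_NEPPV * sizeof( fptype ) ) ) );
typedef fptype2 fptype2_v __attribute__( ( vector_size( 2 * SKETCH_NEPPV * sizeof( fptype2 ) ) ) );

// Merge two double vectors into one float vector (signature is an assumption)
inline fptype2_v fpvmerge( const fptype_v& v1, const fptype_v& v2 )
{
#if SKETCH_NEPPV == 2
  fptype2_v out = { (fptype2)v1[0], (fptype2)v1[1], (fptype2)v2[0], (fptype2)v2[1] };
#elif SKETCH_NEPPV == 4
  fptype2_v out = { (fptype2)v1[0], (fptype2)v1[1], (fptype2)v1[2], (fptype2)v1[3],
                    (fptype2)v2[0], (fptype2)v2[1], (fptype2)v2[2], (fptype2)v2[3] };
#else
#error "This sketch only covers vector sizes 2 and 4"
#endif
  return out;
}

// First half of a float vector, converted to a double vector (assumed semantics)
inline fptype_v fpvsplit0( const fptype2_v& v )
{
#if SKETCH_NEPPV == 2
  fptype_v out = { (fptype)v[0], (fptype)v[1] };
#elif SKETCH_NEPPV == 4
  fptype_v out = { (fptype)v[0], (fptype)v[1], (fptype)v[2], (fptype)v[3] };
#endif
  return out;
}

// Second half of a float vector, converted to a double vector (assumed semantics)
inline fptype_v fpvsplit1( const fptype2_v& v )
{
#if SKETCH_NEPPV == 2
  fptype_v out = { (fptype)v[2], (fptype)v[3] };
#elif SKETCH_NEPPV == 4
  fptype_v out = { (fptype)v[4], (fptype)v[5], (fptype)v[6], (fptype)v[7] };
#endif
  return out;
}
```

Writing out one initializer list per vector size is verbose, but it relies only on element access and brace initialization of vector types, so it does not depend on shuffle builtins being available in the compiler.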

Voilà, that's all. I think this can be closed when #548 is merged, so I will link it there.

madgraph5 / madgraph4gpu

Single precision color algebra #537