madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Single precision color algebra #537

Closed valassi closed 1 year ago

valassi commented 2 years ago

As discussed at the hackathon, based on Olivier's and Zenny's ideas and work: we can move the color algebra to single precision, because there we always get a precision of E-5 or better.
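To illustrate the idea, here is a minimal scalar sketch (not the code in the repository). The color sum is a quadratic form in the jamps, so it can be evaluated in float while the result is still accumulated into a double. The names `colorSum` and `ztempR`/`ztempI`, the row-major matrix layout and the assumption of a real symmetric color matrix are hypothetical choices for this example.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Quadratic form sum_ij conj(jamp_i) * cf_ij * jamp_j for a real symmetric
// matrix cf (assumption), stored row-major as ncolor x ncolor.
// T is the type used inside the color algebra (float or double).
template<typename T>
double colorSum( const std::vector<std::complex<double>>& jamp,
                 const std::vector<double>& cf )
{
  const std::size_t n = jamp.size();
  assert( cf.size() == n * n );
  double deltaME = 0; // the accumulator stays in double
  for( std::size_t i = 0; i < n; i++ )
  {
    T ztempR = 0; // real part of (cf * jamp)_i, computed in T
    T ztempI = 0; // imaginary part of (cf * jamp)_i, computed in T
    for( std::size_t j = 0; j < n; j++ )
    {
      ztempR += (T)cf[i * n + j] * (T)jamp[j].real();
      ztempI += (T)cf[i * n + j] * (T)jamp[j].imag();
    }
    // The product is evaluated in T, then summed to a double
    deltaME += ztempR * (T)jamp[i].real() + ztempI * (T)jamp[i].imag();
  }
  return deltaME;
}
```

Calling `colorSum<float>` and `colorSum<double>` on the same inputs gives a quick way to compare the two precisions for a given color matrix.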

In the code: agreed to keep a switch anyway to support both. I propose #ifdef MGONGPU_COLORALGEBRA_FLOAT.

We need a float_sv type, where the vector type float_v has a neppV that is twice the neppV of the standard fptype vectors. I will take care of that.
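As a rough sketch of what this means with gcc/clang vector extensions: for a fixed register width, a float vector holds twice as many elements as a double vector. The names `double_v` and `nElements` and the choice of 256-bit vectors are assumptions for this example, not the definitions used in the repository.

```cpp
#include <cstddef>

// Assumption: 256-bit SIMD registers (e.g. AVX2) and fptype = double
constexpr std::size_t neppV = 4; // number of doubles per vector

// Vector of neppV doubles (32 bytes)
typedef double double_v __attribute__( ( vector_size( neppV * sizeof( double ) ) ) );

// Vector of 2 * neppV floats (also 32 bytes)
typedef float float_v __attribute__( ( vector_size( 2 * neppV * sizeof( float ) ) ) );

// Helper: number of elements of scalar type T in vector type V
template<typename V, typename T>
constexpr std::size_t nElements()
{
  return sizeof( V ) / sizeof( T );
}

// Both vector types fill the same register width
static_assert( sizeof( double_v ) == sizeof( float_v ), "same vector width" );
static_assert( nElements<float_v, float>() == 2 * nElements<double_v, double>(), "twice the neppV" );
```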

Zenny will work on the implementation. We need something like

#ifdef __CUDACC__
      typedef float float_sv;
#else
      typedef fptype_sv float_sv; // AV FIXME!
#endif
#ifdef MGONGPU_COLORALGEBRA_FLOAT
      typedef float_sv fptype2_sv; // color algebra in single precision
#else
      typedef fptype_sv fptype2_sv; // color algebra in the default fptype
#endif
      fptype2_sv fjampR_sv = (fptype2_sv)( cxreal( jamp_sv ) );
      fptype2_sv fjampI_sv = (fptype2_sv)( cximag( jamp_sv ) );
      // ... color algebra ...
      // deltaMEs += ... is unchanged (it sums a float to a double)

valassi commented 2 years ago

This is complex for vector C++. Suppose you have avx2, so vectors of 4 doubles or 8 floats would be optimal (there is NO INTEREST in floats for color algebra unless you can use wider SIMD vectors). The problem is that the handling of FFVs gives you a "page" of 4 events. I have done some relatively simple gymnastics to save the previous 4 events, and do the color algebra and the ME update only once every two pages. This is functionally correct... but it is SLOWER than using double all the way! The problem is that at some point I must merge two 4-vectors into one 8-vector, and similarly at some point split one 8-vector into two 4-vectors... I am using explicit loops, as I thought that in any case these operations would be a small overhead, but somehow the whole thing is slower.
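The merge and split with explicit loops can be sketched as follows, using gcc/clang vector extensions. This is only an illustration of the 4/8 conversion described above: the names `mergeLoop` and `splitLoop` and the fixed 256-bit vector width are assumptions, not the code in the private repo.

```cpp
// Assumption: 256-bit vectors, i.e. 4 doubles or 8 floats
constexpr int neppV = 4;
typedef double double_v __attribute__( ( vector_size( neppV * sizeof( double ) ) ) );
typedef float float_v __attribute__( ( vector_size( 2 * neppV * sizeof( float ) ) ) );

// Merge two pages of 4 doubles into one vector of 8 floats (explicit loop)
inline float_v mergeLoop( const double_v& d0, const double_v& d1 )
{
  float_v out = {};
  for( int i = 0; i < neppV; i++ )
  {
    out[i] = (float)d0[i];         // first page goes to elements 0..3
    out[i + neppV] = (float)d1[i]; // second page goes to elements 4..7
  }
  return out;
}

// Split one vector of 8 floats back into two pages of 4 doubles (explicit loop)
inline void splitLoop( const float_v& f, double_v& d0, double_v& d1 )
{
  for( int i = 0; i < neppV; i++ )
  {
    d0[i] = f[i];
    d1[i] = f[i + neppV];
  }
}
```

Whether the compiler turns these element-wise loops into efficient SIMD instructions depends on the compiler and its flags.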

One avenue to explore is to do these 4/8 conversions using __builtin_shufflevector (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), but in my gcc11.2 this does not work. Apparently it is supported in gcc12, because it comes from clang, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88601.
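A hedged sketch of what such a conversion could look like: concatenate the two 4-double vectors with `__builtin_shufflevector` and then narrow to float with `__builtin_convertvector`, falling back to an explicit loop where the builtins are not available. The type names, the function name `mergeShuffle` and the intermediate 8-double vector are assumptions for this example, and no performance gain is implied.

```cpp
typedef double double4_v __attribute__( ( vector_size( 32 ) ) ); // 4 doubles
typedef double double8_v __attribute__( ( vector_size( 64 ) ) ); // 8 doubles
typedef float float8_v __attribute__( ( vector_size( 32 ) ) );   // 8 floats

// Compilers without __has_builtin are treated as lacking the builtins
#ifndef __has_builtin
#define __has_builtin( x ) 0
#endif

// Merge two vectors of 4 doubles into one vector of 8 floats
inline float8_v mergeShuffle( const double4_v& d0, const double4_v& d1 )
{
#if __has_builtin( __builtin_shufflevector ) && __has_builtin( __builtin_convertvector )
  // Indices 0..3 select from d0, indices 4..7 select from d1
  const double8_v d01 = __builtin_shufflevector( d0, d1, 0, 1, 2, 3, 4, 5, 6, 7 );
  // Element-wise conversion from double to float
  return __builtin_convertvector( d01, float8_v );
#else
  // Fallback: explicit loop (e.g. for compilers without __builtin_shufflevector)
  float8_v out = {};
  for( int i = 0; i < 4; i++ )
  {
    out[i] = (float)d0[i];
    out[i + 4] = (float)d1[i];
  }
  return out;
#endif
}
```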

Not sure what I will do. Most likely I will just implement it for CUDA as Zenny had initially done, and reevaluate C++ much later...

valassi commented 2 years ago

Hm __builtin_shuffle should work... but it will take some time

oliviermattelaer commented 2 years ago

Hi Andrea,

Splitting the color algebra from the FFV structure for GPU should help to split the issue here. Also, one solution is to set the page size to the one required for float and run the FFV structure on two consecutive vectors.

valassi commented 2 years ago

Hi Olivier, thanks :-)

Yes indeed, splitting the color algebra will help, if nothing else because of easier bookkeeping and also faster builds.

The solution you mention with two consecutive vectors is what I already implemented, and indeed it is viable. The builtin shuffle is what I need in addition (with two consecutive vectors) to make it go faster, otherwise it is slower as I miss SIMD in many important parts. It is in my private repo and needs cleanup and rebases, I will push it to somewhere visible tomorrow.

valassi commented 1 year ago

As discussed in the meeting today, the "mixed precision mode" is fully implemented in the hack3 MR #548. That MR also includes the tables of results that we presented at ACAT in October 2022.

Looking at the code, I would summarize the changes as follows:

Voilà, that's all. I think this can be closed when #548 is merged, so I will link it there.