valassi closed this issue 1 year ago
This is complex for vectorized C++. Suppose you have AVX2, so vectors of 4 doubles or 8 floats would be optimal (there is NO INTEREST in floats for color algebra unless you can use wider SIMD vectors). The problem is that the handling of FFVs gives you a "page" of 4 events. I have done some relatively simple gymnastics to save the previous 4 events, and do the color algebra and the ME update only once every two pages. This works functionally... but it is SLOWER than using double all the way! The problem is that at some point I must merge two 4-vectors into one 8-vector, and similarly at some point split one 8-vector into two 4-vectors. I am using explicit loops for this, as I thought that these operations would be a small overhead in any case, but somehow the whole thing is slower.
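For reference, the explicit-loop merge and split described above could look roughly like the sketch below. It uses the GCC/clang vector extensions; the type names `double_v4` and `float_v8` and the functions `mergePages` and `splitPages` are hypothetical and are not taken from the actual code.

```cpp
// Sketch only: all names here are hypothetical, not the ones used in the real code.
typedef double double_v4 __attribute__( ( vector_size( 32 ) ) ); // 4 doubles in 256 bits (AVX2)
typedef float float_v8 __attribute__( ( vector_size( 32 ) ) );   // 8 floats in 256 bits (AVX2)

// Merge two pages of 4 doubles into one vector of 8 floats, one element at a time.
inline void mergePages( const double_v4& page0, const double_v4& page1, float_v8& out )
{
  for( int i = 0; i < 4; i++ )
  {
    out[i] = (float)page0[i];     // first page fills elements 0-3
    out[i + 4] = (float)page1[i]; // second page fills elements 4-7
  }
}

// Split one vector of 8 floats back into two pages of 4 doubles.
inline void splitPages( const float_v8& in, double_v4& page0, double_v4& page1 )
{
  for( int i = 0; i < 4; i++ )
  {
    page0[i] = (double)in[i];
    page1[i] = (double)in[i + 4];
  }
}
```

Whether the compiler turns these scalar loops into SIMD conversions is not guaranteed, which may be one reason why this approach ends up slower than expected.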
One avenue to explore is to do these 4/8 conversions using __builtin_shufflevector (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), but this does not work in my gcc11.2. Apparently it is supported in gcc12, because it comes from clang: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88601.
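A minimal sketch of what the 4/8 conversions might look like with `__builtin_shufflevector` and `__builtin_convertvector` follows. It assumes a compiler that provides both builtins (clang, or gcc12 and later) and falls back to explicit loops otherwise. The type names, the function names and the `HAVE_VECTOR_BUILTINS` macro are hypothetical.

```cpp
// Sketch only: all names here are hypothetical, not the ones used in the real code.
typedef double double_v4 __attribute__( ( vector_size( 32 ) ) ); // 4 doubles
typedef float float_v4 __attribute__( ( vector_size( 16 ) ) );   // 4 floats
typedef float float_v8 __attribute__( ( vector_size( 32 ) ) );   // 8 floats

// Detect the builtins where the compiler supports __has_builtin.
#if defined( __has_builtin )
#if __has_builtin( __builtin_shufflevector ) && __has_builtin( __builtin_convertvector )
#define HAVE_VECTOR_BUILTINS 1
#endif
#endif

// Merge two pages of 4 doubles into one vector of 8 floats.
inline void mergePagesShuffle( const double_v4& page0, const double_v4& page1, float_v8& out )
{
#ifdef HAVE_VECTOR_BUILTINS
  const float_v4 lo = __builtin_convertvector( page0, float_v4 ); // 4 doubles to 4 floats
  const float_v4 hi = __builtin_convertvector( page1, float_v4 );
  out = __builtin_shufflevector( lo, hi, 0, 1, 2, 3, 4, 5, 6, 7 ); // concatenate lo and hi
#else
  for( int i = 0; i < 4; i++ ) // fallback: explicit loop
  {
    out[i] = (float)page0[i];
    out[i + 4] = (float)page1[i];
  }
#endif
}

// Split one vector of 8 floats into two pages of 4 doubles.
inline void splitPagesShuffle( const float_v8& in, double_v4& page0, double_v4& page1 )
{
#ifdef HAVE_VECTOR_BUILTINS
  const float_v4 lo = __builtin_shufflevector( in, in, 0, 1, 2, 3 ); // lower half
  const float_v4 hi = __builtin_shufflevector( in, in, 4, 5, 6, 7 ); // upper half
  page0 = __builtin_convertvector( lo, double_v4 );
  page1 = __builtin_convertvector( hi, double_v4 );
#else
  for( int i = 0; i < 4; i++ ) // fallback: explicit loop
  {
    page0[i] = (double)in[i];
    page1[i] = (double)in[i + 4];
  }
#endif
}
```

The indices passed to `__builtin_shufflevector` select elements from the concatenation of its two vector arguments, so the result can be wider or narrower than the inputs. Whether this is actually faster than the loops would need to be measured.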
Not sure what I will do. Most likely I will just implement it for CUDA as Zenny had initially done, and reevaluate C++ much later...
Hm, __builtin_shuffle should work... but it will take some time.
Hi Andrea,
The splitting of the color algebra from the FFV structure for GPU should help to split the issue here. Also, one solution is to set the page size to the one required for float and run the FFV structure on two consecutive vectors.
Hi Olivier, thanks :-)
Yes indeed, splitting the color algebra will help, if nothing else because of easier bookkeeping and also faster builds.
The solution you mention with two consecutive vectors is what I already implemented, and indeed it is viable. The builtin shuffle is what I need in addition (with two consecutive vectors) to make it go faster, otherwise it is slower as I miss SIMD in many important parts. It is in my private repo and needs cleanup and rebases, I will push it to somewhere visible tomorrow.
As discussed in the meeting today, the "mixed precision mode" is fully implemented in the hack3 MR #548. That MR also includes the tables of results that we presented at ACAT in October 2022.
Looking at the code, I would summarize the changes as follows:
Voilà, that's all. I think this can be closed when #548 is merged, so I will link it there.
As discussed at the hackathon, from Olivier's and Zenny's ideas/work: we can move the color algebra to single precision, because there we always get E-5 precision or better.
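As a rough illustration of this kind of precision check, the sketch below computes a color sum once in double and once in float and returns the relative difference. The 2x2 color matrix, the denominator and the function names are illustrative assumptions only. They are not the real process data or the real code, and a single toy example does not establish the E-5 figure.

```cpp
#include <cmath>
#include <complex>

// Sketch only: illustrative values and hypothetical names, not the real process data.
constexpr int ncolor = 2;
constexpr double colorMatrix[ncolor][ncolor] = { { 16, -2 }, { -2, 16 } };
constexpr double colorDenom = 3;

// Color sum |M|^2 = sum_ij conj(jamp_i) * cf_ij * jamp_j / denom, evaluated in precision T.
template<typename T>
T colorSum( const std::complex<double> ( &jamp )[ncolor] )
{
  T sum = 0;
  for( int i = 0; i < ncolor; i++ )
  {
    std::complex<T> ztemp( 0, 0 );
    for( int j = 0; j < ncolor; j++ )
    {
      const std::complex<T> jj( (T)jamp[j].real(), (T)jamp[j].imag() );
      ztemp += (T)colorMatrix[i][j] * jj;
    }
    const std::complex<T> ji( (T)jamp[i].real(), (T)jamp[i].imag() );
    sum += ( ztemp * std::conj( ji ) ).real() / (T)colorDenom;
  }
  return sum;
}

// Relative difference between the float and the double color sums.
inline double relativeDifference( const std::complex<double> ( &jamp )[ncolor] )
{
  const double d = colorSum<double>( jamp );
  const double f = colorSum<float>( jamp );
  return std::fabs( f - d ) / std::fabs( d );
}
```

In this sketch only the color sum is evaluated in float, while the amplitudes are taken in double as input, which mirrors the idea of keeping the rest of the calculation in double precision.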
In the code: agreed to keep a switch anyway to support both. I propose
#ifdef MGONGPU_COLORALGEBRA_FLOAT
We need a float_sv type, where float_v has neppV twice the neppV for standard fptypes. I will take care of that.
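One possible shape for such a switch is sketched here, assuming double as the standard fptype and AVX2 pages of 4 events. Only the macro name `MGONGPU_COLORALGEBRA_FLOAT` comes from the proposal above. The names `fptype_color`, `neppV_color` and `fptype_color_v` are hypothetical placeholders.

```cpp
// Sketch only: apart from MGONGPU_COLORALGEBRA_FLOAT, all names here are hypothetical.
typedef double fptype;   // standard floating point type (assumed double here)
constexpr int neppV = 4; // events per vector for fptype (assumed AVX2: 4 doubles)

#ifdef MGONGPU_COLORALGEBRA_FLOAT
typedef float fptype_color;             // color algebra in single precision
constexpr int neppV_color = 2 * neppV;  // twice as many events per vector
#else
typedef fptype fptype_color;            // color algebra in the standard precision
constexpr int neppV_color = neppV;
#endif

// Vector types built with the GCC/clang vector extensions.
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );
typedef fptype_color fptype_color_v __attribute__( ( vector_size( neppV_color * sizeof( fptype_color ) ) ) );

// In both modes the two vector types should fill the same SIMD register width.
static_assert( sizeof( fptype_color_v ) == sizeof( fptype_v ), "vector widths must match" );
```

With this layout the rest of the code can keep using `fptype_v`, and only the color algebra needs to know about the wider type.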
Zenny will work on the implementation. We need something like