valassi opened this issue 2 years ago (status: Open)
Renamed as "Simplify color algebra (A-iB)(M)(A+iB) as AMA+BMB and in addition use the fact M is symmetric "
From Olivier's idea: since M is symmetric (M_ij = M_ji), we can gain a factor two by using the fact that a_i*M_ij*a_j + a_j*M_ij*a_i = 2*a_i*a_j*M_ij!
Conversely, my initial assumption that computing only AMA+BMB instead of AMA+BMB+iAMB-iBMA gains a factor two is wrong! We were in any case calculating the ztemps as MA and MB, so the part I omitted was just two vector*vector multiplications, which are computationally negligible (thanks Olivier!).
There is indeed a possibility to gain a factor two, but it comes from the extra rewriting suggested above, 2*a_i*a_j*M_ij.
The speedup due to a symmetric M will be in PR #545. Note that initially this was in a commit with also single precision color algebra, I have moved that to #537 which will be in a separate PR.
Uff... also this one is much more complex than it seemed.
Taken alone (without the move to float), this rewriting of the algorithm does go a bit faster in C++ (maybe 10%), but it is SLOWER in CUDA. And all of this changes depending on the process and on the grid size. For ggttgg with a full grid, or for ggttggg, CUDA is slower. So this must be reverted...
One thing that I plan - but it will take time, probably after ACAT - is to use some constexpr. We should really COUNT the number of operations. Maybe it is the "2 *" that actually costs one more cycle. Computing that once and for all in a constexpr (from cf, define another constexpr cf2) should speed this up, and maybe would help even in C++. I do not think the compiler can optimize this away. Or maybe it is the definition of new variables (but I tried to work around that, with no luck).
So I would say: for the hack2 PR, I will add this to C++ but not to CUDA. Then later on we can reconsider. And then a hack3 PR where we also move to float and mixed precision, but there it is C++ that gets more complicated. No rocket science anywhere, but a lot of places where we should just be slow and careful, not something to do in a rush.
See the results here https://github.com/madgraph5/madgraph4gpu/pull/545/commits/167a31af426ec98000ffe0312487869470df0bba
I have now reverted to the old version for CUDA - I will instead apply the float color algebra there later on.
I have committed my last changes for the MR #545. In the end I used the fact that the matrix is symmetric (and also a constexpr) only in C++, while in CUDA this seems slower. In any case this might go to cuBLAS and tensor cores on CUDA... For CUDA, for the moment, I did however split real/imag; this is harmless (and needed for the single precision color matrix).
I am keeping this open anyway for the moment... may need some rethinking...
> One thing that I plan - but it will take time, probably after ACAT - is to use some constexpr.
To be clear: this is now done, and it is a bit faster in C++, but still slower in CUDA, so I disabled it in CUDA.
https://github.com/valassi/madgraph4gpu/commit/7d3b46409b65bd12de1ff5970cfe470ae576760b
I have done some tests (with the Fortran code): I have measured the time to run one PS point for g g > 5 g (more than 70% of the time is spent in the color matrix); here is the result
Given that the "theoretical" gain in time for AMA + BMB compared to (A-iB)M(A+iB) is very small, I would not do it; my guess is that this is why the gains are not as good as expected here.
Note that the statement "For CUDA for the moment however I did split real/imag, this is harmless (and needed for single precision color matrix)" is not true: we can do (A-iB)M(A+iB) within single precision too (zenny implemented both, but that's a detail).
Hi Olivier, thanks.
Listen, I am about to merge #545, but I will keep this open for further investigations. The MR #545 includes the splitting of real and imaginary parts in both C++ and CUDA, and it includes the use of a triangular matrix in C++ but not in CUDA. Overall, C++ is a bit faster, while CUDA remains the same. Using the triangular matrix in CUDA would go slower.
So, essentially I have done more changes for C++ than for CUDA, and I will keep this open. Then we can iterate. But I want to move on to other stuff (without the risk of losing this old work), and since a few improvements are there, let's use them already.
Andrea
Just a brief reminder after the discussion at the last meeting.
It was not 100% clear if this change (merged in PR #545) made sense, for instance
Note in any case that