madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Simplify color algebra (A-iB)(M)(A+iB) as AMA+BMB and in addition use the fact M is symmetric [done/faster for C++, not done/slower for CUDA] #475

Open valassi opened 2 years ago

valassi commented 2 years ago

Just a brief reminder after the discussion at the last meeting.

It was not 100% clear if this change (merged in PR #457) made sense, for instance

Note in any case that

valassi commented 2 years ago

Renamed as "Simplify color algebra (A-iB)(M)(A+iB) as AMA+BMB and in addition use the fact M is symmetric"

From Olivier's idea: we can gain a factor two by using the fact that M is symmetric, so the i,j and j,i contributions combine: a_i*M_ij*a_j + a_j*M_ji*a_i = 2*a_i*a_j*M_ij !

Conversely, my initial assumption that computing only AMA + BMB instead of AMA + BMB + iAMB - iBMA gains a factor two is wrong! We were in any case calculating ztemps as MA and MB, so the part I omitted was just two vector*vector multiplications, which are computationally negligible (thanks Olivier!).
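Spelling this out: with A, B and M all real and M symmetric, the cross term is not just computationally cheap to drop, it vanishes identically,

$$(A - iB)^T M (A + iB) = A^T M A + B^T M B + i\,(A^T M B - B^T M A) = A^T M A + B^T M B,$$

since $A^T M B = B^T M A$ whenever $M = M^T$.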

There is indeed the possibility to gain a factor two, but this comes from the extra rewriting suggested above, 2*a_i*a_j*M_ij.
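To make that rewriting concrete, here is a minimal scalar sketch (quadraticFormSym and the plain double arrays are illustrative names only; the generated kernels work on the cf color matrix and vectorized jamp arrays):

```cpp
#include <cstddef>

// Quadratic form a^T M a for a real symmetric M, summing only the upper
// triangle: the (i,j) and (j,i) off-diagonal terms are folded into a single
// 2 * a[i] * M[i][j] * a[j] contribution, roughly halving the work.
template <std::size_t N>
double quadraticFormSym( const double ( &m )[N][N], const double ( &a )[N] )
{
  double sum = 0.;
  for( std::size_t i = 0; i < N; ++i )
  {
    sum += a[i] * m[i][i] * a[i]; // diagonal term, counted once
    for( std::size_t j = i + 1; j < N; ++j )
      sum += 2. * a[i] * m[i][j] * a[j]; // folded off-diagonal pair
  }
  return sum;
}
```

Whether the folded (triangular) loop actually wins then depends on the compiler and the target, as the C++ vs CUDA numbers below show.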

valassi commented 2 years ago

The speedup due to a symmetric M will be in PR #545. Note that initially this was in a commit together with single-precision color algebra; I have moved that to #537, which will be in a separate PR.

valassi commented 2 years ago

Uff... also this one is much more complex than it seemed.

Taken alone (without the move to float), this rewriting of the algorithm does go faster in C++ (a bit, maybe 10%), but it is SLOWER in CUDA. And all of this changes depending on the process and on the grid size. For ggttgg with a full grid, or for ggttggg, CUDA is slower. So this must be reverted...

One thing that I plan - but it will take time, probably after ACAT - is to use some constexpr. We should really COUNT the number of operations. Maybe it is the "2 *" that actually costs one more cycle. Computing that once and for all in a constexpr (from cf define another constexpr cf2) should speed this up, and maybe would help even in C++. I do not think the compiler can optimize this away. Or maybe it is the defining of new variables (but I tried to work around it, no luck).
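As a sketch of what that constexpr could look like (C++17; the 2x2 values and the makeCf2 helper are illustrative placeholders, not the generated color matrix):

```cpp
#include <array>
#include <cstddef>

// Illustrative 2x2 stand-in for the generated color matrix cf
// (the real cf is process-dependent and much larger).
constexpr std::size_t ncolor = 2;
constexpr std::array<std::array<double, ncolor>, ncolor> cf{ { { 3., 1. }, { 1., 3. } } };

// Build cf2 = 2*cf once at compile time, so that the hot loop
// never pays for the "2 *" multiplication at run time.
constexpr auto makeCf2()
{
  std::array<std::array<double, ncolor>, ncolor> out{};
  for( std::size_t i = 0; i < ncolor; ++i )
    for( std::size_t j = 0; j < ncolor; ++j )
      out[i][j] = 2. * cf[i][j];
  return out;
}
constexpr auto cf2 = makeCf2();
```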

So I would say: for the hack2 PR, I will add this to C++, not to CUDA. Then later on we can reconsider. And then a hack3 PR where we also move to float and mixed precision, but there it is C++ that gets more complicated. No rocket science anywhere, but a lot of places where we should just be slow and careful, not something to do in a rush.

valassi commented 2 years ago

See the results here https://github.com/madgraph5/madgraph4gpu/pull/545/commits/167a31af426ec98000ffe0312487869470df0bba

I have now reverted to the old version for CUDA - I will instead apply float color algebra there later on.

valassi commented 2 years ago

I have committed my last changes for the MR #545. In the end I used only in C++ the fact that the matrix is symmetric (and also a constexpr), while in CUDA this seems slower. In any case this might go to cuBLAS and tensor cores on CUDA... For CUDA for the moment however I did split real/imag, this is harmless (and needed for single precision color matrix).
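For reference, a rough scalar sketch of that real/imag split (colorSumSplit and its arguments are hypothetical names; the real code operates on vectorized jamp arrays, and the point of the split is that the two real arrays could later be stored in single precision):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Split the complex color amplitudes jamp = A + iB into two real arrays,
// then evaluate the color sum as A.M.A + B.M.B: M is real, so the cross
// term is purely imaginary and drops out of the (real) squared ME.
double colorSumSplit( const std::vector<std::complex<double>>& jamp,
                      const std::vector<std::vector<double>>& cf )
{
  const std::size_t n = jamp.size();
  std::vector<double> re( n ), im( n ); // candidates for float storage
  for( std::size_t i = 0; i < n; ++i )
  {
    re[i] = jamp[i].real();
    im[i] = jamp[i].imag();
  }
  double sum = 0.;
  for( std::size_t i = 0; i < n; ++i )
    for( std::size_t j = 0; j < n; ++j )
      sum += re[i] * cf[i][j] * re[j] + im[i] * cf[i][j] * im[j];
  return sum;
}
```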

I am keeping this open anyway for the moment... may need some rethinking...

valassi commented 2 years ago

One thing that I plan - but it will take time, probably after ACAT - is to use some constexpr.

To be clear: this is now done, and it is a bit faster in C++, but still slower in CUDA, so I disabled that in CUDA.

https://github.com/valassi/madgraph4gpu/commit/7d3b46409b65bd12de1ff5970cfe470ae576760b

oliviermattelaer commented 2 years ago

I have done some tests (with the Fortran code). I measured the time to run one PS point for g g > 5 g (more than 70% of the time is spent in the color matrix); here are the results:

  1. normal computation: 0.532s
  2. symmetric matrix element (with the 2*, without further treatment) -> 0.32s
  3. symmetric but using AMA + BMB -> 0.43s
     a. but jamp were complex numbers and both loops were taking real() ...

Given that the "theoretical" time gain of AMA + BMB compared to (A-iB)M(A+iB) is very small, I would not do it; my guess is that this is why the gains are not as good as expected here.

Note that the statement "For CUDA for the moment however I did split real/imag, this is harmless (and needed for single precision color matrix)" is not true: we can do (A-iB)M(A+iB) within single precision too (zenny implemented both, but that's a detail).

valassi commented 1 year ago

Hi Olivier, thanks.

Listen, I am about to merge #545, but I will keep this open for further investigations. The MR #545 includes the splitting of real and imaginary parts in both C++ and CUDA, and it includes the use of a triangular matrix in C++ but not in CUDA. Overall, C++ is a bit faster, while CUDA remains the same. Using the triangular matrix in CUDA would go slower.

So, essentially I have done more changes for C++ than for CUDA, and I will keep this open. Then we can iterate. But I want to move on to other stuff (without the risk of losing this old work), and since a bit of improvement is there, let's use it already.

Andrea