Color algebra optimisations (use tensor cores?)

valassi commented 3 years ago

Hi @roiser @oliviermattelaer I am opening a generic issue about color algebra optimisations.

(Hi @ingvildh, this is what we discussed this morning and where I suggested that maybe the A100 tensor cores could be interesting)

This can be considered one issue in the epic of gg to ttgg optimisations, issue #146. The snippets below come from a commit I used there, https://github.com/madgraph5/madgraph4gpu/commit/dd8711d0aa22c85429802b23de3169a68a97f298

The color algebra only exists for QCD processes, e.g. gg to tt(g)(g), not eemumu. It is one of the few places where calculating the "matrix element" (of the scattering matrix) actually involves matrix multiplications.

The more particles in the final state, the larger the "color matrix" involved.

For ggttgg, IIUC this is essentially this snippet here: https://github.com/madgraph5/madgraph4gpu/blob/dd8711d0aa22c85429802b23de3169a68a97f298/epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gCPPProcess.cu#L1133

 const int ncolor = 24; 
...
// The color matrix;
  static const double denom[ncolor] = {54, 54, 54, 54, 54, 54, 54, 54, 54, 54,
      54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54};
  static const double cf[ncolor][ncolor] = {{512, -64, -64, 8, 8, 80, -64, 8,
      8, -1, -1, -10, 8, -1, 80, -10, 71, 62, -1, -10, -10, 62, 62, -28}, {-64,
      512, 8, 80, -64, 8, 8, -64, -1, -10, 8, -1, -1, -10, -10, 62, 62, -28, 8,
      -1, 80, -10, 71, 62}, {-64, 8, 512, -64, 80, 8, 8, -1, 80, -10, 71, 62,
      -64, 8, 8, -1, -1, -10, -10, -1, 62, -28, -10, 62}, {8, 80, -64, 512, 8,
      -64, -1, -10, -10, 62, 62, -28, 8, -64, -1, -10, 8, -1, -1, 8, 71, 62,
      80, -10}, {8, -64, 80, 8, 512, -64, -1, 8, 71, 62, 80, -10, -10, -1, 62,
      -28, -10, 62, -64, 8, 8, -1, -1, -10}, {80, 8, 8, -64, -64, 512, -10, -1,
      62, -28, -10, 62, -1, 8, 71, 62, 80, -10, 8, -64, -1, -10, 8, -1}, {-64,
      8, 8, -1, -1, -10, 512, -64, -64, 8, 8, 80, 80, -10, 8, -1, 62, 71, -10,
      62, -1, -10, -28, 62}, {8, -64, -1, -10, 8, -1, -64, 512, 8, 80, -64, 8,
      -10, 62, -1, -10, -28, 62, 80, -10, 8, -1, 62, 71}, {8, -1, 80, -10, 71,
      62, -64, 8, 512, -64, 80, 8, 8, -1, -64, 8, -10, -1, 62, -28, -10, -1,
      62, -10}, {-1, -10, -10, 62, 62, -28, 8, 80, -64, 512, 8, -64, -1, -10,
      8, -64, -1, 8, 71, 62, -1, 8, -10, 80}, {-1, 8, 71, 62, 80, -10, 8, -64,
      80, 8, 512, -64, 62, -28, -10, -1, 62, -10, 8, -1, -64, 8, -10, -1},
      {-10, -1, 62, -28, -10, 62, 80, 8, 8, -64, -64, 512, 71, 62, -1, 8, -10,
      80, -1, -10, 8, -64, -1, 8}, {8, -1, -64, 8, -10, -1, 80, -10, 8, -1, 62,
      71, 512, -64, -64, 8, 8, 80, 62, -10, -28, 62, -1, -10}, {-1, -10, 8,
      -64, -1, 8, -10, 62, -1, -10, -28, 62, -64, 512, 8, 80, -64, 8, -10, 80,
      62, 71, 8, -1}, {80, -10, 8, -1, 62, 71, 8, -1, -64, 8, -10, -1, -64, 8,
      512, -64, 80, 8, -28, 62, 62, -10, -10, -1}, {-10, 62, -1, -10, -28, 62,
      -1, -10, 8, -64, -1, 8, 8, 80, -64, 512, 8, -64, 62, 71, -10, 80, -1, 8},
      {71, 62, -1, 8, -10, 80, 62, -28, -10, -1, 62, -10, 8, -64, 80, 8, 512,
      -64, -1, 8, -10, -1, -64, 8}, {62, -28, -10, -1, 62, -10, 71, 62, -1, 8,
      -10, 80, 80, 8, 8, -64, -64, 512, -10, -1, -1, 8, 8, -64}, {-1, 8, -10,
      -1, -64, 8, -10, 80, 62, 71, 8, -1, 62, -10, -28, 62, -1, -10, 512, -64,
      -64, 8, 8, 80}, {-10, -1, -1, 8, 8, -64, 62, -10, -28, 62, -1, -10, -10,
      80, 62, 71, 8, -1, -64, 512, 8, 80, -64, 8}, {-10, 80, 62, 71, 8, -1, -1,
      8, -10, -1, -64, 8, -28, 62, 62, -10, -10, -1, -64, 8, 512, -64, 80, 8},
      {62, -10, -28, 62, -1, -10, -10, -1, -1, 8, 8, -64, 62, 71, -10, 80, -1,
      8, 8, 80, -64, 512, 8, -64}, {62, 71, -10, 80, -1, 8, -28, 62, 62, -10,
      -10, -1, -1, 8, -10, -1, -64, 8, 8, -64, 80, 8, 512, -64}, {-28, 62, 62,
      -10, -10, -1, 62, 71, -10, 80, -1, 8, -10, -1, -1, 8, 8, -64, 80, 8, 8,
      -64, -64, 512}};

  // Sum and square the color flows to get the matrix element
  for(int icol = 0; icol < ncolor; icol++ )
  {
    cxtype ztemp = cxmake(0, 0); 
    for(int jcol = 0; jcol < ncolor; jcol++ )
      ztemp = ztemp + cf[icol][jcol] * jamp[jcol]; 
    meHelSum = meHelSum + cxreal(ztemp * conj(jamp[icol]))/denom[icol]; 
  }

So here in practice it is a quadratic form using a 24x24 matrix. I think that for gg to ttggg or ttgggg the dimensionality can still increase enormously.
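Written out as a formula, the loop above adds the following contribution to meHelSum for one helicity. This is only a restatement of the snippet; the shorthand J for jamp, C for cf and d for denom is introduced here and is not notation from the code:

```latex
\Delta\,\mathrm{meHelSum}
  = \sum_{i=0}^{23} \frac{1}{d_i}\,
    \mathrm{Re}\!\left( J_i^{*} \sum_{j=0}^{23} C_{ij}\, J_j \right)
```

Since every d_i equals 54 here, this is Re(J† C J)/54. If C is real and symmetric, which the listed values suggest but which is an assumption here, then J† C J is already real and taking the real part changes nothing.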

See for instance O. Mattelaer, K. Ostrolenk, Speeding up MadGraph5_aMC@NLO, MCNET-21-01 (2021). https://arxiv.org/abs/2102.00773 Here it is found that for three gluons ie gg to ttggg, the color algebra takes 60% of the computation overall (in the production Fortran code).

Olivier had mentioned that the problem is not only the dimension of the matrix, but also the fact that for each event the calculation is repeated for each helicity (there are 64 of them here, see https://github.com/madgraph5/madgraph4gpu/blob/dd8711d0aa22c85429802b23de3169a68a97f298/epoch2/cuda/gg_ttgg/src/mgOnGpuConfig.h#L52)
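One way to picture why the repetition over helicities matters for matrix hardware: if the jamp vectors of several helicities are stored side by side, the inner loop becomes a matrix-matrix product instead of many matrix-vector products. The sketch below is plain C++ and only illustrates that shape. The function colorSumBatched and the row-major storage layout are assumptions made for this example, not how the madgraph4gpu code is organised.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cxtype = std::complex<double>;

// Hypothetical helper (not from the madgraph4gpu code).
// cf   : ncolor x ncolor real matrix, row-major
// jamp : ncolor x nhel complex matrix, row-major (one column per helicity)
// Returns sum over helicities of Re(conj(jamp) . (cf * jamp)) / denom.
double colorSumBatched(const std::vector<double>& cf,
                       const std::vector<cxtype>& jamp,
                       std::size_t ncolor, std::size_t nhel, double denom)
{
  assert(cf.size() == ncolor * ncolor);
  assert(jamp.size() == ncolor * nhel);
  // Step 1: Z = cf * jamp, an (ncolor x ncolor) times (ncolor x nhel) product
  std::vector<cxtype> z(ncolor * nhel, cxtype(0, 0));
  for (std::size_t i = 0; i < ncolor; i++)
    for (std::size_t j = 0; j < ncolor; j++)
      for (std::size_t h = 0; h < nhel; h++)
        z[i * nhel + h] += cf[i * ncolor + j] * jamp[j * nhel + h];
  // Step 2: elementwise Re(conj(jamp) * Z), summed over colors and helicities
  double me = 0;
  for (std::size_t k = 0; k < ncolor * nhel; k++)
    me += (z[k] * std::conj(jamp[k])).real() / denom;
  return me;
}
```

Step 1 is the part with the shape of a dense matrix product; whether it maps well onto tensor cores (precision, tile sizes, complex arithmetic) is exactly what would need to be investigated.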

There are many different schemes for color algebra, but I think we are using the 'color flow' formalism: F. Maltoni, K. Paul, T. Stelzer, S. Willenbrock, Color-flow decomposition of QCD amplitudes, Phys. Rev. D 67 (2003) 014026. https://doi.org/10.1103/PhysRevD.67.014026

This 2009 paper also makes many useful comments about the complexity of porting the color algebra to GPUs: K. Hagiwara et al., Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU), Eur. Phys. J. C 70 (2010) 513. https://doi.org/10.1140/epjc/s10052-010-1465-5

I think that what would need to be done is

first, simply time this section of the code, with respect to the whole of the calculate_wavefunctions function, where this is included (and which does two very different things: first, calculate all Feynman diagrams; second, do this color algebra multiplication)
second, try to understand if there are ways to speed it up using technology, e.g. using tensor cores #118 (note that there is also a lot of physics research on new algorithms to make this faster, but this is not the point here)
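A minimal host-side sketch of the first step, assuming plain C++ rather than the CUDA kernel (host timers cannot separate the two phases from inside a kernel, where a profiler or a device-side clock would be needed instead). The functions computeDiagrams, colorSum and timePhases and the struct PhaseTimes are hypothetical stand-ins for this example, not names from the madgraph4gpu code:

```cpp
#include <cassert>
#include <chrono>
#include <complex>
#include <cstddef>
#include <vector>

using cxtype = std::complex<double>;

// Hypothetical stand-in for the Feynman diagram phase: fills jamp with dummy values.
void computeDiagrams(std::vector<cxtype>& jamp)
{
  for (std::size_t i = 0; i < jamp.size(); i++)
    jamp[i] = cxtype(1.0 + i, 0.5 * i);
}

// Hypothetical stand-in for the color phase: same loop structure as the snippet above,
// with cf stored row-major and a single common denominator.
double colorSum(const std::vector<cxtype>& jamp, const std::vector<double>& cf, double denom)
{
  const std::size_t n = jamp.size();
  assert(cf.size() == n * n);
  double me = 0;
  for (std::size_t icol = 0; icol < n; icol++)
  {
    cxtype ztemp(0, 0);
    for (std::size_t jcol = 0; jcol < n; jcol++)
      ztemp += cf[icol * n + jcol] * jamp[jcol];
    me += (ztemp * std::conj(jamp[icol])).real() / denom;
  }
  return me;
}

struct PhaseTimes { double diagrams; double color; double me; }; // seconds, seconds, summed result

// Accumulate wall-clock time separately for the two phases over nrep repetitions.
PhaseTimes timePhases(std::size_t ncolor, int nrep, const std::vector<double>& cf, double denom)
{
  using Clock = std::chrono::steady_clock;
  std::vector<cxtype> jamp(ncolor);
  PhaseTimes t{0, 0, 0};
  for (int irep = 0; irep < nrep; irep++)
  {
    const auto t0 = Clock::now();
    computeDiagrams(jamp);
    const auto t1 = Clock::now();
    t.me += colorSum(jamp, cf, denom); // keep the result so the call is not optimised away
    const auto t2 = Clock::now();
    t.diagrams += std::chrono::duration<double>(t1 - t0).count();
    t.color += std::chrono::duration<double>(t2 - t1).count();
  }
  return t;
}
```

The ratio t.color / (t.color + t.diagrams) is the number of interest. With real code in place of the stand-ins, many repetitions are needed for the two timings to be meaningful.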

oliviermattelaer commented 3 years ago

Looks like madgraph is using the fundamental basis and not the color-flow basis. I still have to better understand that point since I was convinced that we were using the colour-flow basis.

valassi commented 2 years ago

Hi @roiser as discussed, I suggest you use this ticket to document progress in tensor cores for color algebra if you like? (There is also the more general #118 on tensor cores if you want, but for color algebra specifically I think this one is better). Thanks! Andrea

valassi commented 2 years ago

One simple change that may be useful for tensor cores, but that I find useful anyway, is streamlining (A-iB)(M)(A+iB) as AMA+BMB, see #457

I thought this would give a performance boost, but it does not. I would still keep it, though.
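The identity behind this rewrite can be checked on a toy case. With jamp = A + iB for real vectors A and B, and a real symmetric M, the cross terms i(AMB - BMA) cancel, leaving AMA + BMB. The sketch below compares the two forms; the function names quadFormComplex and quadFormSplit are made up for this example, and it is not necessarily how #457 implements the change:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cxtype = std::complex<double>;

// Complex form: sum_i Re( conj(jamp_i) * sum_j M_ij * jamp_j ), as in the original loop.
// m is an n x n real matrix stored row-major.
double quadFormComplex(const std::vector<double>& m, const std::vector<cxtype>& jamp)
{
  const std::size_t n = jamp.size();
  assert(m.size() == n * n);
  double sum = 0;
  for (std::size_t i = 0; i < n; i++)
  {
    cxtype ztemp(0, 0);
    for (std::size_t j = 0; j < n; j++)
      ztemp += m[i * n + j] * jamp[j];
    sum += (ztemp * std::conj(jamp[i])).real();
  }
  return sum;
}

// Split form: A^T M A + B^T M B with A = Re(jamp) and B = Im(jamp).
// Equal to the complex form only under the assumption that M is real and symmetric.
double quadFormSplit(const std::vector<double>& m, const std::vector<cxtype>& jamp)
{
  const std::size_t n = jamp.size();
  assert(m.size() == n * n);
  double sum = 0;
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j < n; j++)
      sum += m[i * n + j] * (jamp[i].real() * jamp[j].real() + jamp[i].imag() * jamp[j].imag());
  return sum;
}
```

For example, with M = {{2, 1}, {1, 3}} and jamp = (1+2i, 3-i), both functions return 42. The split form needs only real arithmetic, which is the shape a real-valued matrix unit would want.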

madgraph5 / madgraph4gpu

Color algebra optimisations (use tensor cores?) #155