madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Reordering TMP lines in FFV functions affects performance by a few percent #285

Open valassi opened 2 years ago

valassi commented 2 years ago

This is a followup of #283, itself a followup of #277.

I realised that there is a small but reproducible performance difference for cuda ggttgg, simply from reordering the TMP lines (the lines calculating the TMP values) in the FFV functions.

The following, apparently harmless, commit loses around 2% of performance: https://github.com/madgraph5/madgraph4gpu/commit/7d63addb
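To illustrate what "reordering TMP lines" means in practice, here is a minimal sketch in the style of the generated FFV functions (the names FFV_example and cxtype are illustrative; this is not the actual HelAmps_sm.cu code). The two TMP lines are independent, so swapping them leaves the result unchanged, but it can still change how the compiler schedules loads and fused multiply-adds, and hence register pressure and cache behaviour.

  // Schematic sketch only, not the actual generated code: cxtype stands for
  // the complex type used in HelAmps_sm.cu, FFV_example is a made-up name.
  __device__ void FFV_example( const cxtype F1[], const cxtype F2[],
                               const cxtype V3[], const cxtype COUP,
                               cxtype* vertex )
  {
    const cxtype cI( 0., 1. );
    // Ordering A: TMP0 before TMP1 (the two lines are independent)
    const cxtype TMP0 = F1[2] * ( F2[4] * ( V3[2] + V3[5] ) + F2[5] * ( V3[3] + cI * V3[4] ) );
    const cxtype TMP1 = F1[3] * ( F2[4] * ( V3[3] - cI * V3[4] ) + F2[5] * ( V3[2] - V3[5] ) );
    // Ordering B would simply swap the two TMP lines above: the result is
    // identical, but the generated SASS (and its timing) may differ slightly.
    ( *vertex ) = COUP * ( -cI ) * ( TMP0 + TMP1 );
  }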

See the details below (using cuda 11.0 and gcc 9.2):

cd epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg

git log --oneline 7d63addb
7d63addb [epochX2] cosmetics on epoch2 HelAmps_sm.cu - reorder TMP and merge to single long lines
b985377e [epochX2] from "P1[0] =  V1[0].real()" to "P1[0] = + V1[0].real()" - better formatting without CppWriter

  git reset --hard 7d63addb
  make clean; make; ./gcheck.exe -p 2048 256 1 | egrep '(nvcc|EvtsPerSec|TOTAL       :)'
EvtsPerSec[Rnd+Rmb+ME](123)= ( 5.003405e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 5.006455e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 5.046764e+05                 )  sec^-1
TOTAL       :     7.326985 sec

  git reset --hard b985377e
  make clean; make; ./gcheck.exe -p 2048 256 1 | egrep '(nvcc|EvtsPerSec|TOTAL       :)'
EvtsPerSec[Rnd+Rmb+ME](123)= ( 5.073673e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 5.076741e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 5.118003e+05                 )  sec^-1
TOTAL       :     7.335236 sec

This ~1.4% difference is very small, but it is reproducible. I had actually already noticed this in eemumu, where some reorderings were giving changes of a few percent.

I am filing this for the record. I doubt that this will ever be a problem, but who knows. In particular, I do not see any way to predict in advance how to code-generate the best-performing order. Presently we order the TMP lines in a reproducible way (TMP0, TMP1, TMP2, etc.), and it is probably best to keep it this way.

valassi commented 2 years ago

It would still be interesting to understand the root cause; I guess this has something to do with memory access patterns and caches.
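If anyone wants to dig into this, one possible first step (just a suggestion, I have not done it here) would be to compare the compiled kernels of the two commits with the standard CUDA binary utilities, checking whether the register/resource usage or the SASS instruction scheduling differ, e.g.

  cuobjdump --dump-resource-usage ./gcheck.exe
  cuobjdump --dump-sass ./gcheck.exe > sass.txt   (build at each commit and diff the two dumps)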

valassi commented 2 years ago

The main performance regression in the code (the missing static, see #283) has been fixed in #281. This issue with the TMP ordering will not be fixed. I am closing this.