Open valassi opened 2 years ago
It would still be interesting to understand the root cause. I guess this has something to do with memory access and caches.
The main performance regression in the code (missing static, see #283) has been fixed in #281. This issue with TMP order will not be fixed. I am closing this.
This is a followup of #283, itself a followup of #277.
I realised that there is a small but reproducible performance difference for cuda ggttgg simply from reordering the lines that calculate TMP values in FFV functions.
The following, apparently harmless, commit loses around 2% of performance: https://github.com/madgraph5/madgraph4gpu/commit/7d63addb
See the details (using cuda 11.0 and gcc9.2)
This ~1.4% difference is very small, but reproducible. I had actually already noticed this in eemumu, where some reorderings were giving changes of a few percent.
I am filing this for the record. I doubt that this will ever be a problem, but who knows. In particular, I do not see any way to predict in advance how to code-generate the best-performing order. Presently we are ordering TMP lines in a reproducible way (TMP0, TMP1, TMP2 etc). This is probably best kept this way.
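For readers not familiar with the generated code, here is a minimal sketch of what "reordering TMP lines" means. This is not the actual generated FFV code: the function names (`vertexOrderA`, `vertexOrderB`, `checkOrdersAgree`), the types and the formulas are all hypothetical and only illustrate that the TMP values are independent of each other, so the order in which they are written does not change the result.

```cpp
#include <array>
#include <cassert>
#include <complex>

using cxtype = std::complex<double>;
using cxpair = std::array<cxtype, 2>;

// Hypothetical FFV-like helper (illustrative formulas, not the generated code).
// Order A: TMP0 is computed before TMP1 (the "TMP0, TMP1, ..." ordering).
cxtype vertexOrderA( const cxpair& F1, const cxpair& F2, const cxpair& V )
{
  const cxtype TMP0 = F1[0] * ( F2[0] * V[0] + F2[1] * V[1] );
  const cxtype TMP1 = F1[1] * ( F2[0] * V[1] - F2[1] * V[0] );
  return TMP0 + TMP1;
}

// Order B: the same two expressions, with the TMP lines swapped.
// TMP0 and TMP1 do not depend on each other, so the result is unchanged.
cxtype vertexOrderB( const cxpair& F1, const cxpair& F2, const cxpair& V )
{
  const cxtype TMP1 = F1[1] * ( F2[0] * V[1] - F2[1] * V[0] );
  const cxtype TMP0 = F1[0] * ( F2[0] * V[0] + F2[1] * V[1] );
  return TMP0 + TMP1;
}

// Check that both orderings agree on an arbitrary (made-up) input.
void checkOrdersAgree()
{
  const cxpair F1 = { cxtype( 1., 2. ), cxtype( 3., -1. ) };
  const cxpair F2 = { cxtype( 0., 1. ), cxtype( 2., 2. ) };
  const cxpair V = { cxtype( -1., 4. ), cxtype( 5., 0. ) };
  assert( std::abs( vertexOrderA( F1, F2, V ) - vertexOrderB( F1, F2, V ) ) < 1e-12 );
}
```

The sketch only shows that the two orderings are equivalent in their result. It says nothing about why one of them would run faster on the GPU, which is the open question in this issue.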