madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

1.5x speedup from gcc8 to gcc9 in C++ (__muldc3 overhead in gcc8 - for complex numbers?) #117

Closed valassi closed 3 years ago

valassi commented 3 years ago

This is a followup to https://github.com/madgraph5/madgraph4gpu/issues/116#issuecomment-780071843

On a very old version of the code, while trying to understand another issue (before I understood that it was actually a hardware problem on a VM: the tsc clock was not used, hence a large overhead on system calls), I tried to evaluate whether gcc9 could fix it. On the buggy hardware node this had no effect, but on a good node gcc9 was a factor 1.5 faster than gcc8 (5.5E5 throughput instead of 3.5E5).

The difference is between
. /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0/x86_64-centos7/setup.sh
and
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh

Again, this was on an old version of the code, so one should check whether this speedup also exists for more recent versions; in any case it would be nice to understand what is going on.

valassi commented 3 years ago

A flamegraph is again useful.

This is pmpe04 with gcc8

time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 1.023022e+00 sec
MeanTimeInWaveFuncs   = 2.601678e-06 sec
StdDevTimeInWaveFuncs = 5.867377e-07 sec
MinTimeInWaveFuncs    = 2.242000e-06 sec
MaxTimeInWaveFuncs    = 9.060000e-05 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 3.843672e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m1.667s
user    0m1.617s
sys     0m0.046s

perf-pmpe04-gcc8

This is pmpe04 with gcc9

time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 7.300220e-01 sec
MeanTimeInWaveFuncs   = 1.856542e-06 sec
StdDevTimeInWaveFuncs = 4.367662e-07 sec
MinTimeInWaveFuncs    = 1.561000e-06 sec
MaxTimeInWaveFuncs    = 8.322300e-05 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 5.386358e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m1.378s
user    0m1.320s
sys     0m0.055s

perf-pmpe04-gcc9

The difference between the two is a 0.6 second overhead (on top of 1.3s hard CPU) spent in __muldc3.

Note this post that links __muldc3 to complex number multiplication https://stackoverflow.com/a/49438578
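To make this concrete, here is a minimal standalone sketch (illustration only, not code from this repository; the file and function names are invented): every std::complex<double> multiplication is a C99-style complex multiply with NaN/Inf corner cases, which gcc8 at -O3 lowers to a call to the libgcc helper __muldc3, while gcc9 inlines the common case.

// muldc3demo.cpp: illustration only, not code from this repository.
#include <complex>
#include <cstdio>

// A toy loop dominated by complex multiplications, similar in spirit to the
// wavefunction routines: with gcc8 -O3 each "w *= v[i]" becomes a call to the
// libgcc helper __muldc3, which implements the NaN/Inf-aware complex multiply.
std::complex<double> prod( const std::complex<double>* v, int n )
{
  std::complex<double> w( 1., 0. );
  for ( int i = 0; i < n; ++i ) w *= v[i];
  return w;
}

int main()
{
  const std::complex<double> v[4] = { { 1.1, 0.2 }, { 0.9, -0.3 }, { 1.0, 0.5 }, { 0.7, 0.1 } };
  const std::complex<double> w = prod( v, 4 );
  std::printf( "%f %+fi\n", w.real(), w.imag() );
  return 0;
}

// To inspect (assumption: plain g++ invocations, outside our build system):
//   g++ -O3 -S muldc3demo.cpp && grep muldc3 muldc3demo.s
// With gcc8 the call sits in the hot loop; with gcc9 it should only appear in a
// cold fallback path; with -ffast-math (or -fcx-limited-range) it disappears,
// at the price of dropping the NaN/Inf handling.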

Takeaways:

Ok this is essentially understood. Needs to be revisited on the latest code.

As for gcc8 or gcc9, it looks like it is better to use gcc9 for any performance tests for the paper? (If the downside is only cuda-gdb, this might be ok... and again, it remains to be understood if/why I had issues in cuda-gdb with gcc9.)

valassi commented 3 years ago

Suggestion by Olivier: check the compilation flags... (fast math?). Also, in his C++ to Fortran comparison he had observed issues that may be related...

valassi commented 3 years ago

I confirm that I still see a factor 2+ (even more than a factor 1.5!!) between gcc8 and gcc9, even on the current latest master. This clearly means that we should use gcc9 and not gcc8.

This is the latest log and the code I use: https://github.com/madgraph5/madgraph4gpu/commit/0b4280bb7b068cd60d8ac40dcdac21fefe2290a9

Flamegraph for gcc8: image

Flamegraph for gcc9: image

A few additional observations:

@oliviermattelaer , I think you are right that, in gcc8, fast math would also solve the issue. However, that would result in an "incorrect" handling of NaN and Inf. I think that using gcc9 is a much better option: this is explained where the patch was created (note that it was not backported to gcc8), https://gcc.gnu.org/bugzilla//show_bug.cgi?id=70291, or see also https://stackoverflow.com/questions/49438158/

By the way, I am now using cuda 11.0, which is happy with gcc9 (while cuda 10.2 requires gcc8). My new reference will therefore be

valassi commented 3 years ago

Note also that gcc10 is not supported by cuda 11.0 yet. I will stick with gcc9 and not try gcc10 yet.

oliviermattelaer commented 3 years ago

Interesting.

But in madgraph NaN/Inf should not appear at any stage, so we can/should use such flags (or other tricks) for that. (For example, PY8 has its own complex multiplication class to avoid all that handling/slowdown of NaN/Inf.)
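For illustration, this is the kind of trick meant here (a sketch of the idea only, with invented names, not PY8's actual class): a hand-rolled complex type whose operator* uses the plain four-multiplication formula, so no NaN/Inf-aware helper like __muldc3 is ever needed, even without fast math.

// Sketch only (invented names, not PY8 code): naive complex multiplication,
// (a+bi)(c+di) = (ac-bd) + (ad+bc)i, with no NaN/Inf recovery step.
struct cxd
{
  double re, im;
};

inline cxd operator*( const cxd& a, const cxd& b )
{
  return cxd{ a.re * b.re - a.im * b.im,   // real part: ac - bd
              a.re * b.im + a.im * b.re }; // imaginary part: ad + bc
}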


valassi commented 3 years ago

Hi Olivier, thanks. Ok I will also try the fast math and see if it speeds things up then!

valassi commented 3 years ago

Thanks Olivier, you are right :-)

Using fast math speeds up both gcc8 (from 4.0E5 MEs/s to 1.09E6 MEs/s) and gcc9 (from 8.4E5 MEs/s to 1.16E6 MEs/s). The difference between gcc8 and gcc9 decreases a lot, but gcc9 is still a bit faster.

Flamegraph gcc8 image

Flamegraph gcc9 image

One thing that is peculiar is that FFV1P0_3 has completely disappeared from the gcc9 fast math graph. Maybe it is somehow optimized away? Of course the code must pass through there. Very strange... I will try to see if there is anything we can do to improve the performance. Also to get rid of those 'unknown' frames...


More generally, @oliviermattelaer : I am revisiting all past numbers, but for the moment note that these 1.1E6 throughputs in C++ (without vectorization and without OpenMP) are already quite a bit better than what I observe in Fortran, around 6E5 at most (timing only the ME part in a production madevent run). This is almost a factor 2 better in C++ than in Fortran. Is this possible?

Are you using fast math in Fortran, by the way? It does not look like it: https://github.com/madgraph5/madgraph4gpu/blob/master/epoch1/gridpack/eemumu/madevent/Source/makefile This is a gridpack I got essentially out-of-the-box from Madgraph 2.9.2, so I assume it should have all the fastest options? Or can you suggest better flags to use to compare our C++/CUDA and Fortran?

valassi commented 3 years ago

(I tried yum install elfutils-libelf-devel libunwind-devel audit-libs-devel slang-devel to get rid of the 'unknown' frames, but with no effect: https://unix.stackexchange.com/questions/276179/missing-stack-symbols-with-perf-events-perf-report-despite-fno-omit-frame-poi)

valassi commented 3 years ago

Ok I found it, and of course it was on Brendan Gregg's webpage all the time: http://www.brendangregg.com/perf.html#StackTraces This is a known issue.

First I tried to rebuild using -fno-omit-frame-pointer: this changes the flamegraphs, adding a few more things, but the result is still incomplete and unsatisfactory.

The next tip worked: add "--call-graph dwarf" to perf. Note that this depends on libunwind, so my previous installation of libunwind-devel (which ALSO installed libunwind) was necessary, I think.

I will commit the better flamegraphs and a few modified scripts tomorrow.

Note that indeed FFV1P0_3 is reported as "(inlined)", so libunwind is able to see it somehow, but it is more tricky than other functions.

valassi commented 3 years ago

Here is a flamegraph for gcc9 with the latest script using dwarf. It is much nicer. The graph for gcc8 is almost indistinguishable (both with fast math). image The numbers on the graph are consistent with the timings written out:

EvtsPerSec[MatrixElems] (3) = ( 1.146790e+06                 )  sec^-1
...
TOTAL       :     7.961286 sec
TOTAL (123) :     7.766326 sec
...
TOTAL   (3) :     5.486143 sec

This is for flgrAV time ./check.exe -p 2048 256 12.
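(Quick consistency check, assuming "-p 2048 256 12" corresponds to 2048 x 256 x 12 = 6291456 MEs: 6291456 / 5.486143 s is about 1.147E6 MEs/s, matching the EvtsPerSec line above.)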

About build options: apart from fast math, I added nothing specific for flamegraph (neither -fno-omit-frame-pointer nor -fno-inline nor -g). It's best to let dwarf handle it.

About libunwind: I removed it and all is still ok, so there was no need to install it. Probably dwarf uses it internally/statically. Note: dwarf is http://wiki.dwarfstd.org.

I will commit the new flgrAV. First I will also check Fortran with fast math.

valassi commented 3 years ago


I have also checked Fortran with fast math. It makes a big difference. See the timings here: https://github.com/madgraph5/madgraph4gpu/blob/4728b756c1fbe1ba8427f8e384a57bf24cdbc1a5/epoch1/gridpack/README.md The flamegraphs in the link above have a further hack to limit the height of the flames to 30 (otherwise the python3 stack depth is almost 100). The latest script I used there is https://github.com/madgraph5/madgraph4gpu/blob/4728b756c1fbe1ba8427f8e384a57bf24cdbc1a5/epoch1/gridpack/flgrAV This was an interesting read in that context: https://www.gabriel.urdhr.fr/2014/05/23/flamegraph/

Note that with the most aggressive compilation flags (fast math and -O3 in both C++ and Fortran), I get throughputs of 1.15E6/s in C++ and 1.50E6/s in Fortran. These time two different things (sigmakin in the standalone C++ application, matrix1 in the gridpack madevent Fortran application), but in principle they should be comparable. The Fortran throughput is a factor 1.3 (i.e. 30%) higher than the C++ one.

This difference of 30% between Fortran and C++ with the most aggressive flags looks comparable to what @oliviermattelaer had found in earlier tests: https://indico.cern.ch/event/907278/contributions/3818707/attachments/2020732/3378680/standalone_speed.pdf Note that he also experimented with "-O3 -fcx-fortran-rules -fcx-limited-range". I considered using these, to get the same approximations for complex arithmetic in Fortran and C++, but in the end it is probably best to compare Fortran and C++ (and CUDA) using the most aggressive options, i.e. fast math. In the CUDA build we have had -use_fast_math for a long time.

Fast math essentially intervenes here because it breaks IEEE 754 compliance, for instance in the handling of NaN and Inf in complex number arithmetic. See these two interesting links: https://gcc.gnu.org/wiki/FloatingPointMath and https://stackoverflow.com/a/49438578 This is also what this issue #117 was originally about (__muldc3 is probably also about IEEE 754 compliance for complex numbers).

All this said, this should settle the question of defining a reasonable environment for comparing our C++/CUDA with the production Fortran. I will use CentOS7, gcc9, fast math, and CUDA 11. It would still be interesting to understand what causes the 30% higher throughput in Fortran (disassemble with godbolt?), but that is probably too much.

Final comment: one should check that NaN and Inf are correctly propagated (and those events discarded) in madevent and in the other samplers like our standalone driver. I opened issue #129 about this.

valassi commented 3 years ago

For the record, I tried to use "-O3 -fcx-fortran-rules -fcx-limited-range" in both Fortran and C++. Both slow down by about 20-30%, and Fortran remains faster than C++. This is a bit different from what Olivier had found. Ok, probably not much point in investigating these compiler flags further.

valassi commented 3 years ago

A small update after this morning's findings on NaN (issue #144): using fast math is quite dangerous. Preferably we should make sure we get no NaN whatsoever, otherwise our MC integration is unreliable. Anyway, with double precision I have seen none; with single precision I had to implement an ad-hoc NaN checker, just to exclude those events.
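(For reference, such an ad-hoc checker has to be written with some care: under fast math the compiler is allowed to assume NaN never occurs, so a plain std::isnan(x) or x != x test may be optimized away. A minimal sketch of the idea, checking the IEEE 754 bit pattern directly; this is an illustration, not necessarily the exact code used here.)

#include <cstdint>
#include <cstring>

// Fast-math-safe NaN test for single precision: a float is NaN when all
// exponent bits are 1 and the mantissa is non-zero, so inspect the bits
// directly instead of relying on std::isnan (which -ffast-math may fold away).
inline bool fpIsNan( float x )
{
  std::uint32_t u;
  std::memcpy( &u, &x, sizeof( u ) );
  return ( u & 0x7f800000u ) == 0x7f800000u && ( u & 0x007fffffu ) != 0;
}

(An analogous test with 64-bit masks would cover double precision.)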

Another small update after Hadrien's useful talk on cutter this afternoon https://indico.cern.ch/event/1003975/

valassi commented 3 years ago

A few more compiler flag suggestions from Stephan (thanks!) on vectorisation/performance:

And again the page about fp math the speaker shared yesterday, so it's all in one place: https://gcc.gnu.org/wiki/FloatingPointMath

valassi commented 3 years ago

I am closing this issue because it is very old.

There is an open "standing" issue #252 about compiler flags like -O3, fast math etc. I think that is a better place to reassess these various options, with our latest code (also on the vectorized ggttgg, not only eemumu) and our latest compilers.

Note that I have moved from gcc9.2 to gcc10.3 (with cuda 11.4 in both cases) as the new baseline. See #269.

Closing this.