Analyse, tune and debug 'launch' tests (automatic comparison scripts; use fewer events; etc...)

valassi commented 1 year ago

Hi @oliviermattelaer @roiser @hageboeck @zeniheisser @Jooorgen I have finally done a few systematic 'launch' tests (using the scripts lauX.sh of #683). This is really ./bin/generate_events.

No time to analyse the details now, but the files are in WIP MR #709.

There are some icolamp crashes, see #710

And then there is the analysis of physics results and of timing performance to do, which I will do here.

First impressions

up to ggttg looks ok, same cross sections, timing performance difficult to tell
from ggttgg onwards, some crashes, different cross sections, clearly FORTRAN much slower

valassi commented 1 year ago

Looking at time performance

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> for f in ./logs_*/*txt; do echo $f; egrep '^(Thu|Fri)' $f; done 
./logs_ggtt_CPP/output.txt
Thu Jun 15 07:46:35 CEST 2023
Thu Jun 15 07:46:56 CEST 2023
./logs_ggtt_CUDA/output.txt
Thu Jun 15 07:45:46 CEST 2023
Thu Jun 15 07:46:10 CEST 2023
./logs_ggtt_FORTRAN/output.txt
Thu Jun 15 07:46:11 CEST 2023
Thu Jun 15 07:46:34 CEST 2023
./logs_ggttg_CPP/output.txt
Thu Jun 15 07:48:08 CEST 2023
Thu Jun 15 07:48:42 CEST 2023
./logs_ggttg_CUDA/output.txt
Thu Jun 15 07:46:58 CEST 2023
Thu Jun 15 07:47:32 CEST 2023
./logs_ggttg_FORTRAN/output.txt
Thu Jun 15 07:47:33 CEST 2023
Thu Jun 15 07:48:07 CEST 2023
./logs_ggttgg_CPP/output.txt
Thu Jun 15 08:02:42 CEST 2023
Thu Jun 15 08:07:21 CEST 2023
./logs_ggttgg_CUDA/output.txt
Thu Jun 15 07:48:43 CEST 2023
Thu Jun 15 07:50:39 CEST 2023
./logs_ggttgg_FORTRAN/output.txt
Thu Jun 15 07:50:40 CEST 2023
Thu Jun 15 08:02:41 CEST 2023
./logs_ggttggg_CPP/output.txt
Thu Jun 15 21:10:27 CEST 2023
Fri Jun 16 01:52:35 CEST 2023
./logs_ggttggg_CUDA/output.txt
Thu Jun 15 08:07:22 CEST 2023
Thu Jun 15 08:52:27 CEST 2023
./logs_ggttggg_FORTRAN/output.txt
Thu Jun 15 08:52:28 CEST 2023
Thu Jun 15 21:10:26 CEST 2023

Focusing on ggttgg and ggttggg, even if they are those where Fortran crashes

For ggttgg

CPP 4m39
CUDA 1m56
FORTRAN 12m01

For ggttggg

CPP 4h42m
CUDA 45m
FORTRAN 12h18m

So overall a speedup of 2x to 3x for CPP and of around 6x to 16x for CUDA with respect to FORTRAN, which is not bad.

Note that the CUDA speedup here is with respect to all 4 cores on the CPU, not a single core.
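
The durations above can be recomputed from the timestamp pairs rather than by hand. This is a minimal sketch, not part of the lauX.sh scripts: the function names are hypothetical, and it assumes that both timestamps of a run fall in the same month and time zone and that each log contains exactly one start and one end `date` line.

```shell
# Convert one `date`-style line, e.g. "Thu Jun 15 07:46:35 CEST 2023",
# to seconds since the start of the month.
# Fields: weekday, month, day-of-month, HH:MM:SS, zone, year.
month_seconds() {
  echo "$1" | awk '{ split($4, t, ":"); print $3 * 86400 + t[1] * 3600 + t[2] * 60 + t[3] }'
}

# Wall-clock seconds between a start and an end timestamp.
# Assumption: both lie in the same month and time zone.
elapsed_seconds() {
  echo $(( $(month_seconds "$2") - $(month_seconds "$1") ))
}

# Hypothetical helper: first and last timestamp lines of one log file.
log_elapsed() {
  start=$(grep -E '^(Thu|Fri)' "$1" | head -n 1)
  end=$(grep -E '^(Thu|Fri)' "$1" | tail -n 1)
  elapsed_seconds "$start" "$end"
}
```

For example, `elapsed_seconds 'Thu Jun 15 07:48:43 CEST 2023' 'Thu Jun 15 07:50:39 CEST 2023'` prints 116, i.e. the 1m56 quoted for ggttgg CUDA.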

valassi commented 1 year ago

Looking at events in lhe files - note that two runs crashed in #710

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> ls -l ./logs_*/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059846 Jun 16 17:53 ./logs_ggtt_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059847 Jun 16 17:53 ./logs_ggtt_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059850 Jun 16 17:53 ./logs_ggtt_FORTRAN/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_FORTRAN/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 11931512 Jun 16 17:53 ./logs_ggttgg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 11931402 Jun 16 17:53 ./logs_ggttgg_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 13366778 Jun 16 17:53 ./logs_ggttggg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 13366543 Jun 16 17:53 ./logs_ggttggg_CUDA/Events/run_01/unweighted_events.lhe
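
File sizes only hint at differences. A more direct check is to count the events in each file, since every event in an LHE file is wrapped in an `<event>` ... `</event>` block. A minimal sketch (the function names are hypothetical, and the files are assumed to be uncompressed as in the listing above):

```shell
# Count the events in one LHE file by counting opening <event> tags.
# The pattern requires a space or '>' after "<event", so closing tags
# and other tag names that merely start with "event" are not counted.
count_lhe_events() {
  grep -c '<event[ >]' "$1"
}

# Print "<file> <number of events>" for each file given as argument.
lhe_event_counts() {
  for f in "$@"; do
    echo "$f $(count_lhe_events "$f")"
  done
}
```

This could be called as `lhe_event_counts ./logs_*/Events/run_01/unweighted_events.lhe`.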

valassi commented 1 year ago

Concerning results

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> for f in `ls -tr ./logs_*/*txt`; do echo $f; egrep '(Cross-section|Luminosity)' $f; done 
./logs_ggtt_CPP/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggtt_CUDA/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggtt_FORTRAN/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggttg_CPP/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttg_CUDA/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttg_FORTRAN/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttgg_CPP/output.txt
INFO: Effective Luminosity 47.57523835538049 pb^-1 
     Cross-section :   252.3 +- 0.3483 pb
./logs_ggttgg_FORTRAN/output.txt
./logs_ggttgg_CUDA/output.txt
INFO: Effective Luminosity 47.60279369991978 pb^-1 
     Cross-section :   252.3 +- 0.3624 pb
./logs_ggttggg_CPP/output.txt
INFO: Effective Luminosity 95.2140713761495 pb^-1 
     Cross-section :   126 +- 0.1757 pb
./logs_ggttggg_FORTRAN/output.txt
./logs_ggttggg_CUDA/output.txt
INFO: Effective Luminosity 95.14995234074645 pb^-1 
     Cross-section :   126.1 +- 0.1732 pb

This is interesting because results are identical for CUDA, CPP and FORTRAN for ggtt, ggttg.

But for ggttgg and ggttggg there is a tiny difference between CUDA and CPP (why?). And the FORTRAN fails as per #710

valassi commented 1 year ago

In the latest commits I have rerun the tests after fixing #710 (using @oliviermattelaer select_color patch ... which however reintroduces #655 that will need to be fixed).

The latest timings are as follows

grep ELAPSED `ls -tr tlau/logs_ggtt*/*txt`
tlau/logs_ggtt_CUDA/output.txt:ELAPSED: 24 seconds
tlau/logs_ggtt_FORTRAN/output.txt:ELAPSED: 23 seconds
tlau/logs_ggtt_CPP/output.txt:ELAPSED: 22 seconds
tlau/logs_ggttg_CUDA/output.txt:ELAPSED: 35 seconds
tlau/logs_ggttg_FORTRAN/output.txt:ELAPSED: 49 seconds
tlau/logs_ggttg_CPP/output.txt:ELAPSED: 36 seconds
tlau/logs_ggttgg_CUDA/output.txt:ELAPSED: 116 seconds
tlau/logs_ggttgg_FORTRAN/output.txt:ELAPSED: 857 seconds
tlau/logs_ggttgg_CPP/output.txt:ELAPSED: 280 seconds
tlau/logs_ggttggg_CUDA/output.txt:ELAPSED: 2705 seconds
tlau/logs_ggttggg_FORTRAN/output.txt:ELAPSED: 57322 seconds
tlau/logs_ggttggg_CPP/output.txt:ELAPSED: 17034 seconds

This includes everything, all build overheads included, and uses the default survey/refine/generate settings.

The most interesting speedups, as usual, are for ggttggg - which however I will try to make shorter as these tests are really very long. Anyway in practice

CPP (512y here) is a factor 3.4 faster than FORTRAN overall (17k vs 57k seconds)
CUDA is a factor 21 faster than FORTRAN on 4 CPU cores (2.7k vs 57k)... maybe it indicates a factor of 80 over a single core, maybe not (also CUDA is faster by running over 4 cores, as the Fortran overhead is spread out, I imagine)
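
The same ratios can be derived for every process directly from the `grep ELAPSED` output. This is a rough sketch with a hypothetical function name. It assumes the `logs_<process>_<backend>/output.txt:ELAPSED: <n> seconds` layout shown above and that FORTRAN, CPP and CUDA lines are all present for each process.

```shell
# Read "grep ELAPSED" output on stdin and print, for each process,
# the speedup of CPP and CUDA relative to FORTRAN.
elapsed_speedups() {
  awk '
    {
      n = split($1, parts, "/")            # .../logs_<process>_<backend>/output.txt:ELAPSED:
      split(parts[n - 1], tag, "_")        # tag[2] = process, tag[3] = backend
      secs[tag[2], tag[3]] = $2            # elapsed seconds
      if (!(tag[2] in seen)) { seen[tag[2]] = 1; order[++np] = tag[2] }
    }
    END {
      for (i = 1; i <= np; i++) {
        p = order[i]
        f = secs[p, "FORTRAN"]
        if (secs[p, "CPP"] > 0 && secs[p, "CUDA"] > 0)
          printf "%s CPP %.1f CUDA %.1f\n", p, f / secs[p, "CPP"], f / secs[p, "CUDA"]
      }
    }'
}
```

Usage would be along the lines of ``grep ELAPSED `ls -tr tlau/logs_ggtt*/*txt` | elapsed_speedups``.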

The cross sections are very similar but with a few small differences

for f in `ls -tr tlau/logs_*/*txt`; do echo $f; egrep '(Cross-section|Luminosity)' $f; done 
tlau/logs_ggtt_CUDA/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggtt_FORTRAN/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggtt_CPP/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggttg_CUDA/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
tlau/logs_ggttg_FORTRAN/output.txt
INFO: Effective Luminosity 28.97333746486687 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
tlau/logs_ggttg_CPP/output.txt
INFO: Effective Luminosity 28.973330399461705 pb^-1 
     Cross-section :   414.2 +- 0.7846 pb
tlau/logs_ggttgg_CUDA/output.txt
INFO: Effective Luminosity 47.60279369991978 pb^-1 
     Cross-section :   252.3 +- 0.3624 pb
tlau/logs_ggttgg_FORTRAN/output.txt
INFO: Effective Luminosity 47.5680525374908 pb^-1 
     Cross-section :   252.4 +- 0.3528 pb
tlau/logs_ggttgg_CPP/output.txt
INFO: Effective Luminosity 47.57523835538049 pb^-1 
     Cross-section :   252.3 +- 0.3483 pb
tlau/logs_ggttggg_CUDA/output.txt
INFO: Effective Luminosity 95.14995234074645 pb^-1 
     Cross-section :   126.1 +- 0.1732 pb
tlau/logs_ggttggg_FORTRAN/output.txt
INFO: Effective Luminosity 95.24990591754717 pb^-1 
     Cross-section :   125.9 +- 0.1767 pb
tlau/logs_ggttggg_CPP/output.txt
INFO: Effective Luminosity 95.2140713761495 pb^-1 
     Cross-section :   126 +- 0.1757 pb

I think that this is due to the fact that the numbers of events are not multiples of 2, so effectively CUDA/CPP process a different number of events than scalar FORTRAN. I will try to tune this too.
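
To put a number on "very similar", one option is the difference between two cross sections in units of their combined uncertainty. A minimal sketch with a hypothetical function name. It treats the two errors as uncorrelated, which is probably not true here (the identical ggtt results suggest the runs share random numbers), so the value is only a rough guide.

```shell
# |xsec1 - xsec2| / sqrt(err1^2 + err2^2)
# Arguments: xsec1 err1 xsec2 err2 (all in pb).
# Assumption: the two uncertainties are uncorrelated.
xsec_pull() {
  awk -v x1="$1" -v e1="$2" -v x2="$3" -v e2="$4" \
    'BEGIN { d = x1 - x2; if (d < 0) d = -d; printf "%.2f\n", d / sqrt(e1 * e1 + e2 * e2) }'
}
```

For the largest difference above, ggttggg CUDA vs FORTRAN, `xsec_pull 126.1 0.1732 125.9 0.1767` prints 0.81, so the two values agree within one combined standard deviation.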

I also need to check the LHE files including color and helicity...

valassi commented 1 year ago

This remains one of the highest priorities in my opinion. One part of this is also being able to configure the use of fewer events in launch, to make faster tests for development (fewer events are clearly a no-go in production, but are essential for developer tests).

valassi commented 5 months ago

As discussed in #855 and #852, I remain convinced that making it possible to tune the machinery to run generate_events with reduced precision and fewer events is a priority, to enable QUICK and SYSTEMATIC tests of all processes, all fptype combinations, etc. While a reduced precision is not what the users will use, it is what developers need for unit tests and integration tests. To be discussed...

madgraph5 / madgraph4gpu

Analyse, tune and debug 'launch' tests (automatic comparison scripts; use fewer events; etc...) #711