madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Analyse, tune and debug 'launch' tests (automatic comparison scripts; use fewer events; etc...) #711

Open valassi opened 1 year ago

valassi commented 1 year ago

Hi @oliviermattelaer @roiser @hageboeck @zeniheisser @Jooorgen, I have finally run a few systematic 'launch' tests (using the lauX.sh scripts of #683). Under the hood this is really ./bin/generate_events.

No time to analyse the details now, but the files are in WIP MR #709.

There are some icolamp crashes, see #710.

Then there is the analysis of the physics results and of the timing performance still to do, which I will do here.

First impressions

valassi commented 1 year ago

Looking at time performance

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> for f in ./logs_*/*txt; do echo $f; egrep '^(Thu|Fri)' $f; done 
./logs_ggtt_CPP/output.txt
Thu Jun 15 07:46:35 CEST 2023
Thu Jun 15 07:46:56 CEST 2023
./logs_ggtt_CUDA/output.txt
Thu Jun 15 07:45:46 CEST 2023
Thu Jun 15 07:46:10 CEST 2023
./logs_ggtt_FORTRAN/output.txt
Thu Jun 15 07:46:11 CEST 2023
Thu Jun 15 07:46:34 CEST 2023
./logs_ggttg_CPP/output.txt
Thu Jun 15 07:48:08 CEST 2023
Thu Jun 15 07:48:42 CEST 2023
./logs_ggttg_CUDA/output.txt
Thu Jun 15 07:46:58 CEST 2023
Thu Jun 15 07:47:32 CEST 2023
./logs_ggttg_FORTRAN/output.txt
Thu Jun 15 07:47:33 CEST 2023
Thu Jun 15 07:48:07 CEST 2023
./logs_ggttgg_CPP/output.txt
Thu Jun 15 08:02:42 CEST 2023
Thu Jun 15 08:07:21 CEST 2023
./logs_ggttgg_CUDA/output.txt
Thu Jun 15 07:48:43 CEST 2023
Thu Jun 15 07:50:39 CEST 2023
./logs_ggttgg_FORTRAN/output.txt
Thu Jun 15 07:50:40 CEST 2023
Thu Jun 15 08:02:41 CEST 2023
./logs_ggttggg_CPP/output.txt
Thu Jun 15 21:10:27 CEST 2023
Fri Jun 16 01:52:35 CEST 2023
./logs_ggttggg_CUDA/output.txt
Thu Jun 15 08:07:22 CEST 2023
Thu Jun 15 08:52:27 CEST 2023
./logs_ggttggg_FORTRAN/output.txt
Thu Jun 15 08:52:28 CEST 2023
Thu Jun 15 21:10:26 CEST 2023

Focusing on ggttgg and ggttggg, even though these are the processes where Fortran crashes

For ggttgg

For ggttggg

So overall the speedup with respect to FORTRAN is 2x to 3x for CPP and around 6x to 15x for CUDA, which is not bad.

Note that the CUDA speedup here is with respect to all 4 cores on the CPU, not a single core.
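
For reference, the start/end timestamps in the listing above can be turned into elapsed seconds with a few lines of Python. This is only a minimal sketch: it assumes each output.txt contains exactly two `date` lines (start, then end) in the same time zone, and the helper names `parse_stamp` and `elapsed_seconds` are made up here, they are not part of the repository.

```python
from datetime import datetime


def parse_stamp(line):
    """Parse a `date` line such as 'Thu Jun 15 07:46:35 CEST 2023'.

    The weekday and the time zone name are dropped: both stamps of a
    run are assumed to be in the same zone, so only the difference matters.
    """
    parts = line.split()  # [weekday, month, day, time, zone, year]
    text = " ".join(parts[1:4] + parts[5:6])
    return datetime.strptime(text, "%b %d %H:%M:%S %Y")


def elapsed_seconds(start_line, end_line):
    """Return the wall-clock seconds between two `date` lines."""
    delta = parse_stamp(end_line) - parse_stamp(start_line)
    return int(delta.total_seconds())


if __name__ == "__main__":
    # ggttgg CPP run from the listing above
    print(elapsed_seconds("Thu Jun 15 08:02:42 CEST 2023",
                          "Thu Jun 15 08:07:21 CEST 2023"))  # 279
```

Dividing the FORTRAN value by the CPP or CUDA value for the same process then gives the speedup, bearing in mind that the FORTRAN runs of ggttgg and ggttggg crashed here.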

valassi commented 1 year ago

Looking at events in the LHE files (note that two runs crashed, see #710)

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> ls -l ./logs_*/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059846 Jun 16 17:53 ./logs_ggtt_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059847 Jun 16 17:53 ./logs_ggtt_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059850 Jun 16 17:53 ./logs_ggtt_FORTRAN/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_FORTRAN/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 11931512 Jun 16 17:53 ./logs_ggttgg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 11931402 Jun 16 17:53 ./logs_ggttgg_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 13366778 Jun 16 17:53 ./logs_ggttggg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 13366543 Jun 16 17:53 ./logs_ggttggg_CUDA/Events/run_01/unweighted_events.lhe

valassi commented 1 year ago

Concerning results

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> for f in `ls -tr ./logs_*/*txt`; do echo $f; egrep '(Cross-section|Luminosity)' $f; done 
./logs_ggtt_CPP/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggtt_CUDA/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggtt_FORTRAN/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggttg_CPP/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttg_CUDA/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttg_FORTRAN/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttgg_CPP/output.txt
INFO: Effective Luminosity 47.57523835538049 pb^-1 
     Cross-section :   252.3 +- 0.3483 pb
./logs_ggttgg_FORTRAN/output.txt
./logs_ggttgg_CUDA/output.txt
INFO: Effective Luminosity 47.60279369991978 pb^-1 
     Cross-section :   252.3 +- 0.3624 pb
./logs_ggttggg_CPP/output.txt
INFO: Effective Luminosity 95.2140713761495 pb^-1 
     Cross-section :   126 +- 0.1757 pb
./logs_ggttggg_FORTRAN/output.txt
./logs_ggttggg_CUDA/output.txt
INFO: Effective Luminosity 95.14995234074645 pb^-1 
     Cross-section :   126.1 +- 0.1732 pb

Interestingly, the results are identical for CUDA, CPP and FORTRAN for ggtt and ggttg.

For ggttgg and ggttggg, however, there is a tiny difference between CUDA and CPP (why?). And FORTRAN fails as per #710.
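
As a possible starting point for an automatic comparison script, the sketch below parses the 'Cross-section' lines and computes how many combined standard deviations apart two results are. It is only an illustration: the helper names `parse_xsec` and `pull` are hypothetical, and the pull treats the two runs as statistically independent, which is an assumption that may not hold if they share the same random number sequence.

```python
import math
import re

# Matches e.g. "     Cross-section :   252.3 +- 0.3624 pb"
_NUM = r"[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?"
XSEC_RE = re.compile(rf"Cross-section\s*:\s*({_NUM})\s*\+-\s*({_NUM})\s*pb")


def parse_xsec(line):
    """Return (value, error) in pb, or None if the line does not match."""
    match = XSEC_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def pull(a, b):
    """Absolute difference of two (value, error) pairs in units of the
    combined uncertainty sqrt(err_a^2 + err_b^2)."""
    return abs(a[0] - b[0]) / math.hypot(a[1], b[1])


if __name__ == "__main__":
    cpp = parse_xsec("     Cross-section :   126 +- 0.1757 pb")
    cuda = parse_xsec("     Cross-section :   126.1 +- 0.1732 pb")
    print(f"ggttggg CPP vs CUDA pull: {pull(cpp, cuda):.2f}")
```

A pull well below 1 only says that the two numbers agree within their quoted errors; it does not explain why they are not bit-by-bit identical.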

valassi commented 1 year ago

In the latest commits I have rerun the tests after fixing #710 (using @oliviermattelaer's select_color patch ... which however reintroduces #655, which will need to be fixed).

The latest timings are as follows

grep ELAPSED `ls -tr tlau/logs_ggtt*/*txt`
tlau/logs_ggtt_CUDA/output.txt:ELAPSED: 24 seconds
tlau/logs_ggtt_FORTRAN/output.txt:ELAPSED: 23 seconds
tlau/logs_ggtt_CPP/output.txt:ELAPSED: 22 seconds
tlau/logs_ggttg_CUDA/output.txt:ELAPSED: 35 seconds
tlau/logs_ggttg_FORTRAN/output.txt:ELAPSED: 49 seconds
tlau/logs_ggttg_CPP/output.txt:ELAPSED: 36 seconds
tlau/logs_ggttgg_CUDA/output.txt:ELAPSED: 116 seconds
tlau/logs_ggttgg_FORTRAN/output.txt:ELAPSED: 857 seconds
tlau/logs_ggttgg_CPP/output.txt:ELAPSED: 280 seconds
tlau/logs_ggttggg_CUDA/output.txt:ELAPSED: 2705 seconds
tlau/logs_ggttggg_FORTRAN/output.txt:ELAPSED: 57322 seconds
tlau/logs_ggttggg_CPP/output.txt:ELAPSED: 17034 seconds

These timings include everything, including all build overheads. The runs use the default survey/refine/generate settings.
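
The ELAPSED lines above can be converted into speedups relative to FORTRAN with a small script. This is a sketch under the assumption that the grep output keeps the `logs_<process>_<backend>/output.txt:ELAPSED: N seconds` layout shown above; `parse_elapsed` and `speedups` are illustrative names, not existing tools.

```python
import re

ELAPSED_RE = re.compile(
    r"logs_(\w+)_(CUDA|CPP|FORTRAN)/output\.txt:ELAPSED:\s*(\d+)\s*seconds"
)


def parse_elapsed(lines):
    """Return {process: {backend: seconds}} from grep ELAPSED output."""
    timings = {}
    for line in lines:
        match = ELAPSED_RE.search(line)
        if match:
            process, backend, seconds = match.groups()
            timings.setdefault(process, {})[backend] = int(seconds)
    return timings


def speedups(timings, reference="FORTRAN"):
    """Return {process: {backend: reference_time / backend_time}}."""
    result = {}
    for process, by_backend in timings.items():
        if reference not in by_backend:
            continue  # no reference timing, e.g. the reference run crashed
        ref = by_backend[reference]
        result[process] = {
            backend: ref / seconds
            for backend, seconds in by_backend.items()
            if backend != reference
        }
    return result


if __name__ == "__main__":
    sample = [
        "tlau/logs_ggttgg_CUDA/output.txt:ELAPSED: 116 seconds",
        "tlau/logs_ggttgg_FORTRAN/output.txt:ELAPSED: 857 seconds",
        "tlau/logs_ggttgg_CPP/output.txt:ELAPSED: 280 seconds",
    ]
    for process, ratios in speedups(parse_elapsed(sample)).items():
        for backend, ratio in sorted(ratios.items()):
            print(f"{process} {backend}: {ratio:.2f}x")
```

Since the timings include build overheads, these ratios describe the whole launch workflow and not the matrix-element calculation alone.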

The most interesting speedups, as usual, are for ggttggg, although I will try to make that test shorter as it is really very long. Anyway in practice

The cross sections are very similar but with a few small differences

for f in `ls -tr tlau/logs_*/*txt`; do echo $f; egrep '(Cross-section|Luminosity)' $f; done 
tlau/logs_ggtt_CUDA/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggtt_FORTRAN/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggtt_CPP/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggttg_CUDA/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
tlau/logs_ggttg_FORTRAN/output.txt
INFO: Effective Luminosity 28.97333746486687 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
tlau/logs_ggttg_CPP/output.txt
INFO: Effective Luminosity 28.973330399461705 pb^-1 
     Cross-section :   414.2 +- 0.7846 pb
tlau/logs_ggttgg_CUDA/output.txt
INFO: Effective Luminosity 47.60279369991978 pb^-1 
     Cross-section :   252.3 +- 0.3624 pb
tlau/logs_ggttgg_FORTRAN/output.txt
INFO: Effective Luminosity 47.5680525374908 pb^-1 
     Cross-section :   252.4 +- 0.3528 pb
tlau/logs_ggttgg_CPP/output.txt
INFO: Effective Luminosity 47.57523835538049 pb^-1 
     Cross-section :   252.3 +- 0.3483 pb
tlau/logs_ggttggg_CUDA/output.txt
INFO: Effective Luminosity 95.14995234074645 pb^-1 
     Cross-section :   126.1 +- 0.1732 pb
tlau/logs_ggttggg_FORTRAN/output.txt
INFO: Effective Luminosity 95.24990591754717 pb^-1 
     Cross-section :   125.9 +- 0.1767 pb
tlau/logs_ggttggg_CPP/output.txt
INFO: Effective Luminosity 95.2140713761495 pb^-1 
     Cross-section :   126 +- 0.1757 pb

I think this is because the numbers of events are not multiples of 2, so CUDA/CPP effectively process a different number of events than scalar FORTRAN. I will try to tune this too.

I also need to check the LHE files including color and helicity...
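
A first, very basic check of the LHE files could simply count the events in each file before looking at color and helicity. The sketch below relies only on the fact that each event in an LHE file is opened by an `<event>` tag; the function names are illustrative, and the example path in the usage comment is taken from the listings above.

```python
import re

# An event block starts with "<event>" or "<event ...attributes...>"
EVENT_OPEN_RE = re.compile(r"^\s*<event[\s>]")


def count_events(lines):
    """Count the <event> opening tags in an iterable of LHE lines."""
    return sum(1 for line in lines if EVENT_OPEN_RE.match(line))


def count_events_in_file(path):
    """Count the events in an (uncompressed) LHE file."""
    with open(path) as lhe:
        return count_events(lhe)


# Usage (path as in the listings above):
#   count_events_in_file("logs_ggtt_CPP/Events/run_01/unweighted_events.lhe")
```

Comparing these counts between the CUDA, CPP and FORTRAN files of one process would show whether the file size differences come from different numbers of events or only from different digits inside the events.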

valassi commented 1 year ago

This remains one of the highest priorities in my opinion. One part of this is also being able to configure launch to use fewer events, to make tests faster for development (fewer events are clearly a no-go in production, but are essential for developer tests).

valassi commented 5 months ago

As discussed in #855 and #852, I remain convinced that making it possible to tune the machinery to run generate_events with reduced precision and fewer events is a priority, to enable QUICK and SYSTEMATIC tests of all processes, all fptype combinations, etc. While a reduced precision is not what the users will use, it is what developers need for unit tests and integration tests. To be discussed...