madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

DY+3 jets cross section decreases by a factor 10 when changing vector size from 16384 to 32? #959

Open valassi opened 1 month ago

valassi commented 1 month ago

I am investigating why CMS does not see a SIMD speedup in DY+3jets, i.e. #943

Specifically, I am investigating why the 'Fortran overhead' is still so large and why it varies with the SIMD flags in C++, i.e. #958

One of the points here, as discussed in #546, is trying to understand if vector_size has an impact on speed and particularly on the speed of the 'Fortran overhead'.

On itgold91 (Intel Gold, nproc=32, no GPU) I had initially done some tests with vector_size=16384. Now I am doing the same tests with vector_size=32. I recreated the gridpacks (which was faster because the C++ builds were in ccache).

However, the first very surprising effect is that the cross section has varied by one order of magnitude?

< START: Wed Aug  7 08:53:34 PM CEST 2024
---
> START: Thu Aug  8 09:05:29 AM CEST 2024
290,299c290,295
< INFO:  Idle: 39,  Running: 32,  Completed: 1753 [ current time: 21h07 ] 
< INFO:  Idle: 38,  Running: 32,  Completed: 1754 [ current time: 21h07 ] 
< INFO:  Idle: 31,  Running: 32,  Completed: 1761 [  3.5s  ] 
< INFO:  Idle: 18,  Running: 32,  Completed: 1774 [  6.7s  ] 
< INFO:  Idle: 10,  Running: 32,  Completed: 1782 [  9.8s  ] 
< INFO:  Idle: 0,  Running: 29,  Completed: 1795 [  13.4s  ] 
< INFO:  Idle: 0,  Running: 19,  Completed: 1805 [  16.4s  ] 
< INFO:  Idle: 0,  Running: 11,  Completed: 1813 [  19.6s  ] 
< INFO:  Idle: 0,  Running: 0,  Completed: 1824 [  21.9s  ] 
< sum of cpu time of last step: 5h59m08s
---
> INFO:  Idle: 20,  Running: 31,  Completed: 1773 [ current time: 09h11 ] 
> INFO:  Idle: 19,  Running: 32,  Completed: 1773 [ current time: 09h11 ] 
> INFO:  Idle: 0,  Running: 31,  Completed: 1793 [  3s  ] 
> INFO:  Idle: 0,  Running: 14,  Completed: 1810 [  6.1s  ] 
> INFO:  Idle: 0,  Running: 0,  Completed: 1824 [  7.8s  ] 
> sum of cpu time of last step: 3h13m56s
302c298
<      Cross-section :   1.069e+04 +- 27.84 pb
---
>      Cross-section :   139.4 +- 0.6185 pb
308c304
< combination of events done in 0.41349196434020996 s 
---
> combination of events done in 0.3937568664550781 s 
405,408c401,404
< 26470.32user 549.43system 16:19.34elapsed 2758%CPU (0avgtext+0avgdata 1119336maxresident)k
< 251256inputs+31085672outputs (6402major+219118214minor)pagefaults 0swaps
< END: Wed Aug  7 09:09:53 PM CEST 2024
< ELAPSED: 979 seconds
---
> 11662.02user 249.60system 8:06.23elapsed 2449%CPU (0avgtext+0avgdata 76640maxresident)k
> 289688inputs+22151728outputs (3133major+71995085minor)pagefaults 0swaps
> END: Thu Aug  8 09:13:35 AM CEST 2024
> ELAPSED: 486 seconds
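To put a number on the change, here is a minimal Python sketch that parses the two `Cross-section` lines from the diff above and prints their ratio. The regular expression and the helper name `parse_xsec` are illustrative assumptions, not part of MG5aMC; the only inputs are the two log lines copied verbatim.

```python
import re

# Lines copied from the diff above:
# '<' is the vector_size=16384 run, '>' is the vector_size=32 run.
LOG_LINES = [
    "<      Cross-section :   1.069e+04 +- 27.84 pb",
    ">      Cross-section :   139.4 +- 0.6185 pb",
]

# Hypothetical pattern for this log format: value, '+-', error, 'pb'.
PATTERN = re.compile(
    r"Cross-section\s*:\s*([0-9.eE+-]+)\s*\+-\s*([0-9.eE+-]+)\s*pb"
)


def parse_xsec(line):
    """Return (value, error) in pb from a 'Cross-section' log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"no cross section found in: {line!r}")
    return float(match.group(1)), float(match.group(2))


xsec_16384, err_16384 = parse_xsec(LOG_LINES[0])
xsec_32, err_32 = parse_xsec(LOG_LINES[1])

# Ratio of the two central values (dimensionless).
ratio = xsec_16384 / xsec_32
print(f"vector_size=16384: {xsec_16384:.1f} +- {err_16384:.2f} pb")
print(f"vector_size=32:    {xsec_32:.1f} +- {err_32:.4f} pb")
print(f"ratio: {ratio:.1f}")
```

With these two lines the ratio comes out at roughly 77, so the gap is closer to two orders of magnitude than one.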

@oliviermattelaer is this something you would expect because of problems covering the phase space with large vector sizes? Or does this sound like a bug?

Or, is it that this process diverges and one has to put some physics cuts? @choij1589 do you have some physics cuts in your DY+3jets?

Thanks Andrea

valassi commented 1 month ago

(Then there is always the possibility that I am doing something really stupid...)

choij1589 commented 1 month ago

@valassi No, in the old version presented in the meetings only the mll = 50 cut is applied for CMS in the run_card. But since xqcut had been specified, there would be automatic cuts on ptj due to auto_ptj_mjj?

oliviermattelaer commented 1 month ago

@choij1589 Do you have ickkw=1? (not sure of the support of xqcut if ickkw=0) But yes in that mode you do have sensible cuts.

Now, given the small statistical error, this is likely not related to the cuts (if you had a singularity, the error on the cross-section should be bigger than that).
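As a rough check of that argument, the relative statistical errors of the two runs can be computed from the numbers quoted in the first comment. This is only a sketch; the dictionary layout and the function name `relative_error` are made up for illustration.

```python
# Cross sections (pb) and statistical errors (pb) quoted in the first comment.
RUNS = {
    16384: (1.069e04, 27.84),   # vector_size=16384
    32: (139.4, 0.6185),        # vector_size=32
}


def relative_error(value, error):
    """Relative statistical error as a fraction of the central value."""
    return error / value


for vector_size, (value, error) in RUNS.items():
    rel = relative_error(value, error)
    print(f"vector_size={vector_size}: relative error = {100 * rel:.2f}%")
```

Both runs come out below one percent, which is the sense in which the error is "small": an unregulated singularity would normally show up as a much larger, unstable error estimate.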

@oliviermattelaer is this something you would expect because of problems covering the phase space with large vector sizes? Or does this sound like a bug?

I would not expect an issue at the cross-section level (more at the distribution level) when setting a large vector size (which is the issue that the "channelId" branch is fixing). So this sounds to me like you have identified a new bug here.

choij1589 commented 1 month ago

@oliviermattelaer yes, ickkw=1 is turned on for the CMS default (though we are not thinking about merging at this step)

oliviermattelaer commented 1 month ago

So I have made some pure Fortran comparisons using the following script:

generate p p  > l+ l- 3j
output 
launch
set mmll 50
set ickkw 1
set xqcut 20

For different branches of MG5aMC (no plugin impact here, pure mg5 Fortran):

So conclusions here:

 cluster.f: Error. Invalid combination.
 error for clustering
At line 669 of file cluster.f
Fortran runtime error: Index '-1562495544' of dimension 1 of array 'imap' below lower bound of 1

At this stage, it is not clear if:

valassi commented 1 month ago

So this sounds to me like you have identified a new bug here.

Ouf that does not sound good :-(

Again I might be doing something silly, but I repeated the test and I seem to see this again. Maybe @choij1589 you can also try in your setup please? Run once with vector_size=16384 and once with 32 in the runcards.

@oliviermattelaer note a few points

choij1589 commented 4 weeks ago

@valassi Sorry I missed this issue, I will come back after testing with different vector_size configurations.

choij1589 commented 3 weeks ago

Hi @valassi, I have checked DY+3j and the cross sections are different:

vector_size=32: 1357 \pm 1.473 pb
vector_size=16384: 1369 \pm 2.333 pb

but the difference is within 5 sigma, ~ (1369-1357)/(1.473+2.333), so I am not sure these are actually different xsecs like in DY+4j (as @oliviermattelaer quoted?)
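For reference, the estimate above adds the two errors linearly. A minimal Python sketch of both that estimate and the more usual quadrature combination follows; the quadrature version assumes the two runs are statistically independent with Gaussian errors, which is an assumption on my side and not something established in this thread. The function names are illustrative.

```python
import math


def sigma_linear(x1, e1, x2, e2):
    """Difference divided by the linear sum of the errors (as in the comment above)."""
    return abs(x1 - x2) / (e1 + e2)


def sigma_quadrature(x1, e1, x2, e2):
    """Difference divided by the errors added in quadrature.

    Assumes the two measurements are independent with Gaussian errors.
    """
    return abs(x1 - x2) / math.hypot(e1, e2)


# DY+3j cross sections (pb) and errors (pb) quoted above.
small = (1357.0, 1.473)   # vector_size=32
large = (1369.0, 2.333)   # vector_size=16384

print(f"linear sum: {sigma_linear(*small, *large):.2f} sigma")
print(f"quadrature: {sigma_quadrature(*small, *large):.2f} sigma")
```

The linear sum gives about 3.2 sigma and the quadrature sum about 4.3 sigma, so both stay under 5 sigma.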

oliviermattelaer commented 3 weeks ago

Those two do indeed sound compatible with each other.