DY+3 jets cross section decreases by a factor 10 when changing vector size from 16384 to 32?

valassi commented 1 month ago

I am investigating why CMS does not see a SIMD speedup in DY+3jets ie #943

Specifically I am investigating why the 'Fortran overhead' is still so large and why it varies with SIMD flags in c++ ie #958

One of the points here, as discussed in #546, is trying to understand if vector_size has an impact on speed and particularly on the speed of the 'Fortran overhead'.

On itgold91 (Intel Gold, nproc=32, no GPU) I had initially done some tests with vector_size=16384. Now I am doing the same tests with vector_size=32. I recretaed the gridpacks (which was faster because the c++ builds were in ccache).

However, the first very surprising effect is that the cross section has varied by one order of magnitude?

< START: Wed Aug  7 08:53:34 PM CEST 2024
---
> START: Thu Aug  8 09:05:29 AM CEST 2024
290,299c290,295
< INFO:  Idle: 39,  Running: 32,  Completed: 1753 [ current time: 21h07 ] 
< INFO:  Idle: 38,  Running: 32,  Completed: 1754 [ current time: 21h07 ] 
< INFO:  Idle: 31,  Running: 32,  Completed: 1761 [  3.5s  ] 
< INFO:  Idle: 18,  Running: 32,  Completed: 1774 [  6.7s  ] 
< INFO:  Idle: 10,  Running: 32,  Completed: 1782 [  9.8s  ] 
< INFO:  Idle: 0,  Running: 29,  Completed: 1795 [  13.4s  ] 
< INFO:  Idle: 0,  Running: 19,  Completed: 1805 [  16.4s  ] 
< INFO:  Idle: 0,  Running: 11,  Completed: 1813 [  19.6s  ] 
< INFO:  Idle: 0,  Running: 0,  Completed: 1824 [  21.9s  ] 
< sum of cpu time of last step: 5h59m08s
---
> INFO:  Idle: 20,  Running: 31,  Completed: 1773 [ current time: 09h11 ] 
> INFO:  Idle: 19,  Running: 32,  Completed: 1773 [ current time: 09h11 ] 
> INFO:  Idle: 0,  Running: 31,  Completed: 1793 [  3s  ] 
> INFO:  Idle: 0,  Running: 14,  Completed: 1810 [  6.1s  ] 
> INFO:  Idle: 0,  Running: 0,  Completed: 1824 [  7.8s  ] 
> sum of cpu time of last step: 3h13m56s
302c298
<      Cross-section :   1.069e+04 +- 27.84 pb
---
>      Cross-section :   139.4 +- 0.6185 pb
308c304
< combination of events done in 0.41349196434020996 s 
---
> combination of events done in 0.3937568664550781 s 
405,408c401,404
< 26470.32user 549.43system 16:19.34elapsed 2758%CPU (0avgtext+0avgdata 1119336maxresident)k
< 251256inputs+31085672outputs (6402major+219118214minor)pagefaults 0swaps
< END: Wed Aug  7 09:09:53 PM CEST 2024
< ELAPSED: 979 seconds
---
> 11662.02user 249.60system 8:06.23elapsed 2449%CPU (0avgtext+0avgdata 76640maxresident)k
> 289688inputs+22151728outputs (3133major+71995085minor)pagefaults 0swaps
> END: Thu Aug  8 09:13:35 AM CEST 2024
> ELAPSED: 486 seconds

@oliviermattelaer is this something you would expect because of problems covering the phase space with large vector sizes? Or does this sound like a bug?

Or, is it that this process diverges and one has to put some physics cuts? @choij1589 do you have some physics cuts in your DY+3jets?

Thanks Andrea

valassi commented 1 month ago

(Then there is always the possibility that I am doing something really stupid...)

choij1589 commented 1 month ago

@valassi No, in the old version presented in the meetings only mll = 50 cut is implied for CMS in run_card - but since xqcut had been specified there would be automatic cuts on ptj due to auto_ptj_mjj?

oliviermattelaer commented 1 month ago

@choij1589 Do you have ickkw=1? (not sure of the support of xqcut if ickkw=0) But yes in that mode you do have sensible cuts.

Now given the small statistical error this is likely not related to the cuts (if you have singularity, the error on the cross-section should be bigger than that).

@oliviermattelaer is this something you would expect because of problems covering the phase space with large vector sizes? Or does this sound like a bug?

I would not expect an issue at the cross-section level (more at the distribution level) when setting large vector size (which is issue that the "channelId" branch is fixing). So this sounds to that you identify a new bug here.

choij1589 commented 1 month ago

@oliviermattelaer yes ickkw=1 turned on for CMS default (though not thinking about merging in this step)

oliviermattelaer commented 1 month ago

So I have made some pure fortran comparison for the following script:

generate p p  > l+ l- 3j
output 
launch
set mmll 50
set ickkw 1
set xqcut 20

For different branch of MG5aMC (no plugin impact here, pure mg5 fortran):

LTS: 125.1 +- 0.2817 pb
3.6.0: 124.1 +- 0.3227 pb (This I consider compatible with LTS)
gpucpp_for360: 122.7 +- 0.3098 pb (note failing in the selection of helicity for the helicity recycling)
gpucpp_goodhel: 122.74 ± 0.309 (note failing in the selection of helicity for the helicity recycling + failing in combine events)
gpucpp: 122.7 +- 0.3098 pb (note failing in the selection of helicity for the helicity recycling)

So conclusions here:

the cross-section is modified between gpucpp and "official" branch of MG5aMC.
they are an issue for determining helicity recycling (only for the 3j case actually) which is related to the black box and the clustering of such events (see the log of the helicity recycling code):

 cluster.f: Error. Invalid combination.
 error for clustering
At line 669 of file cluster.f
Fortran runtime error: Index '-1562495544' of dimension 1 of array 'imap' below lower bound of 1

At this stage, It is not clear if:

what I found is related or not to your 10x factor
if the helicity recycling crash is related to the missmatch of cross-section (it does fall back to no-helicity recycling) So for the moment, I will investigate the helicity recycling point and then the missmatch in the MG5aMC status issue: #867 https://github.com/madgraph5/madgraph4gpu/issues/867#issuecomment-2275315137

valassi commented 1 month ago

So this sounds to that you identify a new bug here.

Ouf that does not sound good :-(

Again I might be doing something silly, but I repeated the test and I seem to see this again. Maybe @choij1589 you can also try in your setup please? Run once with vector_size=16384 and once with 32 in the runcards.

@oliviermattelaer note a few points

unless I am mistaken this is all using fortran MEs. So if it is a bug it is probably in the madevent fortran infrastructure, nothing to do with cudacpp?
I think (am not extra sure) that I am not using any VECSIZE_USED here, so hopefully this is not to blame
I noticed in generation from gridpacks (not in gridpack creation, which is what I describe here) that many more events are processed when a large vector size is set... this is not surprising, but what I mean is that the results above are in any case based on very different of MC points
last, again I find it strange that event generation from gridpack prints out a zero cross section (#716) , I am not sure if this is normal?

choij1589 commented 1 month ago

@valassi Sorry I missed this issue, I will come back after testing with different vector_size configurations.

choij1589 commented 4 weeks ago

Hi @valassi , I have check DY+3j and the cross sections are different, vector_size=32: 1357 \pm 1.473 pb vector_size=16384: 1369 \pm 2.333 pb

but within 5 sigma difference ~ (1369-1357)/(1.473+2.333) so not sure it's actual different xsecs like in DY+4j(as @oliviermattelaer quoted?)

oliviermattelaer commented 4 weeks ago

Those two sounds indeed compatible with each other.

madgraph5 / madgraph4gpu

DY+3 jets cross section decreases by a factor 10 when changing vector size from 16384 to 32? #959