madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Understand why CMS sees cross section discrepancy between fortran and cuda/cpp for DY+4 jets #944

Open valassi opened 3 months ago

valassi commented 3 months ago

This is another followup to the meeting with CMS last week and the meeting with CMS yesterday https://indico.cern.ch/event/1373473/

@choij1589 presented results where the cross section for Drell Yan plus 4 jets is different in fortran and in cuda/cpp

We should understand why CMS sees this cross section discrepancy for DY+4 jets

NB: IIUC the fortran version here is the original fortran (no vector_size), not the cudacpp version (with vector_size)



valassi commented 3 months ago

Thanks to @choij1589 see these extra numbers https://github.com/madgraph5/madgraph4gpu/issues/943#issuecomment-2268366920


I am very surprised, because Jin was quoting

But the new numbers are

My feeling here is that the errors are completely underestimated. I mean, the various fortran numbers, 2188, 2206, 2228, should be consistent within errors, right, @oliviermattelaer ?

Otherwise, if the errors are underestimated, I guess that the discrepancy for DY+4 jets is also completely acceptable?...

oliviermattelaer commented 3 months ago

My feeling here is that the errors are completely underestimated.

This is an estimator of the (one sigma) error assuming that all channels of integration are completely uncorrelated. Because of that no-correlation assumption, these errors are typically slightly underestimated. So if you compare "2188 +- 4" and "2206 +- 4", the difference is 18 +- 8, i.e. a ~2 sigma difference. I typically do not worry about anything less than a 3 sigma mismatch, so I would say that this sounds compatible. (But there is no guarantee that the two fortran versions will be compatible, and it is impossible to have a bit-by-bit comparison between two fortran versions.)

For DY+4j, the difference is 38 +- 0.6, i.e. 63 sigma... which is far too large.
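The compatibility arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch (the `significance` helper is mine, not part of the MadGraph code); the "conservative" mode adds the two errors linearly, as in the 18 +- 8 quoted above, while the uncorrelated mode adds them in quadrature:

```python
import math

def significance(x1, e1, x2, e2, conservative=True):
    """Significance (in sigmas) of the difference between two measurements.

    conservative=True adds the errors linearly (as in the 18 +- 8 above);
    conservative=False combines them in quadrature, which assumes the two
    measurements are statistically uncorrelated.
    """
    diff = abs(x1 - x2)
    err = e1 + e2 if conservative else math.hypot(e1, e2)
    return diff / err

# The two fortran numbers quoted above: compatible at the ~2 sigma level
print(significance(2188, 4, 2206, 4))  # -> 2.25
```

A difference of 38 with a combined error of 0.6, as in the DY+4j case, gives 38 / 0.6 ≈ 63 sigma by the same arithmetic.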

Cheers,

Olivier

PS: Reducing the "285" subprocesses is something we have to put on our todo list; this is too large and is/will be a blocker for CMS (it should be easy to fix for SIMD, likely more problematic for GPU).

valassi commented 3 months ago

So if you compare "2188 +- 4" and "2206 +- 4 " the difference is 18+-8 so ~2 sigma difference.

Hi Olivier thanks.

I was more comparing "2188+-4" to "2236+-0.5" from an earlier slide by Jin (which I guess was produced in the same setup).

One idea I had was to try using different random numbers and get a spread. On x10 runs with the "new" fortran (i.e. split into very many processes), DY+2jets gives me this

     Cross-section :   2.26e+04 +- 25.7 pb
     Cross-section :   2.274e+04 +- 26.02 pb
     Cross-section :   2.261e+04 +- 25.98 pb
     Cross-section :   2.268e+04 +- 30.3 pb
     Cross-section :   2.26e+04 +- 29.1 pb
     Cross-section :   2.266e+04 +- 28.5 pb
     Cross-section :   2.259e+04 +- 25.3 pb
     Cross-section :   2.256e+04 +- 27.53 pb
     Cross-section :   2.278e+04 +- 24.88 pb
     Cross-section :   2.27e+04 +- 24.88 pb

Or more precisely

more tlau/logs_ppdy012j.mad_fortran/*txt | egrep '(Current est)'
- Current estimate of cross-section: 22604.882597000003 +- 25.69693417269259
- Current estimate of cross-section: 22736.487131999995 +- 26.02223931415431
- Current estimate of cross-section: 22606.672284000004 +- 25.982101016390413
- Current estimate of cross-section: 22680.418818000002 +- 30.296789851771535
- Current estimate of cross-section: 22598.979159 +- 29.095684586947588
- Current estimate of cross-section: 22661.842675000004 +- 28.504426906822836
- Current estimate of cross-section: 22594.760607 +- 25.30150482309723
- Current estimate of cross-section: 22562.885393999994 +- 27.53350228395446
- Current estimate of cross-section: 22783.444705999995 +- 24.879796947884447
- Current estimate of cross-section: 22699.778944 +- 24.883887513199372

At first glance, the error is indeed underestimated: the run-to-run spread is much larger than the quoted per-run errors.
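To quantify this, one can compare the run-to-run scatter of the ten results with the quoted per-run errors (a quick standalone check in Python; the numbers are copied from the list above):

```python
import statistics

# Ten fortran cross-sections (pb) and quoted errors, from the x10 runs above
xsec = [22604.88, 22736.49, 22606.67, 22680.42, 22598.98,
        22661.84, 22594.76, 22562.89, 22783.44, 22699.78]
errs = [25.70, 26.02, 25.98, 30.30, 29.10,
        28.50, 25.30, 27.53, 24.88, 24.88]

mean = statistics.mean(xsec)
scatter = statistics.stdev(xsec)   # observed run-to-run spread
quoted = statistics.mean(errs)     # average quoted per-run error

# chi2/ndf of the runs around their mean, using the quoted errors;
# values well above 1 indicate underestimated errors
chi2 = sum(((x - mean) / e) ** 2 for x, e in zip(xsec, errs))
ndf = len(xsec) - 1
print(f"scatter ~ {scatter:.0f} pb vs quoted ~ {quoted:.0f} pb, chi2/ndf = {chi2 / ndf:.1f}")
```

With these numbers the scatter (~70 pb) is roughly 2.5-3 times the average quoted error (~26 pb), i.e. chi2/ndf is well above 1, consistent with the quoted errors being underestimated.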

But I also do not understand why I get a cross section of 23000 while Jin gets 2300 (a factor 10 lower)?

This is the following process, I thought this was consistent?

[avalassi@itgold91 gcc11/usr] /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp> more pp_dy012j.mad/mg5.in 
set stdout_level DEBUG
set zerowidth_tchannel F
import model sm-no_b_mass
define p = u d c s b u~ d~ c~ s~ b~ g
define j = p
define ell+ = e+ mu+ ta+
define ell- = e- mu- ta-
define nu = ve vm vt
define nubar = ve~ vm~ vt~
generate p p > ell+ ell- @0
add process p p > ell+ ell- j @1
add process p p > ell+ ell- j j @2
output madevent_simd pp_dy012j.mad --hel_recycling=False --vector_size=32 

PS Just so that I do not forget where this was, it is here from WIP PR #946

[avalassi@itscrd90 bash] /data/avalassi/GPU2023/ghav-madgraph4gpu/epochX/cudacpp> git reset --hard f1a9800900c9b9d85f62d03196eed15863d7891d
HEAD is now at f1a980090 [cmsdy] in tlau add the results of x10 ppttdy012j fortran tests (manually fix the directory name)
[avalassi@itscrd90 bash] /data/avalassi/GPU2023/ghav-madgraph4gpu/epochX/cudacpp> more tlau/logs_ppdy012j_fortran/*txt | egrep '(Current est)'
- Current estimate of cross-section: 22604.882597000003 +- 25.69693417269259
- Current estimate of cross-section: 22736.487131999995 +- 26.02223931415431
- Current estimate of cross-section: 22606.672284000004 +- 25.982101016390413
- Current estimate of cross-section: 22680.418818000002 +- 30.296789851771535
- Current estimate of cross-section: 22598.979159 +- 29.095684586947588
- Current estimate of cross-section: 22661.842675000004 +- 28.504426906822836
- Current estimate of cross-section: 22594.760607 +- 25.30150482309723
- Current estimate of cross-section: 22562.885393999994 +- 27.53350228395446
- Current estimate of cross-section: 22783.444705999995 +- 24.879796947884447
- Current estimate of cross-section: 22699.778944 +- 24.883887513199372
choij1589 commented 3 months ago

Hi @valassi , I think the factor-10 xsec difference comes from the add process commands. My production does not include DY+0j or DY+1j, so it's like

generate p p > ell+ ell- j j @0 
output madevent_gpu DY2j...
valassi commented 3 months ago

Hi @valassi , I think the factor-10 xsec difference comes from the add process commands. My production does not include DY+0j or DY+1j, so it's like

generate p p > ell+ ell- j j @0 
output madevent_gpu DY2j...

Thanks Jin! That explains it, I will try 2j only.

PS Could it be that you also have cuts? If I just add up 0j, 1j, 2j from your slide I get about 6000+4000+3000, i.e. 13000, while I see 23000.

Qubitol commented 3 months ago

Hi, here are some fresh tests on DY+3j with mg5amcnlo 3.5.5, i.e. the current upstream Fortran.

Tests on DY+3j

Setup

Runs

I varied the parameters:

The reason I also fixed the scales is that I had recorded that different values of sde_strategy may result in different cross-section values. This was a tip from @oliviermattelaer.

With fixed_ren_scale, fixed_fac_scale = False

| sde_strategy = 1 | sde_strategy = 2 |
| --- | --- |
| 1380 +- 3.2 | 1506 +- 4.1 |
| 1385 +- 3.3 | 1512 +- 4.5 |
| 1391 +- 3 | 1519 +- 3.8 |
| 1394 +- 3.4 | 1511 +- 3.7 |
| 1388 +- 3.1 | 1512 +- 3.8 |
| **Average: 1387.6 +- 1.4** | **Average: 1512.0 +- 1.8** |

With fixed_ren_scale, fixed_fac_scale = True

| sde_strategy = 1 | sde_strategy = 2 |
| --- | --- |
| 1477 +- 3.5 | 1542 +- 4.1 |
| 1475 +- 3.5 | 1545 +- 4.5 |
| 1474 +- 3.5 | 1544 +- 4.2 |
| 1475 +- 3.6 | 1545 +- 3.7 |
| 1471 +- 3.4 | 1547 +- 4.3 |
| **Average: 1474.4 +- 1.6** | **Average: 1544 +- 1.9** |
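The "Average" rows are consistent with an inverse-variance weighted mean of the five runs. A minimal sketch of that computation (the `weighted_average` helper is mine, not from the MadGraph code):

```python
# Inverse-variance weighted average of independent measurements:
# weight each value by 1/error^2; the combined error is 1/sqrt(sum of weights).
def weighted_average(values, errors):
    weights = [1.0 / e ** 2 for e in errors]
    mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    err = (1.0 / sum(weights)) ** 0.5
    return mean, err

# sde_strategy = 1, non-fixed scales (first column of the table above)
print(weighted_average([1380, 1385, 1391, 1394, 1388], [3.2, 3.3, 3.0, 3.4, 3.1]))
# -> roughly (1387.6, 1.4), matching the quoted average
```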

Jin's results (from the slides of 30/07) are:

Comments

It seems to me that sde_strategy = 1 with non-fixed scales reproduces the CUDA results well. However, I'm a bit worried about the differences between the sde_strategy choices.

@oliviermattelaer what do you think?

oliviermattelaer commented 3 months ago

Thanks a lot Daniele, very clear report.

I agree with you that this indicates that DY+jets has some issue with the phase-space integration (likely for sde_strategy=2). Maybe one thing you can do here (if you have the time) is to force the phase-space point for all channels of integration to be (always) the same (you can do that when the smatrix function is called), then print the multi-channel factor for each channel and check that the sum of those is indeed one for each strategy.
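As a toy illustration of the suggested check (in Python rather than the actual Fortran, and with `channel_measures` as a purely hypothetical stand-in for the per-channel measures computed inside smatrix), the normalized multi-channel factors at a fixed phase-space point should sum to one:

```python
# In a multi-channel decomposition, each channel i gets a factor
# alpha_i = m_i / sum_j m_j, where m_i is that channel's measure at the
# SAME phase-space point. By construction the alphas must sum to one;
# printing them per channel, as suggested above, would expose any
# normalization problem in a given sde_strategy.
def multichannel_factors(channel_measures):
    total = sum(channel_measures)
    return [m / total for m in channel_measures]

factors = multichannel_factors([0.5, 1.2, 0.05, 3.0])
print(sum(factors))  # -> 1.0 (up to floating-point rounding)
```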

Cheers,

Olivier