umd-lhcb / lhcb-ntuples-gen

ntuples generation with DaVinci and in-house offline components
BSD 2-Clause "Simplified" License

Understand MC efficiencies for cocktail and large samples #81

Closed manuelfs closed 2 years ago

manuelfs commented 3 years ago

As presented to the semileptonic group, we had seen a 37% higher efficiency in Run 2 B->D*+ mu nu data than in Run 1 (with Run 1 cuts and accounting for a luminosity ratio of 1.41 and cross section ratio of 2)

image
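As a rough sketch of the normalization described above, the efficiency ratio is the observed yield ratio divided by the luminosity and cross-section ratios. The function name and the yield ratio of 3.8634 are assumptions for illustration: that value is simply back-computed as 1.41 x 2 x 1.37 from the numbers quoted in this comment, not a measured yield ratio.

```python
def efficiency_ratio(yield_ratio, lumi_ratio, xsec_ratio):
    """Run 2 / Run 1 efficiency ratio after dividing out luminosity and cross section."""
    return yield_ratio / (lumi_ratio * xsec_ratio)


# Illustrative input only: 1.41 * 2 * 1.37, back-computed from the quoted numbers.
ratio = efficiency_ratio(yield_ratio=3.8634, lumi_ratio=1.41, xsec_ratio=2.0)
print(f"Run 2 / Run 1 efficiency ratio: {ratio:.2f}")
```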

We saw similar (though lower) factors with the cocktail MC

image

However, when we tried to do this comparison with the large B->D*+ mu nu MC samples, we found a much higher efficiency in Run 2

image

We need to understand the efficiencies of the large MC samples, which will be the ones we use in the analysis. First steps will be:

manuelfs commented 3 years ago

Checked that Yipeng found the right number of events for

manuelfs commented 3 years ago

Svende documented the Run 1 generator efficiencies in this post. For B->D*+ mu nu it is 5.88%, coming from here.

For Run 2, it is 8.04%, coming from here.

So the generator cut efficiencies above are right.

yipengsun commented 3 years ago

I want to do some code cleanups to make sure the styling is consistent in this repo. However, I see no point doing it in the middle of development. Once you finalize your changes, can you ping me here so I can do some quick cleanups on these scripts?

yipengsun commented 3 years ago

BTW, by "consistent coding style", I mean that the code checkers flake8 and pylint show no errors or warnings. This project includes .flake8 and .pylintrc files to suppress some default warnings.

Since you are migrating to VS Code, you might be able to enable these warnings there, if you like.

manuelfs commented 3 years ago

I activated pylint in VSCode and saw that it flags importing several packages on one line and a missing newline at the end of the file, so I quickly fixed both.
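For illustration, the two issues mentioned here (reported by pylint as `multiple-imports` and `missing-final-newline`) can be detected with a few lines of standard-library Python. This is only a sketch of the idea and not how pylint itself is implemented; `find_style_issues` is a hypothetical helper name.

```python
import ast


def find_style_issues(source):
    """Return a list of messages for the two style issues discussed above."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        # "import os, sys" yields a single Import node with two names.
        if isinstance(node, ast.Import) and len(node.names) > 1:
            issues.append(f"line {node.lineno}: multiple imports on one line")
    # A non-empty file should end with a newline character.
    if source and not source.endswith("\n"):
        issues.append("missing newline at end of file")
    return issues


print(find_style_issues("import os, sys\n"))
print(find_style_issues("import os\nimport sys"))
```

The fix in both cases is mechanical: put each import on its own line and end the file with a newline.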

While I think it is helpful to aim at having consistent styles and I'll try to follow the pylint specifications, I'm wary about enforcing it too strongly, for various reasons

So let us all try to be reasonable.

manuelfs commented 3 years ago

Reran the 3 types of cutflow using the same scripts/run_cutflows.py script and obtained Run 2/Run 1 efficiency ratios consistent with those above

| Cut | Run 1 | Run 2 | Run 1 $\epsilon$ | Run 2 $\epsilon$ | $\epsilon$ ratio |
|---|---|---|---|---|---|
| Total events | 118213 | 126958 | - | - | - |
| Signal truth-matching | 4388 | 4638 | 3.7 | 3.7 | 0.98 |
| Trig. + Strip. | 151 | 397 | 3.4 | 8.6 | 2.49 |
| Offline $D^0$ cuts | 71 | 106 | 47.0 | 26.7 | 0.57 |
| Offline $\mu$ cuts | 66 | 77 | 93.0 | 72.6 | 0.78 |
| Offline $D^* \mu$ combo cuts | 50 | 60 | 75.8 | 77.9 | 1.03 |
| $BDT_{iso} < 0.15$ | 43 | 39 | 86.0 | 65.0 | 0.76 |
| Total eff. | - | - | 0.04 | 0.03 | 0.84 |
| Yield ratio x 0.99 | - | - | 43 | 39 | 0.90 |

| Cut | Run 1 | Run 2 | Run 1 $\epsilon$ | Run 2 $\epsilon$ | $\epsilon$ ratio |
|---|---|---|---|---|---|
| Total events | 118213 | 126958 | - | - | - |
| Normalization truth-matching | 76567 | 82950 | 64.8 | 65.3 | 1.01 |
| Trig. + Strip. | 2898 | 8702 | 3.8 | 10.5 | 2.77 |
| Offline $D^0$ cuts | 1546 | 2171 | 53.3 | 24.9 | 0.47 |
| Offline $\mu$ cuts | 1348 | 1731 | 87.2 | 79.7 | 0.91 |
| Offline $D^* \mu$ combo cuts | 1039 | 1267 | 77.1 | 73.2 | 0.95 |
| $BDT_{iso} < 0.15$ | 880 | 1012 | 84.7 | 79.9 | 0.94 |
| Total eff. | - | - | 0.74 | 0.80 | 1.07 |
| Yield ratio x 0.99 | - | - | 880 | 1012 | 1.14 |

| Cut | Run 1 | Run 2 | Run 1 $\epsilon$ | Run 2 $\epsilon$ | $\epsilon$ ratio |
|---|---|---|---|---|---|
| Total events | 118213 | 126958 | - | - | - |
| $D^{**}$ truth-matching | 35827 | 37755 | 30.3 | 29.7 | 0.98 |
| Trig. + Strip. | 1225 | 3818 | 3.4 | 10.1 | 2.96 |
| Offline $D^0$ cuts | 687 | 942 | 56.1 | 24.7 | 0.44 |
| Offline $\mu$ cuts | 617 | 755 | 89.8 | 80.1 | 0.89 |
| Offline $D^* \mu$ combo cuts | 185 | 244 | 30.0 | 32.3 | 1.08 |
| $BDT_{iso} < 0.15$ | 60 | 76 | 32.4 | 31.1 | 0.96 |
| Total eff. | - | - | 0.05 | 0.06 | 1.18 |
| Yield ratio x 0.99 | - | - | 60 | 76 | 1.25 |

| Cut | Run 1 | Run 2 | Run 1 $\epsilon$ | Run 2 $\epsilon$ | $\epsilon$ ratio |
|---|---|---|---|---|---|
| Total events | 116295 | 834002 | - | - | - |
| Trig. + Strip. | 98982 | 264911 | 85.1 | 31.8 | 0.37 |
| Offline $D^0$ cuts | 52192 | 67807 | 52.7 | 25.6 | 0.49 |
| Offline $\mu$ cuts | 46244 | 54339 | 88.6 | 80.1 | 0.90 |
| Offline $D^* \mu$ combo cuts | 43996 | 50911 | 95.1 | 93.7 | 0.98 |
| $BDT_{iso} < 0.15$ | 36563 | 40741 | 83.1 | 80.0 | 0.96 |
| Total eff. | - | - | 31.4 | 4.9 | 0.16 |
| Yield ratio x 1.82 | - | - | 36563 | 40741 | 2.03 |

| Cut | Run 1 | Run 2 | Run 1 $\epsilon$ | Run 2 $\epsilon$ | $\epsilon$ ratio |
|---|---|---|---|---|---|
| Total events | 342641 | 3066797 | - | - | - |
| Trig. + Strip. | 202992 | 3006318 | 59.2 | 98.0 | 1.65 |
| Offline $D^0$ cuts | 96838 | 568383 | 47.7 | 18.9 | 0.40 |
| Offline $\mu$ cuts | 90589 | 348105 | 93.5 | 61.2 | 0.65 |
| Offline $D^* \mu$ combo cuts | 73018 | 276307 | 80.6 | 79.4 | 0.98 |
| $BDT_{iso} < 0.15$ | 47178 | 172140 | 64.6 | 62.3 | 0.96 |
| Total eff. | - | - | 13.8 | 5.6 | 0.41 |
| Yield ratio x 0.35 | - | - | 47178 | 172140 | 1.29 |

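
The per-cut columns in these cutflows appear to follow a simple rule: each efficiency is the yield after a cut divided by the yield before it (in percent), and the ratio column is the Run 2 efficiency divided by the Run 1 efficiency. Below is a minimal sketch of that arithmetic using the yields from the signal truth-matched cutflow. The function names are hypothetical, and this is not the code in scripts/run_cutflows.py.

```python
def cutflow_efficiencies(yields):
    """Per-step efficiencies in percent: yield after each cut over yield before it."""
    return [100.0 * after / before for before, after in zip(yields, yields[1:])]


def efficiency_ratios(run1_yields, run2_yields):
    """Run 2 / Run 1 ratio of the per-step efficiencies."""
    eff1 = cutflow_efficiencies(run1_yields)
    eff2 = cutflow_efficiencies(run2_yields)
    return [e2 / e1 for e1, e2 in zip(eff1, eff2)]


# Yields from the signal truth-matched cutflow, starting at "Total events".
cuts = ["Signal truth-matching", "Trig. + Strip.", "Offline D0 cuts",
        "Offline mu cuts", "Offline D* mu combo cuts", "BDT_iso < 0.15"]
run1 = [118213, 4388, 151, 71, 66, 50, 43]
run2 = [126958, 4638, 397, 106, 77, 60, 39]

eff1 = cutflow_efficiencies(run1)
eff2 = cutflow_efficiencies(run2)
for cut, e1, e2, ratio in zip(cuts, eff1, eff2, efficiency_ratios(run1, run2)):
    print(f"{cut:28s} {e1:6.1f} {e2:6.1f} {ratio:6.2f}")
```

Because the ratios are computed from unrounded efficiencies, they can differ in the last digit from what one would get by dividing the rounded percentages in the tables.
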
manuelfs commented 3 years ago

The absolute efficiencies for B -> D*+ mu nu in the bare and large MC samples, calculated as `N_aftercut * eff_gen * eff_filter / (N_BKK * eff_BF)`, are

For the bare sample (MC ID 11874091), the D*+ mu nu BF is 64.63%, the 33.3% generator-level efficiency comes from here, and the filter efficiency should be 1.

With respect to the bare efficiencies, Run 1 large is 56% and Run 2 large 99%, so it looks like the former may be wrong.
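
The absolute-efficiency formula quoted in this comment can be written as a small helper. This is a sketch only: the function and argument names are hypothetical, and the yields in the usage example are placeholders rather than numbers from this thread. Only the 33.3% generator efficiency, the 64.63% BF, and the filter efficiency of 1 are taken from the comment.

```python
def absolute_efficiency(n_after_cut, n_bkk, eff_gen, eff_filter=1.0, eff_bf=1.0):
    """N_aftercut * eff_gen * eff_filter / (N_BKK * eff_BF), as a fraction."""
    if n_bkk <= 0 or eff_bf <= 0:
        raise ValueError("n_bkk and eff_bf must be positive")
    return n_after_cut * eff_gen * eff_filter / (n_bkk * eff_bf)


# Placeholder yields (NOT from the thread); efficiencies are the bare-sample values.
eff = absolute_efficiency(
    n_after_cut=100,      # hypothetical events surviving all cuts
    n_bkk=100_000,        # hypothetical number of events in the bookkeeping
    eff_gen=0.333,        # generator-level efficiency quoted for the bare sample
    eff_filter=1.0,       # no filter on the bare sample
    eff_bf=0.6463,        # D*+ mu nu fraction of the cocktail
)
print(f"absolute efficiency: {100 * eff:.4f}%")
```

Since the result is linear in `eff_filter`, any correction to the filter efficiency rescales the absolute efficiency by the same factor.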

manuelfs commented 3 years ago

A key difference between these samples is the FFs used for the MC generation

If the difference comes from the different FFs, recalculating the efficiencies after FF reweighting should improve the agreement, but perhaps not fully given that the generator efficiency for the Run 1 large sample is calculated based on the buggy FFs.

CoffeeIntoScience commented 3 years ago

For the Run 1 exclusive production (Sim09 version) I did the exercise of getting the filter efficiency as # of events coming out of the filter-stage jobs divided by # of events coming out of the generator-stage jobs and got 10.5% (up from the 7% expectation, but normalization-like modes seem to do better in the filter, for whatever reason, than, e.g., DD). This still only gets you up to 0.0369%, which is better but does not match Run 2 yet.

manuelfs commented 3 years ago

Phoebe detailed the process to find the total number of generated events before filtering, which can help determine the filter efficiency together with the BKK number

  1. Go to the Sim08 or Sim09 statistic tables on generator efficiency
  2. Look for the production you want of the given MC ID and get the ProdID (note, the number of accepted events in there may not include all the jobs, that's why the next steps are needed)
  3. Go to DIRAC
  4. On the leftmost panel, select Applications -> Data -> Transformation Monitor
  5. On the second leftmost panel, type the ProdID into "ProductionID(s):", click "Submit", and find the request number in gray just above the ProdID on the right panel
  6. On the leftmost panel, select Applications -> Data -> Production Request
  7. Type the request ID from step 5 into the "RequestID(s):" field and click "Submit"
  8. Click on the + to the left of the request, find the subrequest for your MC ID
  9. Right click on the request in the right panel and click on "Productions" to find the number of generated events (first row) and the number of events that pass the filter (second row). The number of BKK events (third row) indicates that a few events were lost during the merge step and should not be used to calculate the efficiency

For instance, for the Run 2 FullSim MD production, the ProdID is 121220. The Request ID is 74234, the subrequest 74252, the number of generated events 6087052 and events that pass the filter 1502907, resulting in a filter efficiency of 24.7%.
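
The last step of this procedure is a single division. Here is a minimal sketch using the Run 2 FullSim MD numbers quoted above; the function name is hypothetical.

```python
def filter_efficiency(n_generated, n_passed_filter):
    """Fraction of generated events that pass the filter stage."""
    if n_generated <= 0:
        raise ValueError("n_generated must be positive")
    return n_passed_filter / n_generated


# Run 2 FullSim MD production (ProdID 121220, subrequest 74252).
eff = filter_efficiency(n_generated=6087052, n_passed_filter=1502907)
print(f"filter efficiency: {100 * eff:.1f}%")  # prints "filter efficiency: 24.7%"
```

The BKK event count is deliberately not used here, since per step 9 it is lowered by events lost in the merge step.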

image

Phoebe did the same exercise for the Run 1 request and found a filter efficiency of 10.5% (for the Sim09 production), which makes the absolute efficiency 0.0369%, just 16% below the cocktail's 0.0442%.

The remaining 16% difference may be coming from the FFs. We can check whether this is plausible with two tests
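
As a quick check of the comparison with the cocktail, the relative deficit can be computed from the two rounded efficiencies quoted here. The helper name is hypothetical, and with these rounded inputs the result is about 16.5%, in line with the roughly 16% quoted.

```python
def relative_deficit(value, reference):
    """Fraction by which `value` falls below `reference`."""
    return 1.0 - value / reference


# Absolute efficiencies in percent, as quoted above (rounded values).
deficit = relative_deficit(value=0.0369, reference=0.0442)
print(f"Run 1 large sample is {100 * deficit:.1f}% below the cocktail")
```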

yipengsun commented 2 years ago

I think that, after fixing the chi2/ndof < 4 selection bug, the efficiencies of the bare SIGNAL COMPONENT and the FullSim SIGNAL are very similar:

Note: This is w/o applying any FF weights.

Bare signal component

| Cut | Run 1 | Run 2 | Run 1 $\epsilon$ | Run 2 $\epsilon$ | $\epsilon$ ratio |
|---|---|---|---|---|---|
| Total events | 118213 | 126958 | - | - | - |
| Signal truth-matching | 4388 | 4638 | 3.7 | 3.7 | 0.98 |
| Trig. + Strip. | 151 | 397 | 3.4 | 8.6 | 2.49 |
| Offline $D^0$ cuts | 120 | 207 | 79.5 | 52.1 | 0.66 |
| Offline $\mu$ cuts | 110 | 162 | 91.7 | 78.3 | 0.85 |
| Offline $D^* \mu$ combo cuts | 70 | 115 | 63.6 | 71.0 | 1.12 |
| $BDT_{iso} < 0.15$ | 61 | 91 | 87.1 | 79.1 | 0.91 |
| Total eff. | - | - | 0.05 | 0.07 | 1.39 |
| Yield ratio x 0.99 | 61 | 91 | - | - | 1.48 |

FullSim signal:

| Cut | Run 1 | Run 2 | Run 1 $\epsilon$ | Run 2 $\epsilon$ | $\epsilon$ ratio |
|---|---|---|---|---|---|
| Total events | 34794 | 150544 | - | - | - |
| Trig. + Strip. | 21534 | 45081 | 61.9 | 29.9 | 0.48 |
| Offline $D^0$ cuts | 17315 | 23191 | 80.4 | 51.4 | 0.64 |
| Offline $\mu$ cuts | 15935 | 18880 | 92.0 | 81.4 | 0.88 |
| Offline $D^* \mu$ combo cuts | 14992 | 17675 | 94.1 | 93.6 | 1.00 |
| $BDT_{iso} < 0.15$ | 12481 | 13961 | 83.3 | 79.0 | 0.95 |
| Total eff. | - | - | 35.9 | 9.3 | 0.26 |
| Yield ratio x 1.31 | 12481 | 13961 | - | - | 1.46 |

yipengsun commented 2 years ago

We should try to understand the 1.25 in normalization, but everything else looks fine for now.