vhbb / cmssw

CMS Offline Software
cms-sw.github.io/cmssw

vhbb#561 #564

Closed veelken closed 7 years ago

veelken commented 7 years ago

Dear all,

we would like to propose 3 modifications to vhbb.py and vhbbobj.py for the ttH, H->tautau analysis. We believe the modifications will not introduce any problem for anyone, except for an increase in filesize for some of the samples. Please let us know what you think.

Cheers,

Karl and Christian for ttH, H->tautau

vhbb.py

In the VHbbAnalyzer config:

  1. We would like to set the flag passall=True to disable the numJets >= 2 cut. The motivation for this modification is that in the ttH multilepton and ttH, H->tautau analyses we measure the electron charge misidentification rate in Z->ee events. Because the electron charge misidentification rate is very small, we need the full event statistics.
  2. We would like to add

     JetAna.lepSelCut = lambda lep : (abs(lep.pdgId()) == 11 and lep.relIso03 < 0.4) or (abs(lep.pdgId()) == 13 and lep.relIso04 < 0.4)

     so that jets don't get cleaned with respect to leptons that pass only the miniIso, but not the standard isolation cuts. In case the jet collection is cleaned with respect to leptons passing the miniIso, about 1% of b-jets get cleaned, which we would like to recover.

vhbbobj.py

We would like to replace the line

    NTupleVariable("eleooEmooP", lambda x : abs(1.0/x.ecalEnergy() - x.eSuperClusterOverP()/x.ecalEnergy()) if abs(x.pdgId())==11 and x.ecalEnergy()>0.0 else 9e9 , help="Electron 1/E - 1/P"),

by

    NTupleVariable("eleooEmooP", lambda x : (1.0/x.ecalEnergy() - x.eSuperClusterOverP()/x.ecalEnergy()) if abs(x.pdgId())==11 and x.ecalEnergy()>0.0 else 9e9 , help="Electron 1/E - 1/P"),

i.e. remove the abs function. In the ttH multilepton and ttH, H->tautau analyses events with negative 1/E - 1/P values get cut and the presence of the abs in the computation of eleooEmooP means that we cannot do that, causing a problem for us in terms of synchronization with other groups. In our opinion, it is safe to remove the abs function, as it can always be applied later on analysis level.
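To illustrate the point about the abs function, here is a minimal standalone sketch. The function name, its plain-number arguments and the toy electron values are hypothetical; only the formula and the 9e9 fallback are taken from the NTupleVariable line above.

```python
def ele_oo_e_moo_p(ecal_energy, e_sc_over_p, pdg_id, signed=True):
    """Electron 1/E - 1/P with the same 9e9 fallback as the ntuple variable.

    Hypothetical helper: the real variable is a lambda acting on a lepton object.
    """
    if abs(pdg_id) != 11 or ecal_energy <= 0.0:
        return 9e9
    value = 1.0 / ecal_energy - e_sc_over_p / ecal_energy
    return value if signed else abs(value)


# Toy electron with E/p > 1, so that 1/E - 1/P comes out negative.
signed_value = ele_oo_e_moo_p(50.0, 1.2, 11, signed=True)
unsigned_value = ele_oo_e_moo_p(50.0, 1.2, 11, signed=False)

# The signed value still allows a cut on the sign; abs() can be applied afterwards.
print(signed_value < 0.0, abs(signed_value) == unsigned_value)
```

Storing the signed value loses nothing: the unsigned quantity is recovered with abs() at analysis level, whereas the sign cannot be recovered once abs() has been applied at ntuple production.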

veelken commented 7 years ago

Hi Andrea, I merged the trigger changes from Michele with mine. Please merge this PR now. Thank you, Christian

arizzi commented 7 years ago

this PR sets passall=true, which is not ok. If there is a specific class of events you want to save, we have to let those pass; we cannot set passall=true for space reasons, especially when running on fully hadronic stuff

veelken commented 7 years ago

Hi Andrea,

the effect of the passall=true flag is that events with fewer than 2 jets no longer get cut. The nJets >= 2 cut is very loose for fully hadronic events. I expect that the nJets >= 2 cut mainly removes Z->ll and W->lnu events. Unfortunately, we do need an inclusive sample of Z->ee events for the purpose of estimating backgrounds arising from electron charge misidentification in the ttH, H->tautau analysis. I would prefer that we keep the event processing simple and not run different configs (with and without the nJets >= 2 cut) on different samples. If disk space is a problem, we can store the VHbb Ntuples in Tallinn if you like (we have enough disk space).

arizzi commented 7 years ago

you only need Zee? then why have passall=true? can't we just whitelist Vtype=1?

veelken commented 7 years ago

Hi Andrea,

The “problem” is that we need to compare Z->ee data (single and double electron datasets) with the sum of SM MC (DYJets, but also WJets, TTbar, diboson).

Maybe it is easier to discuss this on Skype… samples with tt+jets, ttH and H->bb will pass the nJets >= 2 cut anyway, so setting passall=true will not increase the size of those samples, will it?

Cheers,

Christian


degrutto commented 7 years ago

Hi Andrea, Christian,

sorry to chime in, but having passall=true is actually a recurring question/wish that I have heard from many people using the vhbb ntuples (many of whom never said so on github)

May I ask you, @veelken, whether you know, for example for the QCD bkgs, how many additional events would end up on tape?

I think this is the only sample that we are afraid will explode, @arizzi, true? Maybe also the data (MET dataset? BTagCSV?)

Michele

arizzi commented 7 years ago

michele, asking with no motivation is not going to go anywhere. What's the reason for passall=true from others?

jpata commented 7 years ago

ciao, I tend to agree with @arizzi that we have to be conservative about space for the following reasons:

  1. At T2_CH_CSCS, where I monitor the space, vhbb ntuples are among the largest individual user datasets, on the scale of tens of TB (for a subset of the full vhbb datasets).
  2. Inflating the file sizes with events used only rarely directly affects analysis downstream, where the mostly-useless events have to be filtered every time, with costs in IO, CPU, analysis job reliability.

Possibly for the special cases one can consider making a separate crab run, but we have to keep throwing away events at every possible stage.

degrutto commented 7 years ago

the most common comment is to make life easier for signal and bkg cut flow/efficiency studies (so I guess it is more relevant for signal samples)

btw it would be interesting to quantify for QCD

veelken commented 7 years ago

Hi Michele,

I haven’t checked the effect on QCD multijet MC. I would expect that most QCD events actually pass nJets >= 2 anyway, as the pT and eta cuts on the jets are rather loose (pT > 25 GeV && abs(eta) < 4.7).

Cheers,

Christian

P.S. If you want me to compare the size of the VHbb Ntuples with and without the nJets >= 2 cut for one QCD MC sample, I am happy to do it. Just let me know which sample I should run on.


degrutto commented 7 years ago

Hi, actually it is very easy to do just looking at the sample we have for V24 so

QCD HT200-300:

    root -l root://stormgf1.pi.infn.it:1094///store/user/arizzi/VHBBHeppyV24/QCD_HT200to300_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/VHBB_HEPPY_V24_QCD_HT200to300_TuneCUETP8M1_13TeV-madgraphMLM-Py8__spr16MAv2-puspr16_80r2as_2016_MAv2_v0-v1/160909_064004/0000/tree_4.root
    tree->GetEntries()/Count->Integral()
    (const double)8.96610860519146069e-01

but for the lower bin we consider

    root -l root://stormgf1.pi.infn.it:1094///store/user/arizzi/VHBBHeppyV24/QCD_HT100to200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/VHBB_HEPPY_V24_QCD_HT100to200_TuneCUETP8M1_13TeV-madgraphMLM-Py8__spr16MAv2-puspr16_80r2as_2016_MAv2_v0-v1/160909_063817/0000/tree_10.root
    root [1] tree->GetEntries()/Count->Integral()
    (const double)8.72890038750593900e-02

so this one will be problematic
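As a rough sketch of what these two ratios would imply, the snippet below turns each kept fraction into an upper bound on the growth in stored events if nothing were cut. It assumes that Count->Integral() is the number of processed events and that the ratio reflects every selection applied in the skim, not only the jet requirement; the dictionary keys and the helper name are made up for illustration.

```python
# Kept fractions quoted above: tree->GetEntries() / Count->Integral()
kept_fraction = {
    "QCD_HT200to300": 8.96610860519146069e-01,
    "QCD_HT100to200": 8.72890038750593900e-02,
}


def inflation_factor(fraction_kept):
    """Upper bound on the growth in stored events if every processed event were kept."""
    return 1.0 / fraction_kept


for sample, fraction in kept_fraction.items():
    factor = inflation_factor(fraction)
    print(f"{sample}: keeps {fraction:.1%} now, up to {factor:.1f}x as many events without cuts")
```

Under these assumptions the HT200to300 bin would grow by roughly 10%, while the HT100to200 bin would grow by more than a factor of 11.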

veelken commented 7 years ago

Hi Michele,

thank you for the numbers. I didn't know it is that easy to get them! How shall we proceed now ? Andrea, would it be ok with you to merge PR #561 and then we submit the crab jobs for the QCD MC samples with passall set to false ?

Cheers,

Christian

arizzi commented 7 years ago

no, we are not going to do it. For the DY sample (i.e. a big one) the reduction is 0.13. As Joosep explained, we do not waste resources because people don't like to get the ratios out of the histos. We cannot pay a factor of 5-10x on the ntuple sizes. We currently distribute 2-3 copies of the ntuples, so having space at a given T2 for one version of the ntuple is not the point (we need several sites, plus we need one storing the whole history of VHbb ntuples for reproducibility, reanalysis for combination, etc.). Let me add that our event size is already too big and that, considering that many of the variables we compute are jet based, I see no point in storing events without any jet.

PS: QCD events are not passing for other reasons (not the 2 jets requirements) and the passall will let those go through too.

veelken commented 7 years ago

Hi,

I have changed passall=False so that PR #561 can be merged. Would it be an option that we build two versions of vhbbHeppy for the ReReco data and MC , one with passall=True and one with passall=False, so that a few VHbb Ntuples could centrally be produced with passall set to true ? The alternative is of course that people working on ttH, H->tautau organize the production of samples with passall=True by themselves. What do you think ?

Cheers,

Christian

arizzi commented 7 years ago

Can you perhaps clarify the details of the selection you would apply on the passall=true events? Which vtype? Any ll mass cut?

Ciao Andrea


veelken commented 7 years ago

Hi Andrea,

we don't apply a cut on vtype in the ttH, H->tautau analysis. We do apply a cut 60 < mll < 120 GeV in some of our control regions/auxiliary measurements, but not in all. The best option in my opinion would still be to set passall=true based on the sample name.

I noticed that PR #561 is not merged yet. Can you please merge it now ? (passall is set to false by default now) What is the status/plan/timescale for the VHbb Ntuple production for the ReReco data and MC ?

Cheers,

Christian

arizzi commented 7 years ago

what does "mll" mean for VTypes where there are not two leptons selected?!?

veelken commented 7 years ago

Hi Andrea,

we compute mll by looping over the selLeptons branch, applying some lepton selection criteria and then adding the lepton four-vectors for pairs of leptons that pass the lepton selection criteria.
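The procedure described above can be sketched roughly as follows. This is a pure-Python illustration, not the analysis code: leptons are represented as hypothetical (pt, eta, phi, mass) tuples, the pT > 10 GeV selection is a placeholder, and only the 60 < mll < 120 GeV window is taken from this thread.

```python
import math
from itertools import combinations


def four_vector(pt, eta, phi, mass):
    """Return (E, px, py, pz) for a particle given in collider coordinates."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz


def invariant_mass(lep1, lep2):
    """Invariant mass of the sum of two lepton four-vectors."""
    e, px, py, pz = (a + b for a, b in zip(four_vector(*lep1), four_vector(*lep2)))
    return math.sqrt(max(e**2 - px**2 - py**2 - pz**2, 0.0))


def dilepton_masses(leptons, passes_selection):
    """mll for every pair of leptons that pass the selection."""
    selected = [lep for lep in leptons if passes_selection(lep)]
    return [invariant_mass(a, b) for a, b in combinations(selected, 2)]


# Toy event: two back-to-back leptons plus one soft lepton that fails the placeholder cut.
leptons = [(45.6, 0.0, 0.0, 0.0), (45.6, 0.0, math.pi, 0.0), (5.0, 1.0, 1.0, 0.0)]
masses = dilepton_masses(leptons, lambda lep: lep[0] > 10.0)
in_z_window = [m for m in masses if 60.0 < m < 120.0]
print(masses, in_z_window)
```

Because the pairs are built from whatever leptons pass the selection, this definition of mll does not depend on the vtype assigned to the event.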

As I mentioned before, I think it is best not to use vtype and mll, but set passall=true based on the sample name.
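A minimal sketch of what choosing passall from the sample name could look like is given below. Everything in it is hypothetical: the pattern list, the function name and the example sample names are illustrative only and are not part of vhbb.py.

```python
import re

# Hypothetical patterns for samples that would be processed without the skim.
PASSALL_PATTERNS = [
    r"^DYJets",
    r"^WJets",
    r"^TT",
    r"^(WW|WZ|ZZ)",
    r"^SingleElectron",
    r"^DoubleEG",
]


def passall_for(sample_name, patterns=PASSALL_PATTERNS):
    """Return True if the sample name matches one of the passall patterns."""
    return any(re.search(pattern, sample_name) for pattern in patterns)


for name in ["DYJetsToLL_M-50", "QCD_HT100to200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8"]:
    print(name, passall_for(name))
```

With a lookup of this kind a single config could be kept, at the cost of maintaining the pattern list alongside the list of samples.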

Cheers,

Christian

arizzi commented 7 years ago

well, "it is better" is a relative concept; for sure it is not better for whoever has to babysit tens of thousands of jobs and deal with the added complication of different settings. I'm not sure why you do not use vtype to classify Zee events of a control region. I mean, VHbb ntuples are based on the vtype to set up cuts and fill variables, so just asking "remove any selection you do because we do not like the vtype" is not helping here. At some point we can decide to have different production campaigns if different analyses have different needs. The ttH bb guys are already running their own campaign with additional MEM stuff, so it could be better to prepare a ttHtt.py and ttHtt-data.py config that you run for ttHtt with passall=true, and we clean up vhbb.py from what we do not need in H->bb related analyses. We should understand what the cost of the different choices is (cpu, disk space, people time)

veelken commented 7 years ago

Hi Andrea,

sorry for not being more clear about it: Zee is only one of the control regions we need for ttH, H->tautau. We need other control regions to measure tight/loose lepton ratios and these control regions use events with single leptons (these control regions are dominated by QCD; we don't need QCD MC for these measurements though, only data). The selection of the control regions to measure tight/loose lepton ratios is work in progress and I would very much prefer to avoid hardcoding the cuts at Ntuple production time - at least for this round of the VHbb Ntuple production.

Cheers,

Christian

arizzi commented 7 years ago

this doesn't help. On data, passall=false is even more important because that's where we get most of the reduction. For VH channels the ntuple production always assumes that the lepton selection has already been defined/optimized in a dedicated sample production (e.g. with passall=true) or in previous studies. This is needed because we have to avoid sharing events between analyses in order to keep them stat-independent.

veelken commented 7 years ago

Hi Andrea,

yes, the motivation for this change is to not clean the jets wrt leptons that pass the miniIsolation, but fail the standard isolation. As we studied with Lorenzo, the effect of cleaning the jets wrt leptons passing miniIsolation or standard isolation is small, on the level of 1%. The main motivation for restoring the "old" jet cleaning behavior is to avoid differences in synchronization with other groups.
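The effect of the proposed lepSelCut can be illustrated with a small toy. The ToyLepton class and its isolation values are hypothetical stand-ins for the real lepton objects; only the lambda itself is taken from the PR description.

```python
class ToyLepton:
    """Stand-in for a lepton object; only the members used by the cut are modelled."""

    def __init__(self, pdg_id, relIso03, relIso04):
        self._pdg_id = pdg_id
        self.relIso03 = relIso03
        self.relIso04 = relIso04

    def pdgId(self):
        return self._pdg_id


# Selection proposed in the PR description: standard isolation for electrons and muons.
lepSelCut = lambda lep: ((abs(lep.pdgId()) == 11 and lep.relIso03 < 0.4) or
                         (abs(lep.pdgId()) == 13 and lep.relIso04 < 0.4))

leptons = [
    ToyLepton(11, 0.10, 0.12),   # electron passing standard isolation
    ToyLepton(-13, 0.90, 0.95),  # muon failing standard isolation (it might still pass miniIso)
    ToyLepton(13, 0.35, 0.38),   # muon passing standard isolation
]

# Only these leptons would be used to clean the jet collection.
cleaning_leptons = [lep for lep in leptons if lepSelCut(lep)]
print([lep.pdgId() for lep in cleaning_leptons])
```

In this toy the muon that fails the standard isolation is not used for cleaning, so a jet overlapping with it would stay in the jet collection.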