Cutflow Structure - Githubissues

demarley commented 6 years ago

Comments/suggestions for cutflow implementation in the producer.

[x] Why are the cutflows different for data and MC? lines 373-380
[x] The cutflow should only be filled if the cut passes. Here I think you should move this fill statement to outside of the if statement (after return false) and change the bin to cutflow_bin+1. See other lines:
466,491,506,521,538,555,563, 612,622,633,646,655
[x] If it were me, I would split the event selection differently and only do a single cut for finding the correct leptons, and not split it by the ID/ISO/etc. Then you only need to do this once.
[x] I wouldn't say the b-tagging SF depends on the sample, but on the analysis. I think this analysis should use ttbar events with similar kinematics to estimate the efficiency in simulation.
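
A minimal sketch of the second suggestion above: a cutflow bin is filled only after the event has passed the corresponding cut. The cut list, the dictionary-based events and the plain-list histogram are hypothetical stand-ins, not the producer's actual code.

```python
def run_cutflow(event, cuts, cutflow, weight=1.0):
    """Fill bin 0 for every event, then bin i+1 once cut i has passed.

    `cuts` is a list of callables returning True/False (hypothetical),
    `cutflow` is a plain list standing in for a histogram.
    """
    cutflow[0] += weight                      # all events, before any selection
    for cutflow_bin, cut in enumerate(cuts):
        if not cut(event):
            return False                      # failed: nothing more is filled
        cutflow[cutflow_bin + 1] += weight    # filled only after passing
    return True


# Toy cuts on dictionary-based events (illustrative only)
cuts = [
    lambda ev: ev["n_leptons"] >= 2,
    lambda ev: ev["n_bjets"] >= 2,
]
cutflow = [0.0] * (len(cuts) + 1)
events = [
    {"n_leptons": 2, "n_bjets": 2},
    {"n_leptons": 2, "n_bjets": 1},
    {"n_leptons": 1, "n_bjets": 2},
]
passed = [run_cutflow(ev, cuts, cutflow) for ev in events]
```

Each bin then counts the events surviving up to and including that cut, so the ratio of neighbouring bins is the efficiency of a single cut.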

tahuang1991 commented 6 years ago

Hi Dan,

I was looking into your comments:

On Mar 26, 2018, at 8:15 PM, Dan Marley <notifications@github.com> wrote:

> Comments/suggestions for cutflow implementation in the producer (https://github.com/lpernie/HhhAnalysis/blob/CMSSW940/python/NanoAOD/HHbbWWProducer.py).
>
> Why are the cutflows different for data and MC? Lines 373-380 (https://github.com/lpernie/HhhAnalysis/blob/CMSSW940/python/NanoAOD/HHbbWWProducer.py#L373-L380)

At first I was thinking that, for MC, if an event does not pass the trigger selection, then I do not know which category it belongs to, but I can assign it a weight based on the branching ratio: 0.25 for MuMu, 0.5 for MuEl and 0.25 for ElEl. I think it should not matter much, as the event weight sum is derived from h_cutflows (https://github.com/lpernie/HhhAnalysis/blob/CMSSW940/python/NanoAOD/HHbbWWProducer.py#L381). For data, I just fill all the cutflow histograms. If you think keeping data and MC the same is better, then I can fix it. I am fine with either implementation.

> The cutflow should only be filled if the cut passes. Here (https://github.com/lpernie/HhhAnalysis/blob/CMSSW940/python/NanoAOD/HHbbWWProducer.py#L392) I think you should move this fill statement to outside of the if statement (after return false) and change the bin to cutflow_bin+1. See other lines: 466, 491, 506, 521, 538, 555, 563, 612, 622, 633, 646, 655

My logic here is slightly different from what we discussed on Skype. For example, if an event fails cut 6, then the program goes to the fillcutflow function (https://github.com/lpernie/HhhAnalysis/blob/CMSSW940/python/NanoAOD/HHbbWWProducer.py#L339) to fill the cutflow histogram in bins 1, 2, 3, 4 and 5 with event_reco_weight*sample_weight; bin 0 was already filled at the beginning with sample_weight. Here sample_weight includes the generator weight and PU reweighting, and event_reco_weight includes all the scale factors that were applied before the event failed the cut. In other words, I think this implementation should be logically the same as yours.

> If it were me, I would split the event selection differently and only do a single cut for finding the correct leptons, and not split it by the ID/ISO/etc. Then you only need to do this (https://github.com/lpernie/HhhAnalysis/blob/CMSSW940/python/NanoAOD/HHbbWWProducer.py#L572-L575) once.

For now, it is better to keep these selections split, since we can then see where the efficiency drops and whether the efficiency loss is expected or not. I would also prefer to merge some selections at the next stage, once we have a better understanding of the samples.

> I wouldn't say the b-tagging SF depends on the sample (https://github.com/lpernie/HhhAnalysis/blob/CMSSW940/python/NanoAOD/HHbbWWProducer.py#L668), but on the analysis. I think this analysis should use ttbar events with similar kinematics to estimate the efficiency in simulation.

Agreed, this description is not very accurate. My point here is that applying the b-tagging SF is not always just SF1*SF2. I just read this twiki: https://twiki.cern.ch/twiki/bin/view/CMSPublic/BtagRecommendation2010OpenData

To apply the b-tagging scale factor:

weight(2|2) = SF1*SF2 for events with two jets originating from b partons and both b-tagged

weight(2|1) = (1-SF1)*SF2 + SF1*(1-SF2) for events with only one of the two jets originating from a b parton but with both jets b-tagged

weight(2|0) = (1-SF1)*(1-SF2) for events with no jets originating from b partons but with both jets b-tagged

In short, the event weight from the b-tagging SF depends on whether the two jets truly come from b partons, which should be determined by gen matching. I have not implemented it yet, but I plan to do so soon. What do you think?
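
The three cases quoted above can be written as one function of the number of tagged jets matched to a b parton. This is only a transcription of the formulas as stated in this comment, not an endorsed recipe; the function name and its arguments are hypothetical.

```python
def btag_event_weight(sf1, sf2, n_true_b):
    """Event weight for two b-tagged jets, following the three quoted cases.

    sf1, sf2 : per-jet scale factors
    n_true_b : how many of the two tagged jets are matched to a b parton
               (assumed to come from gen matching)
    """
    if n_true_b == 2:
        return sf1 * sf2
    if n_true_b == 1:
        return (1 - sf1) * sf2 + sf1 * (1 - sf2)
    if n_true_b == 0:
        return (1 - sf1) * (1 - sf2)
    raise ValueError("n_true_b must be 0, 1 or 2")
```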

Tao
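
The fill-on-exit bookkeeping described in the reply above can be sketched as follows. This is a toy reconstruction under stated assumptions, not the producer's code: the function name, the list-based histogram and the threshold cuts are hypothetical, and it assumes an event passing every cut also fills all bins at the end.

```python
def fill_on_exit(event, cuts, cutflow, sample_weight=1.0, event_reco_weight=1.0):
    """Fill bin 0 up front, then bins 1..n_passed in one go when the event stops.

    `cuts` is a list of callables and `cutflow` a plain list used as a
    histogram (both hypothetical stand-ins).
    """
    cutflow[0] += sample_weight
    n_passed = 0
    for cut in cuts:
        if not cut(event):
            break
        n_passed += 1
    # One bin per cut passed, all filled with the weight known at this point
    for cutflow_bin in range(1, n_passed + 1):
        cutflow[cutflow_bin] += event_reco_weight * sample_weight
    return n_passed == len(cuts)


# Seven toy threshold cuts; an event with x = 5 passes the first five
# and fails the sixth, so bins 1 to 5 are filled.
cuts = [lambda ev, t=t: ev["x"] > t for t in range(7)]
cutflow = [0.0] * (len(cuts) + 1)
passed = fill_on_exit({"x": 5}, cuts, cutflow)
```

With a constant weight this gives the same bin contents as filling each bin right after its cut passes. In this sketch, if event_reco_weight changes as scale factors are applied, the earlier bins receive the weight accumulated up to the exit point.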


demarley commented 6 years ago

Thanks for the comments, Tao. It seems I didn't read through the producer closely enough, so thanks for correcting my understanding. I would suggest using method 1a) for the b-tagging rather than 1c), just because of this statement:

However, the b-tag-related variables for the leading two jets in events with 1 or 0 b tags might become ill-defined (e.g., events with 0 b tags will get contribution from events with 2 b tags which will result in ill-defined b-tag discriminator distributions for the leading two jets). Also, the $ p_\text{T} $ distribution of the b-tagged jet in 1-tag events becomes ill-defined, etc.

tahuang1991 commented 6 years ago

Hi Dan, I read the twiki page again and found that my previous understanding was also incorrect. The b-tagging efficiency is defined as eff_i = (jets with parton flavor i and b-tagged)/(all jets with parton flavor i). In other words, eff_i is the b-tagging efficiency for jets with parton flavor 5, and the mistagging rate for jets with c or light parton flavor. Since we always require two jets passing the medium b-tag, our probabilities are P(MC) = eff1*eff2 and P(Data) = SF1*eff1*SF2*eff2. Therefore the event weight, P(Data)/P(MC), is always SF1*SF2.
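
A small numerical check of this argument, with made-up efficiencies and scale factors chosen only for illustration (the function name is hypothetical): whatever the per-jet efficiencies are, they cancel in the ratio.

```python
def tagged_pair_weight(eff1, eff2, sf1, sf2):
    """Return (P(MC), P(Data), weight) for an event with both jets b-tagged."""
    p_mc = eff1 * eff2
    p_data = (sf1 * eff1) * (sf2 * eff2)
    return p_mc, p_data, p_data / p_mc


# Illustrative numbers only, not measured values
p_mc, p_data, weight = tagged_pair_weight(eff1=0.5, eff2=0.25, sf1=0.75, sf2=0.5)
```

Here the weight equals sf1 * sf2 regardless of eff1 and eff2, provided neither efficiency is zero.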

lpernie / HhhAnalysis

Cutflow Structure #14