Data/MC comparisons with Phoebe ntuples

@afernez As a first step to understand Run 2 data and MC, we will produce a series of data/MC plots using Phoebe's ntuples. Given their size, we will focus first on her 15 GB D*+ sample

lhcb-ntuples-gen/ntuples/ref-rdx-run1/Dst-mix/Dst--20_07_02--mix--all--2011-2012--md-mu--phoebe.root

The first step is to go through redohistos_Dsp.C and find the cuts to select each component. This is a script that she uses to produce the templates (histograms) that provides the shapes used to extract the yield of each component in the signal fit. For instance, h_sigtau is the histogram with the B → D*+ τ ν signal, so the cuts used to fill it define which events she considers signal. the _v1p and others and ±1 σ variations, so we don't care about them.

I wrote a portion of the script that will be used for this, src/rdx/plot_phoebe_datamc.cxx, with examples of functions to select data, signal, and normalization (with eventType getType and NamedFunc event_type), and give them different weights (with NamedFunc weight). You'll need to expand on the enum values, change names appropriately, add the cuts to select each component and sample, and plot m_nu1 (mmiss^2), q2, El and any other interesting variables you may like.

The variables used in redohistos_Dsp.C may be branches of the tree (those with a fChain→SetBranchAddress in redoHistos_Dst.h). In that case you can access them in plot_scripts in strings (eg "m_nu1>5"), with their name directly (m_nu1 > 5.) or inside NamedFunc definitions (b.m_nu1() > 5.). Note that after loading the value of the branch into the variable, the variable may be modified, so you should check for that. For instance, the weight is modified by if(mcWeight > 5) { mcWeight=5;}. If the variable is not a branch you'll have to calculate it.

The second step is find the cuts for each of the samples defined in Table 15 of the ANA note. ISO is the signal region, and DD, 1OS, D**, and 2OS are control samples with no signal. These samples are selected by redoHistos_Dst.C based on the option variable, see here.

To check that things are going in the right direction you can try to reproduce the data distribution (choose the same binning) of some of the fit projections in chapter 8 of the ANA note.

Feel free to ask for help and any clarification in this GitHub issue or in Slack. Reading other people's code is not easy! Variable names are not always obvious (for instance m_nu1 is the missing mass...), and we'll probably have to compile a set of questions to ask Phoebe some time.

Good luck!

In the meeting 7-28-20, I was directed to table 18 of the ANA note for a list of all MC decay modes for the D⁺ mu sample, but I still have a question for the most sensible way to group background MC. In this image I've color-coded which modes I'm currently grouping together (and included the names for the groups) Instead of this grouping, I'm wondering if it's better to make the plots less complicated and just define "dss" and "dd" processes rather than specifying whether the lepton coming from the B is a muon or a tau. Additionally, related to the picture, I have two more questions: First, probably a question for Phoebe, in her code, she seems to have a histogram for a strange B -> D(D->mu nu X')X decay mode that isn't listed in table 18 of the ANA note. Should I count this mode in my plots? Second, I'm thinking I shouldn't include the "comb", "misID", and "Combinatorial D" as an extra process for the plots, since Phoebe's code (eg. where she fills the misID hist) seems to be claiming that these are data. I could just be misunderstanding this (of course there is combinatorial background in the data, etc), but I was figuring these entries in table 18 referred to something the MC tried to simulate. If I'm not counting these as MC processes, too, should I add them to the "data" event type?

As usual, excellent questions

Instead of this grouping, I'm wondering if it's better to make the plots less complicated and just define "dss" and "dd" processes rather than specifying whether the lepton coming from the B is a muon or a tau.

Sounds very reasonable. At some point we may need access to all individual components (for instance, to validated their shapes against the plots in the note), but you certainly can do any groupings you prefer for these plots. I agree that too many components would make the plot hard to read.

By the way, you have a couple of Bs meson processes lumped with the dssmu. Those are not D**, so you should put them with the dd, or make their category (they're probably small though).

First, probably a question for Phoebe, in her code, she seems to have a histogram for a strange B -> D(D->mu nu X')X decay mode that isn't listed in table 18 of the ANA note. Should I count this mode in my plots?

I think all the Bs mesons can go in the dd. We can ask her later but do include them in your plots.

Second, I'm thinking I shouldn't include the "comb", "misID", and "Combinatorial D*" as an extra process for the plots, since Phoebe's code (eg. where she fills the misID hist) seems to be claiming that these are data. I could just be misunderstanding this (of course there is combinatorial background in the data, etc), but I was figuring these entries in table 18 referred to something the MC tried to simulate. If I'm not counting these as MC processes, too, should I add them to the "data" event type?

The entries in the table only refer to the shapes used in the fit. The fit then extracts the normalization under each shape (and a bunch of shape parameters that define our uncertainty on the shapes). These shapes are mostly taken from MC, but those components you mention are taken from data indeed.

Phoebe claimed that this ntuple had everything, and it definitely has the normal data, so it may also have the "wrong sign" and "additional track" data control samples used for "comb", "misID", and "Combinatorial D*". The weights of these events are probably quite different than the weights of the components taken from MC, but these should also be included as Process::Type::background.

(I apologize for the long post that is an accumulation of my questions; I've waited to ask them in hope I'd get a better understanding of the code over time and there would be less/easier questions...) Overall, my persistent issue has been that Phoebe's code seems to have a lot more going on than feels like would make sense to copy over, but it's tough to figure out where to draw the line. I've been trying to make as much sense as I can of differences between her code and the ANA note specifications (usually in the form of added conditionals in the code), things in her code that seem inconsistent with one another (that likely arise because she wanted to change one set of histograms but didn't change the code for related histograms), when it's okay for me to ignore things she does using variables I don't have access to (eg. lines 1137-1142 she resets the values of mm2, El, and q2 for all MC using variables with a suffix smearres, which I don't have access to), etc. But, it's been tough for me to keep everything straight or come up with general questions I could use for guidance (instead, I encounter many very specific questions, essentially line-by-line, which wouldn't be very useful to ask, because they would only solve minor confusions). My thought process has been that it makes the most sense for me to try to investigate each of these specific questions on my own, but I think my conclusions are likely to be flawed.

That all being said, I've tried to come up with questions that are as "general" as possible, but ultimately some are fairly specific:

If Phoebe adds more conditions when specifying the control samples (eg. this extra condition for the DD sample compared to table 15 in ANA note), should I ignore it and just use the ANA note specifications?
Phoebe doesn't seem to have code for specifying a D* control sample; I've copied over the cuts in table 15, but I'm unsure how I can calculate M_D⁺pi₁... I'm not sure what pion pi₁ is (w.r.t. the analysis or the code), and in general I only have access to particles' |p| and |p_T|, it seems, which isn't enough to add together to find the mass of the combined D*⁺pi₁
There are a number of places where Phoebe seems to be cutting events entirely (not just to select control samples or components), for example, here, where she cuts the event if the B mass is greater than 5280 MeV, and these cuts from lines 1269-1281. Should I be ignoring these overall cuts, just focusing on what you've asked me to look at (how she's selecting components and control samples)?
Calculating the weights is, I think, the most complex part of this. I have tried to mimic exactly what Phoebe is doing, but am inevitably running into problems. One things that affects mcweight that I'm unsure about: she seems to be doing multiple "rounds", specified by the option variable (that I don't believe I have access to?), which I don't understand the purpose of. There are (many) other calculations being done, most of which I think I can trace through her code for, but if I copy all of these calculations, it's likely going to eat up a lot of my time (and I likely won't finish by this Tuesday, as I had hoped to). Overall, do you have any suggestions, or do you think I really ought to just do every calculation in Phoebe's code related to the weights (mcweight, used for MC components, and totweight, used for misID and comb)?
For the combinatorial D* background, I'm guessing I should be looking at Phoebe's h_comb_usb or h_SB (I'm not sure where the "sb" comes from, but it's referenced in table 18, and based on other context clues, I narrowed down my guesses to this), but I'm not sure which. I follow h_comb for general combinatoric background and h_misID for misID; I am naturally more confident in these guesses.
For the D components, it seems like if I include semitauonic modes, the histograms are badly skewed (there are way too many D tau events, and they mostly fall in a single bin). I'll commit a snapshot of my code if you want to compile it to see for yourself. As an example, for the DD control sample Originally, I thought this may be a problem with the weights, since I hadn't done much to specify weights yet. But, unless that drastically changes something for only the D** tau, it is probably actually me misunderstanding how I should be selecting these components. Also confusingly, it seems like the spike is a darker red than these components should be colored; I'm not sure how this would be happening.
Further, for the D semimuonic components, from what I can tell, Phoebe makes two plots for each D, in the case they decay to D* pi pi, and in the case they don't. I'm assuming I should just be lumping these two cases together (essentially, ignoring lines like this). Additionally, she makes a histogram h_D2Smu which, I think, corresponds to the line B -> D*[->D pi pi]mu nu in table 18, but I was thinking that this component is just the general form of the D mu decays (but specifying D->D* pi pi), and so I wasn't counting it separately to avoid what I figured would be double counting; is this the right idea? Related to these components, too, do you have an idea what ishigher and mm_mom might be?

That is an excellent and well thought out set of questions. Dealing with code like this is hard, because indeed there a lot going on that we may not care about, and it is hard to know what to care about and what not. But you are drawing that line beautifully well.

Just one note about finding where things are. It is always possible that the code is not self-consistent, but in general the branch definitions should be found in AddB.C, and the variables in redoHistos_Dst.h. For instance, Elsmearres is actually the Elsmear branch that you do have access to. For some reason, Phoebe did not name the variable the same as the branch in this case.

The ANA note in principle is the gold standard, but we shouldn't ignore these cuts. This cut is definitely important enough to add to a list of questions that we'll ask Phoebe
Another good question for Phoebe
These look like they are part of the baseline cuts we saw in Chapter 3, so they should also be included. The m(B) < 5200 MeV is in Table 7, and muVeto is defined in AddB.C as if (piminus0_isMuon || Kplus_isMuon /*|| piminus_isMuon*/) muVeto=true;, so it is the !isMuon in Table 6.
I think you've handled this perfectly. You tried to understand the overall working of the weights, and the right time you asked the question when you thought it would not be a good use of your time. The full treatment of the weights is something that will be different in Run 1 and Run 2, and in our ntuples we'll be missing some inputs (like the PID weights) for a while. Some of the weights will come from the specific procedure described in their Chapter 4. You can just implement the simpler things (those that do not depend on option or histograms such as NewMCWeights/Histos/firstround.root), and we'll ask Phoebe more about this.
SB typically stands for sideband. In this case, it should refer to the sidebands of the Δm distribution for the D (as opposed to the mass peak), where there should only exist combinatorial D candidates. And for instance if you see USB, it typically would refer to the upper sideband of the B mass, > 5280 MeV, where the B combinatorial background lives. So I think you got it all exactly right 🙂
The transparent gray shade refers to the statistical uncertainty of the histogram, so that bin is not a different shade of red, it is simply red with the gray shade on top of it. What your plot is suggesting is that they had very few stats for the D**τν component, so their BF weights (probably mcWeight) are very large. Given that the uncertainly is as large as the spike, that looks like a single event with a very large weight. For now, I would just comment out this component.
I think the D**τν should be separated into 1 and 2 pion components just like the D**μν components.
1. The 2S histogram I was not aware of, but it looks like it contains higher charm mass 2S states. Since we call D** everything above the D*, these would be also D**, but higher mass than the 4 states we know. mm_mom is the Δm for the mother of the D and the D. ishigher is also defined in AddB.C and seems to indeed pick up high-mass charm states. I have to say, I had never heard of IDs 100413 or 100423 (they're not even included in the PDG's MC numbering scheme), but you can extrapolate that they are 2S charm states since they end with 413 (D+) and 423 (D0), and start with 100 like the ηc(2S).

This 2013 measurement from LHCb provides a useful illustration of the various charmed states, including the 2S states. When we say D** we typically refer to the four 1P states

These 2018 paper by Bernlochner, Ligeti, and Robinson is also useful. This table also provides the masses and widths for the four 1P states. And you can see here why the D1 and D*2 are some times referred to as the narrow states

Also, in terms of notation D** used to refer any charmed meson heavier than the D*, but I think these days it often refers just to these four states.

Another set of useful numbers to have in mind are the tau/mu ratios for the D**

I'll record a list of questions for Phoebe- that I'm guessing will be added to in the future- here, for now.

As a note, these questions are based on my current understanding of the analysis-- having not read all of the ANA note, it's likely that some/all of the questions could be answered in the text. Also, I haven't implemented all the weight calculations that Phoebe does (which amounts to ~1000 lines of code in the redoHistos_Dst script), and I'm sure I'll inevitably have questions about this in the future.

First, testing Phoebe's memory: is the script, as far as you know, exactly what was used for "final" analysis (ie. it should be exactly implementing what's in the ANA note), or are there things that you changed to get various other/different plots? If the script was modified and what I'm looking at is a sort of snapshot in time, I imagine it would be nearly impossible to remember everything that was changed, but I just want to get a feeling for what I'm looking at.
Now, some questions about what's included for the D components. My understanding is that there are four 1P D states- D0, D1, D1', D2 (but D0 only decays to D, not D). In the code, a histogram (h_D2Smu) is created for 2S D states: should these be included along with the other 1P D components (when comparing the shapes of the sum of MC components to data)?
For the heavy D (ie. D -> D* pi pi), in both the ANA note and the code, I only see semimuonic modes included, no semitauonic. Is this right? Is it just because these components would be negligible because the tau is too massive?
A question about defining control samples: the ANA note indicates there are some p and pT cuts on additional tracks found by the isolation BDT, but there seem to be more of these cuts in the code than in table 15 (pg 49). Also, there seem to be various other cuts in the code that aren't mentioned in the table. Should I just follow the code, or should I ignore extra cuts and follow the ANA note? List of additional cuts- DD: Max(iso_P(iso_PT > 150), iso_P2(iso_PT2 > 150)(iso_BDT2 > -1.1), iso_P3(iso_PT3 > 150)(iso_BDT3 > -1.1))> 5e3 2OS: iso_CHARGE < 100 && Max(iso_P(iso_PT > 150), iso_P2(iso_PT2 > 150)) > 5e3 1OS: (iso_CHARGE)(Dst_ID) < 0 D**: I'm assuming the mass cut in line 1348/1349 is meant to select D sample (subset of 1OS, cut around narrow D states), but the uncommented mass cut there doesn't match table 15 (commented out cut does).
Maybe I'm being fussy, but I'm not sure if there's a reason behind a slight non-uniformity when selecting some D* components: comparing the selection of the (heavy) D2 component to the (heavy) D1', D2* doesn't have a requirement Dststtype==415 while D1' does have a requirement Dststtype==20413. Is there, in fact, a reason the conditions aren't symmetric?

We can consider this done for now.

umd-lhcb / plot_scripts

Data/MC comparisons with Phoebe ntuples #4