Naming convention for branches in step 2 ntuples

manuelfs commented 3 years ago

I think we may have started an issue to discuss the branch naming convention, but I can't find it, nor is there anything in the wiki, so let us discuss it here.

The more I work with the step 2 ntuples, the more I love them. All of them having a TTree named tree is incredibly convenient, and most of the names that @yipengsun chose are so easy to type and remember. I just committed a few changes based on the rules below to make them a wee bit shorter and clearer:

- Particle properties should start with the particle name (eg, even though this one is not saved, other_trks → trks_other)
- I find branches with flag_ a bit confusing (I often think of flagged events as bad ones) and long, so I propose:
  - Flags that indicate quality get _ok at the end (eg, flag_d0 → d0_ok)
  - Skims and other non-obvious booleans start with is (eg, is_2os)
  - Obvious booleans do not need anything else (eg, flag_l0 → l0)
- Added event and run, which are easier to type. I left eventNumber and runNumber because the code needed them
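The renaming rules above can be sketched as a small Python helper. This is only an illustration, not code from the repository: the function name `rename_branch`, the two lookup sets, and the old name `flag_2os` are assumptions, and which flags count as "quality" versus "obvious" still has to be decided by hand.

```python
# Hypothetical classification of old flag_* branches; these sets are
# assumptions for illustration, not taken from the repository.
QUALITY_FLAGS = {"d0"}   # quality flags: flag_d0 -> d0_ok
OBVIOUS_FLAGS = {"l0"}   # obvious booleans: flag_l0 -> l0


def rename_branch(name: str) -> str:
    """Apply the proposed conventions to a single old branch name."""
    if name.startswith("other_"):
        # Particle name goes first: other_trks -> trks_other
        particle = name[len("other_"):]
        return f"{particle}_other"
    if name.startswith("flag_"):
        stem = name[len("flag_"):]
        if stem in QUALITY_FLAGS:
            return f"{stem}_ok"
        if stem in OBVIOUS_FLAGS:
            return stem
        # Skims and other non-obvious booleans start with "is"
        return f"is_{stem}"
    return name  # nothing to change


if __name__ == "__main__":
    # flag_2os is an assumed old name for the is_2os skim flag
    for old in ("other_trks", "flag_d0", "flag_l0", "flag_2os", "mu_pt"):
        print(f"{old} -> {rename_branch(old)}")
```

Keeping the classification in explicit sets makes the judgment calls visible in one place instead of hiding them in pattern matching.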

If these conventions are fine, we should add them to the wiki.

By the way, starting with the name of the particle has helped me in the past to find which group of branches contributes most to the size of the tree, by grouping all branches that share the same prefix before the first _ (eg, d0_). For instance, the output of the scripts/ntp_disk_usage.C script I just committed, run on the step 2 ntuples, looks like this:

root [1] ntp_disk_usage("Dst--21_08_21--std--data--2016--md.root")

Dst--21_08_21--std--data--2016--md.root

Tree "tree" occupies 1,320,406,362 bytes and has 3,066,825.000000 entries (0.42 kB/event):

            Branch name    Byte/ev   Frac. [%]  Cumulative            Branch group name    Byte/ev   Frac. [%]  Cumulative
==========================================================      ==========================================================
                  Total      430.5      100.00           -                        Total      430.5      100.00           -
              mu_pid_mu        7.7        1.79        1.79                           d0       60.3       13.99       13.99
            spi_ip_chi2        7.7        1.79        3.58                           mu       56.6       13.16       27.15
               k_pid_mu        7.7        1.79        5.37                           pi       47.7       11.08       38.23
              pi_pid_mu        7.7        1.78        7.15                            k       47.6       11.05       49.28
     b0_endvtx_chi2ndof        7.7        1.78        8.93                          spi       47.2       10.97       60.25
                  mu_pt        7.7        1.78       10.72                          dst       45.2       10.50       70.75
                  pi_pt        7.7        1.78       12.50                          iso       41.7        9.69       80.44
                    mm2        7.7        1.78       14.28                           b0       38.3        8.89       89.33
                   k_pt        7.7        1.78       16.07                          mm2        7.7        1.78       91.12
    dst_endvtx_chi2ndof        7.7        1.78       17.85                           el        7.6        1.77       92.89
            b0_fd_trans        7.7        1.78       19.63                          prs        7.6        1.76       94.65
             d0_ip_chi2        7.7        1.78       21.41                           q2        7.6        1.75       96.41
              spi_pid_e        7.7        1.78       23.19                  eventNumber        5.1        1.18       97.59
     d0_endvtx_chi2ndof        7.7        1.78       24.97                        event        5.1        1.18       98.77
               pi_pid_e        7.6        1.77       26.74                      GpsTime        4.6        1.08       99.84
                     el        7.6        1.77       28.52                           is        0.6        0.13       99.98
                k_pid_e        7.6        1.77       30.29                    runNumber        0.0        0.01       99.99
                  d0_ip        7.6        1.77       32.06                          run        0.0        0.01      100.00
             d0_fd_chi2        7.6        1.77       33.83                         hlt2        0.0        0.00      100.00
               mu_pid_e        7.6        1.77       35.60                         hlt1        0.0        0.00      100.00
              k_ip_chi2        7.6        1.77       37.37                           l0        0.0        0.00      100.00
             pi_ip_chi2        7.6        1.77       39.14      
             mu_ip_chi2        7.6        1.77       40.91
...

indicating 14% of the ntuple is taken by the d0_ variables.
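The grouping itself is simple to reproduce. Below is a minimal Python sketch of the idea, not a port of ntp_disk_usage.C: the function name `group_by_prefix` is hypothetical, and the per-branch sizes are a handful of values copied from the listing above rather than read from a file.

```python
from collections import defaultdict


def group_by_prefix(bytes_per_event):
    """Sum per-branch sizes by the text before the first underscore."""
    groups = defaultdict(float)
    for branch, size in bytes_per_event.items():
        # "d0_ip_chi2" -> "d0"; a name without "_" (e.g. "mm2") is its own group
        prefix = branch.split("_", 1)[0]
        groups[prefix] += size
    # Largest group first, as in the right-hand table above
    return dict(sorted(groups.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # A few per-branch byte/event values copied from the listing above
    sizes = {
        "mu_pid_mu": 7.7,
        "mu_pt": 7.7,
        "d0_ip_chi2": 7.7,
        "d0_ip": 7.6,
        "d0_fd_chi2": 7.6,
        "mm2": 7.7,
    }
    for prefix, size in group_by_prefix(sizes).items():
        print(f"{prefix:>5} {size:6.1f}")
```

This only works cleanly when every particle property starts with the particle name, which is one more argument for the first rule.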

yipengsun commented 3 years ago

I think these suggestions make sense. I'll see if the run and event variables can be defined inside rename instead of calculation (because for different ntuples they may have different types). Everything else seems fine by me.

yipengsun commented 3 years ago

Also, we can make ntp_disk_usage.C directly compilable and put it in a src folder (to store all compilable C++ code). We can also write a rule to compile the files there and store the compiled executables in the scripts folder, with a .exe suffix, so that all directly usable scripts are still in the scripts folder.

yipengsun commented 3 years ago

I plan to add the naming conventions to the Nomenclature section of the wiki. @manuelfs, what do you think?

Also, it's probably a good idea to keep this open until the very end to discuss all naming conventions. If so, I'll pin this issue.

yipengsun commented 3 years ago

@manuelfs FYI, now a branch can be kept and renamed at the same time. I've already updated the run 2 w/ run 1 cut YAML to reflect this change.

yipengsun commented 2 years ago

@manuelfs, I've updated the cut flags and the cut documentation:

Flags: https://github.com/umd-lhcb/lhcb-ntuples-gen/blob/master/postprocess/rdx-run2/rdx-run2_oldcut.yml
Doc: https://github.com/umd-lhcb/rdx-run2-analysis/blob/master/docs/cuts/cut_flag_review.md

Old cuts: is_normal & d_mass_window_ok & is_<skim>

From ntuple: /home/syp/downloads/sample_ntuples/D0--22_02_24--std--data--2016--md--000--old.root
   ISO:        5,822
   1OS:          537
   2OS:          146
    DD:          935

New cuts: mu_ubdt_ok & is_<skim>

From ntuple: /home/syp/downloads/sample_ntuples/D0--22_02_24--std--data--2016--md--000--new.root
   ISO:        5,822
   1OS:          537
   2OS:          146
    DD:          935

New cuts, without UBDT: is_<skim>

print_skim_size.py ~/downloads/sample_ntuples/D0--22_02_24--std--data--2016--md--000--new.root
From ntuple: /home/syp/downloads/sample_ntuples/D0--22_02_24--std--data--2016--md--000--new.root
   ISO:        5,982
   1OS:          561
   2OS:          170
    DD:        1,049
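For reference, the counting behind these numbers amounts to ANDing each is_<skim> flag with any extra cut flags. The sketch below is not the implementation of print_skim_size.py: the function `skim_sizes`, the toy events, and the branch names is_1os and is_dd (guessed from the skim labels) are all assumptions.

```python
SKIMS = ("iso", "1os", "2os", "dd")


def skim_sizes(events, extra_cuts=()):
    """Count events passing is_<skim> and every flag listed in extra_cuts."""
    counts = {}
    for skim in SKIMS:
        flags = (f"is_{skim}", *extra_cuts)
        counts[skim] = sum(all(ev[flag] for flag in flags) for ev in events)
    return counts


def make_event(skim, mu_ubdt_ok):
    """Build a toy event that belongs to exactly one skim."""
    ev = {f"is_{s}": (s == skim) for s in SKIMS}
    ev["mu_ubdt_ok"] = mu_ubdt_ok
    return ev


if __name__ == "__main__":
    # Toy events, invented for illustration only
    events = [
        make_event("iso", True),
        make_event("iso", False),
        make_event("2os", True),
    ]
    print("is_<skim>:             ", skim_sizes(events))
    print("mu_ubdt_ok & is_<skim>:", skim_sizes(events, ("mu_ubdt_ok",)))
```

Dropping a cut can only keep or increase each count, which matches the larger numbers in the "without UBDT" listing.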

yipengsun commented 2 years ago

I've uploaded the sample ntuples to glacier at: /home/syp/public/sample_ntuples.

manuelfs commented 2 years ago

Thank you, at first sight it looks like a great improvement. I checked that indeed is_iso is the same as is_iso_loose && b_m_ok && dx_m_ok && in_fit_range.
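A check like this one can be written as a per-event comparison. The following Python sketch assumes the events are available as dictionaries keyed by branch name; the helper names and the toy events are hypothetical, and only the flag names come from the comment above.

```python
def is_iso_from_parts(ev):
    """Recompute is_iso from the looser flag and the three extra cuts."""
    return bool(
        ev["is_iso_loose"] and ev["b_m_ok"] and ev["dx_m_ok"] and ev["in_fit_range"]
    )


def count_mismatches(events):
    """Number of events where the stored is_iso differs from the recomputed one."""
    return sum(bool(ev["is_iso"]) != is_iso_from_parts(ev) for ev in events)


if __name__ == "__main__":
    # Toy events, invented for illustration only
    events = [
        {"is_iso": True, "is_iso_loose": True, "b_m_ok": True,
         "dx_m_ok": True, "in_fit_range": True},
        {"is_iso": False, "is_iso_loose": True, "b_m_ok": False,
         "dx_m_ok": True, "in_fit_range": True},
    ]
    print("mismatches:", count_mismatches(events))
```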

The one thing I'm not sure of is how to select the fit templates for the MC. We could define the is_<skim> and is_<skim>_loose flags for MC to include all the cuts that should be applied to MC, that is, same cuts as data minus PID/some trigger. That way, plotting data or MC in a skim would both be tree->Draw("mm2", "is_iso * weight"), with the weight of data being 1, and the weight of MC including the PID/some trigger cuts.

yipengsun commented 2 years ago

For the MC skims, it's just wiso, which is defined as w*wskim_iso*skim_global_ok, where w is a global weight, wskim_iso plays the role of is_iso_loose, and skim_global_ok is a boolean that only contains kinematic cuts.
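Spelled out in Python, the product looks like the sketch below. The function name `mc_wiso` and the numbers in the usage lines are made up; only the formula w*wskim_iso*skim_global_ok is taken from the comment above.

```python
def mc_wiso(w, wskim_iso, skim_global_ok):
    """MC skim weight: global weight times skim weight, zeroed if kinematic cuts fail."""
    # The boolean is converted explicitly so the result is always a float
    return w * wskim_iso * float(skim_global_ok)


if __name__ == "__main__":
    # Made-up numbers, for illustration only
    print(mc_wiso(0.5, 0.5, True))   # event kept with its weight
    print(mc_wiso(0.5, 0.5, False))  # event fails the kinematic cuts
```

Multiplying by the boolean acts as the cut, so a single branch carries both the selection and the weight.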

The nice thing about MC is that we don't care about the SB region.

yipengsun commented 2 years ago

And for MC you should forget about the is_iso boolean because we need a weight for MC, not a boolean.

manuelfs commented 2 years ago

Ah, I had forgotten that we needed different weights for the different skims.

Still, I wonder whether it would be worth homogenizing data and MC. Given that wiso encapsulates both weights and cuts for MC, perhaps we can put the is_iso cuts for data in wiso instead? That would allow us to plot both data and MC as tree->Draw("mm2", "wiso").

yipengsun commented 2 years ago

That's a good point, but is_iso is of type bool and wiso double. Maybe we can do the following:

- Keep the is_iso branch as-is (boolean for data, doesn't exist in MC)
- Add the wiso branch for data, defined as is_iso cast to double
- Keep wiso as-is for MC
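To make the proposal concrete, here is a minimal Python sketch in which data gets wiso as is_iso cast to a float, after which data and MC go through the same weighted histogram, the analogue of tree->Draw("mm2", "wiso"). The function names, bin edges, and toy events are assumptions for illustration.

```python
from bisect import bisect_right


def add_data_wiso(events):
    """For data, define wiso as is_iso cast to a float (1.0 or 0.0)."""
    for ev in events:
        ev["wiso"] = float(ev["is_iso"])
    return events


def weighted_hist(events, var, weight, edges):
    """Sum `weight` in half-open bins [edges[i], edges[i+1]) of `var`."""
    hist = [0.0] * (len(edges) - 1)
    for ev in events:
        idx = bisect_right(edges, ev[var]) - 1
        if 0 <= idx < len(hist):  # skip underflow and overflow
            hist[idx] += ev[weight]
    return hist


if __name__ == "__main__":
    edges = [0.0, 2.0, 4.0, 6.0]
    # Toy events, invented for illustration only
    data = add_data_wiso([
        {"mm2": 1.0, "is_iso": True},
        {"mm2": 1.5, "is_iso": False},
        {"mm2": 5.0, "is_iso": True},
    ])
    mc = [{"mm2": 1.0, "wiso": 0.5}, {"mm2": 3.0, "wiso": 0.25}]
    # The same call now works for both samples
    print(weighted_hist(data, "mm2", "wiso", edges))
    print(weighted_hist(mc, "mm2", "wiso", edges))
```

The plotting code no longer needs to know whether it is looking at data or MC, at the cost of one extra double branch in the data ntuples.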

yipengsun commented 2 years ago

@manuelfs We discussed that we don't need to differentiate between B0/B+, D0/D+ after all. So let's just refer to both as b/d instead of b0,b/d0,d.

Let's wait a couple of days to see if we need additional changes.

yipengsun commented 2 years ago

I tried to count the total size of 2016 MagDown 1277341. It is ~7 GB. Not too bad.

yipengsun commented 2 years ago

I feel that changing the names for D0 is too much, as there are lots of flags named d0_XXX (and the documentation would need to be changed too, for consistency). I plan to change b/b0 to b only, while leaving d0 as-is.

This hybrid approach is not too bad, as you don't need to remember which B meson you are working on, and the dst/d0 is already differentiated anyway.

What do you think about this, @manuelfs?

manuelfs commented 2 years ago

The compromise sounds good

yipengsun commented 2 years ago

Closed for now.

umd-lhcb / lhcb-ntuples-gen

Naming convention for branches in step 2 ntuples #84