Integrate the UBDT efficiency workflow w/ pidcalib2

yipengsun commented 2 years ago

We should discuss w/ PIDCalib experts to see if we can integrate the UBDT branch directly in the official PIDCalib sample.

The procedure would be the following:

[x] Download the 2016 PIDCalib ntuple
[x] Apply UBDT to each of them, merging the UBDT branches
[x] Generate a json file list to be consumed by pidcalib2
[x] Change pidcalib2 source code to add UBDT branches and use our local PIDCalib samples w/ the UBDT branches.

yipengsun commented 2 years ago

I've asked Vitalli about this. It looks like even if we can't merge the mu_UBDT branch directly, we can produce efficiency histograms relatively easily offline because the sWeight is already available to us.

yipengsun commented 2 years ago

It looks like the 2016 production has finished: https://its.cern.ch/jira/projects/WGP/issues/WGP-274?filter=allissues

yipengsun commented 2 years ago

From Vitalli's reply, I think we should do this:

Download the 2016 PIDCalib ntuple (the Mu_nopt ones)
Apply UBDT to each of them, storing them as friends
Optional: See if we can merge the friend trees directly, at least locally
Use the official pidcalib2 to find efficiencies
- This is doable because Vitalli claimed that pidcalib2 can specify the input ntuples to use.

yipengsun commented 2 years ago

@emilyj816 Let's focus on the downloading step for now.

First, write a Python script called ntuple_grabber.py under scripts, make sure that all variables are named like iAmAVariable.

This script should read the spec YAML in spec/pidcalib.yml, then download each ntuple to the correct location. You need to compute the checksum of each file with

sha256sum <path_to_file>

to make sure the downloaded ntuple is not corrupted. You probably also want to download with rsync over ssh, so that only incremental downloads are done.

Also, to parse the YAML, just use pyyaml. You can add that as a dependency in requirements.txt.
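A minimal sketch of the integrity check and the incremental download described above, using only the Python standard library. The function names, the remote host/path arguments, and the choice of rsync flags are assumptions for illustration; the layout of spec/pidcalib.yml is not shown in this thread, so the YAML parsing is left out.

```python
import hashlib
from pathlib import Path


def computeSha256(filePath, chunkSize=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(filePath, "rb") as inputFile:
        # Read fixed-size chunks so large ntuples never sit fully in memory.
        for chunk in iter(lambda: inputFile.read(chunkSize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def isFileIntact(filePath, expectedSha256):
    """True if the file exists and its digest matches the expected one."""
    if not Path(filePath).is_file():
        return False
    return computeSha256(filePath) == expectedSha256.lower()


def buildRsyncCommand(remoteHost, remotePath, localDir):
    """Build an rsync-over-ssh command (hypothetical host and paths).

    -a preserves file attributes, --partial keeps interrupted transfers
    so that a re-run can resume them, and -e ssh selects the transport.
    """
    return [
        "rsync", "-a", "--partial", "-e", "ssh",
        f"{remoteHost}:{remotePath}", str(localDir),
    ]
```

The digest produced by computeSha256 should match the first column printed by sha256sum for the same file, so either tool can be used to produce the reference checksums.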

yipengsun commented 2 years ago

Sorry, after some thought, let's use ntuple_grabber.py as the script name, while maintaining the iAmAVariable naming convention inside the code.

yipengsun commented 2 years ago

I see that you've created a branch for this. I made additional changes to the project and have merged them into your branch. Don't forget to do a git pull before committing! @emilyj816

yipengsun commented 2 years ago

So the total size of the 2016 official PIDCalib ntuples is just 2.7 TB. This is much smaller than I expected.

emilyj816 commented 2 years ago

Hi Yipeng, I'm trying to figure out the python script to call test-nix-pkg, and I'm having a hard time looking online for certain syntax things. My plan is to pass make test-nix-pkg to a python script which parses the yaml file and is able to pass the filenames to the Makefile. Here is my current attempt from the command line:

make test-nix-pkg gen/pidcalib_w_nix_pkg.root=gen/real_pidcalib_w_nix_pkg.root samples/Jpsi--21_11_30--pidcalib--data_turbo--2016--mu--Mu_nopt-subset.root=/home/public/pidcalib_ntuples/remote/Mu_nopt-2016-MagDown/00152085_00000001_1.pidcalib.root

This results in make: *** No rule to make target 'samples/Jpsi--21_11_30--pidcalib--data_turbo--2016--mu--Mu_nopt-subset.root', needed by 'test-nix-pkg'. Stop.

I guess it's because the input file is a prerequisite and so there must be a different way to specify the name of the file. I've done a lot of searching and haven't been able to figure it out yet, I was wondering if you knew off the top of your head an easy way to do this. Thanks!

yipengsun commented 2 years ago

I think you don't need to use make at all. What you need is to call the AddUBDTBranchPidCalib executable directly in Python, say, with os.system; its usage is listed in the Makefile.

To make your life easier, I'll list the usage of the executable explicitly here:

AddUBDTBranchPidCalib -i <path_to_a_pidcalib_ntuple> -o <path_to_output_ntuple> -p probe -b UBDT -t <tree1>,<tree2>

An example:

AddUBDTBranchPidCalib -i /some/folder/input.root -o /some/other/folder/output.root -p probe -b UBDT -t "tree1","tree2"

Also, you should check the exit code for each run to ensure the command was executed properly. If you decide to use os.system, you can do ret_code = os.system("<your command>") and just check that ret_code == 0.

Edit: Just to be clear: The make rule was meant to tell you how to use AddUBDTBranchPidCalib. I never pointed that out clearly. Sorry!
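For reference, a minimal sketch of calling the executable from Python and checking its exit code. It uses subprocess.run with an argument list instead of os.system, which is a substitution on my part: it avoids shell quoting problems with paths. The function names are hypothetical; the command-line flags are the ones listed above.

```python
import subprocess


def buildUbdtCommand(inputPath, outputPath, treeNames,
                     exePath="AddUBDTBranchPidCalib"):
    """Assemble the argument list following the usage quoted above."""
    return [
        exePath,
        "-i", str(inputPath),
        "-o", str(outputPath),
        "-p", "probe",
        "-b", "UBDT",
        "-t", ",".join(treeNames),  # trees are passed comma-separated
    ]


def runCommand(command):
    """Run a command and return its exit code (0 means success).

    subprocess.run raises FileNotFoundError if the executable itself
    cannot be found, so that case never yields a return code.
    """
    completed = subprocess.run(command)
    return completed.returncode


def applyUbdt(inputPath, outputPath, treeNames):
    """Run the executable on one ntuple and fail loudly on a bad exit code."""
    retCode = runCommand(buildUbdtCommand(inputPath, outputPath, treeNames))
    if retCode != 0:
        raise RuntimeError(f"UBDT failed on {inputPath} (exit code {retCode})")
```

Looping applyUbdt over all downloaded ntuples stops at the first failure; collecting the failing paths in a list instead is an easy variation if a full pass is preferred.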

emilyj816 commented 2 years ago

I understand now, thank you!

emilyj816 commented 2 years ago

In the process of running my script to call AddUBDTBranchPidCalib for all the ntuples, should be ready sometime in the future. I've pushed the code to my branch in scripts/nix_tester.py, in case you wanted to take a look. The cleaned up version of ntuple_grabber.py is also on git, and the cleaned up sha_checker.py will follow soon.

yipengsun commented 2 years ago

BTW, I think a better name for nix_tester would be just apply_ubdt, as that's what's actually happening right :-)

yipengsun commented 2 years ago

Final step for Emily:

Try to generate a JSON file of the following form: https://gitlab.cern.ch/lhcb-rta/pidcalib2/-/blob/master/src/pidcalib2/data/samples.json
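A minimal sketch of how such a file list could be generated. It assumes one directory per sample (for example Mu_nopt-2016-MagDown, as in the path quoted earlier) and assumes the JSON maps each sample name to an object with a "files" list; that layout should be checked against the linked samples.json before use. The function names are hypothetical.

```python
import json
from pathlib import Path


def buildSampleDict(ntupleRoot, pattern="*.root"):
    """Map each sample directory name to the sorted list of its ntuples."""
    sampleDict = {}
    for sampleDir in sorted(Path(ntupleRoot).iterdir()):
        if not sampleDir.is_dir():
            continue  # skip stray files at the top level
        fileList = sorted(str(path) for path in sampleDir.glob(pattern))
        if fileList:
            # Assumed layout: {"<sample name>": {"files": [...]}}
            sampleDict[sampleDir.name] = {"files": fileList}
    return sampleDict


def writeSampleJson(sampleDict, jsonPath):
    """Write the sample dictionary as indented JSON."""
    with open(jsonPath, "w") as outputFile:
        json.dump(sampleDict, outputFile, indent=2)
```

Sorting both the directories and the file names keeps the output stable between runs, which makes diffs of the generated JSON easier to review.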

emilyj816 commented 2 years ago

Hey Yipeng,

I've pushed 1) my script for writing the JSON file and 2) the JSON file to git, under the names json_writer.py and samples.json, respectively. Let me know if there's anything that looks wrong or needs to be renamed/cleaned up. Also, sha_checker.py has been cleaned up and no longer has a lot of hard-coded components.

yipengsun commented 2 years ago

Thanks! I'll try to clean up the code over the weekend and merge the branch.

yipengsun commented 2 years ago

I was able to use uproot to combine the friend UBDT ntuple with the raw PIDCalib ntuple chunk by chunk, so the whole file does NOT need to be loaded into memory. It is not trivial.

As a validation, I checked the Jpsi_M and probe_UBDT branches between the raw-merged and friend-merged ntuples. They agree perfectly. This suggests that the merging was successful.

yipengsun commented 2 years ago

Merging of the PIDCalib ntuples w/ their corresponding UBDT friend ntuples has started.

yipengsun commented 2 years ago

The merging is still bugged. I've opened an issue upstream.

yipengsun commented 2 years ago

OK, I've decided to do the irresponsible thing and use a very large step size. This means that we are reading a huge chunk of data into memory directly (~3 GB I guess).

This is not too bad for the server and I'm going to proceed for now.

yipengsun commented 2 years ago

This is done.

umd-lhcb / MuonBDTPid

Integrate the UBDT efficiency workflow w/ pidcalib2 #9