scikit-hep / hepconvert

BSD 3-Clause "New" or "Revised" License

Skimming branches #64

Closed alefisico closed 4 months ago

alefisico commented 4 months ago

Hi, I couldn't attend your talk at CAT, but I saw the slides. I am very interested in your tools, since my analysis takes some nanoAOD files and applies some simple skimming to make our pipeline faster. Your tool could really simplify our lives.

Thanks a lot for the help.

zbilodea commented 4 months ago

Hi, great to hear this tool could be helpful for you! Currently you can only run it locally (though hopefully in the future it will include dak). I am currently adding wildcard support for branch selection, as well as a "keep_branches" argument.

And sorry, just to clarify the example you are looking for: do you want to copy all the files in the dataset while removing branches, or merge them into one?

Thank you for the feedback!

alefisico commented 4 months ago

Hi @zbilodea, thanks for the reply. What I have in mind is something like this: I understand that you can currently keep or drop full branches. In nanoAOD there are boolean branches, like HLT_blahblah. So my question is whether you can keep or drop a subset of the full branches based on that boolean condition. Is that clearer? I can try to explain more if you need.

NJManganelli commented 4 months ago

Do you mean keep or drop TTree entries based on the per-entry boolean values (skimming), or keep/drop certain branches (slimming) based on which boolean branches are present?
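To make the distinction concrete, here is a minimal sketch in plain Python, with lists standing in for TTree branches. The branch names, values, and the helper functions `slim` and `skim` are made up for illustration and are not part of hepconvert or Uproot.

```python
from fnmatch import fnmatchcase

# Toy "tree": each branch is a list with one value per entry (event).
# Branch names and values are invented for illustration.
tree = {
    "HLT_Photon30": [True, False, True, False],
    "Photon_pt": [35.2, 12.1, 48.9, 20.3],
    "Photon_eta": [0.4, -1.2, 2.1, 0.0],
    "Muon_pt": [10.0, 55.5, 7.3, 31.8],
}


def slim(tree, patterns):
    """Slimming: keep whole branches whose names match any wildcard pattern."""
    return {
        name: values
        for name, values in tree.items()
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    }


def skim(tree, cut_branch):
    """Skimming: keep only the entries where the boolean branch is True."""
    mask = tree[cut_branch]
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in tree.items()
    }


# Skim first (the cut needs the HLT branch), then slim.
selected = slim(skim(tree, "HLT_Photon30"), ["Photon_*"])
print(selected)
```

Slimming changes which branches exist but leaves the number of entries alone; skimming does the opposite. Here the two are combined, so only the Photon_ branches survive, each with only the entries that fired the trigger.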

alefisico commented 4 months ago

I mean skimming. From what I understand, the current functionality only does slimming (by your definition).

zbilodea commented 4 months ago

Ah yes I see now! Thank you for clarifying, I will work on adding skimming!

zbilodea commented 4 months ago

I've added an option for skimming: the keyword argument "cut" (it works like Uproot's "cut" parameter in uproot.iterate() and TTree.arrays()). Let me know if there are any changes I could make to improve this, or if any issues come up! I'll add examples to the docs soon, but here's one for the case you mentioned. It cuts on an HLT branch and keeps only branches that start with "Photon_":

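import hepconvert

# Copy the input file, keeping only entries that pass the HLT_Photon30
# trigger (skimming) and only branches matching "Photon_*" (slimming).
hepconvert.copy_root(
    "destination.root",
    "nanoAOD_2015_CMS_Open_Data_ttbar.root",
    keep_branches=["Photon_*"],
    force=True,
    cut="HLT_Photon30",
)

alefisico commented 4 months ago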

Thank you so much @zbilodea, I will try it later and give you feedback.