CMS NanoAOD interface

jpivarski commented 5 years ago

There should be a mechanism that recognizes a TTree as NanoAOD and presents a virtual, formatted view of the data, using knowledge of NanoAOD idioms. For instance, Muon_* should be collected into a single jagged table called muons with the muon branches as its columns. It should use VirtualArrays, so that you can carry an array of muons around without having loaded all of the branches. References between particles and jets—expressed as integer indexes in NanoAOD—should be IndexedArrays. I'm on the fence about making them ChunkedArrays at the basket level—that may be too small. Perhaps they could be ChunkedArrays at the file level (or a function for loading them that takes the chunking size as an option).
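A minimal sketch of the grouping idea, assuming each branch can be read through a zero-argument loader function. `LazyCollection` and `make_loader` are hypothetical names invented for this illustration; this is a stand-in for the VirtualArray behaviour described above, not awkward's implementation.

```python
import numpy as np


class LazyCollection:
    """Columns that share a branch prefix, each read only on first access.

    Hypothetical helper for illustration; not part of uproot or awkward.
    """

    def __init__(self, prefix, loaders):
        # loaders: branch name -> zero-argument function returning a NumPy array
        self._loaders = {
            name[len(prefix) + 1:]: load
            for name, load in loaders.items()
            if name.startswith(prefix + "_")
        }
        self._cache = {}

    @property
    def columns(self):
        return sorted(self._loaders)

    def __getitem__(self, column):
        # Read the branch the first time it is requested, then reuse it.
        if column not in self._cache:
            self._cache[column] = self._loaders[column]()
        return self._cache[column]


reads = []  # records which branches were actually read


def make_loader(name, values):
    def load():
        reads.append(name)
        return np.asarray(values)
    return load


branches = {
    "Muon_pt": make_loader("Muon_pt", [31.2, 12.5, 45.0]),
    "Muon_eta": make_loader("Muon_eta", [0.3, -1.2, 2.1]),
    "Jet_pt": make_loader("Jet_pt", [80.0, 55.5]),
}

muons = LazyCollection("Muon", branches)
pt = muons["pt"]        # triggers one read
pt_again = muons["pt"]  # served from the cache
```

Only `Muon_pt` is read here; `Muon_eta` stays untouched until someone asks for `muons["eta"]`, which is the point of carrying the collection around lazily.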

This was inspired by scikit-hep/awkward-array#95.

jpivarski commented 5 years ago

The links I'm referring to are for cross-cleaning.

guitargeek commented 5 years ago

This is such a great proposal! I am doing something similar in my analysis, but unfortunately there is a large overhead when loading NanoAODs, because the individual columns are spread over several files accessed via xrootd (about 20 per dataset). You may want to keep this in mind when thinking about a solution. Some day, I hope, CMS data will be stored in columnar form anyway!

Some more things I observed when working with awkward and NanoAOD:

1) Cross-cleaning info in NanoAOD is not really necessary, since cross-cleaning is so fast and easy with awkward and uproot-methods, as shown by this example, where I do a quick TnP study on Run2 data: https://github.com/guitargeek/geeksw/blob/master/examples/electron_tnp.ipynb (ln [17]). The cross-cleaning information could be dropped to save space.

2) One could save space in NanoAOD by saving branches directly in the jagged table style described here. There should be one flat tree per object (electrons, muons), with the "starts" and "stops" for each event and object stored only once in the main event tree. Right now, this information is in every branch, which is redundant because many branches belong to the same object and have the same starts and stops.
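To make point 1 concrete, here is a rough single-event sketch of cross-cleaning jets against electrons by ΔR, using plain NumPy broadcasting. The 0.4 threshold, the function name `delta_r`, and the toy values are assumptions for illustration; this is not the code from the linked notebook.

```python
import numpy as np


def delta_r(eta1, phi1, eta2, phi2):
    """Pairwise Delta R between two sets of objects in one event."""
    deta = eta1[:, None] - eta2[None, :]
    dphi = phi1[:, None] - phi2[None, :]
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi  # wrap into [-pi, pi)
    return np.hypot(deta, dphi)


# Toy values for one event (assumed, not real data)
jet_eta = np.array([0.5, -1.0, 2.0])
jet_phi = np.array([0.1, 3.0, -2.0])
ele_eta = np.array([0.55, 1.9])
ele_phi = np.array([0.15, -2.05])

dr = delta_r(jet_eta, jet_phi, ele_eta, ele_phi)  # shape (n_jets, n_electrons)
keep = (dr > 0.4).all(axis=1)                     # jet survives if far from every electron
clean_jet_eta = jet_eta[keep]
```

With no electrons in the event, `all` over the empty axis is true, so every jet is kept, which is the behaviour one would want.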

All in all, I think uproot/awkward should not only adapt better to NanoAOD, but NanoAOD could also benefit from the lessons learned in awkward the other way around.

jpivarski commented 5 years ago

Actually, they evolved together—we were talking with each other when NanoAOD and awkward were both being developed. I had some suggestions about the branch type: to use ROOT arrays instead of std::vector (which adds 10 bytes per event per branch). This is the JaggedArray format, almost byte for byte. (ROOT's offsets are byte offsets relative to the TKey, rather than item offsets relative to the start of data, but that's a subtraction and a bit-shift.)
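The "subtraction and a bit-shift" can be sketched as follows. The header length of 64 bytes and the offsets are made-up numbers, and the shift only works when the item size is a power of two.

```python
import numpy as np

key_length = 64   # assumed header size in bytes (illustrative value)
itemsize = 4      # e.g. 32-bit floats

# Byte offsets as stored, measured from the start of the key (toy values)
byte_offsets = np.array([64, 72, 72, 84], dtype=np.int32)

shift = itemsize.bit_length() - 1                   # log2(itemsize) for powers of two
item_offsets = (byte_offsets - key_length) >> shift  # now counted in items from the start of data
```

Here the three events hold 2, 0 and 3 items, and `item_offsets` can be used directly as jagged-array offsets.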

NanoAOD can save space by storing one set of counts (nMuons) instead of the counts and also the offsets (internally in each of Muon_pt, Muon_eta, etc.), but that's also a ROOT feature, motivated by NanoAOD itself: the TTree::IOFeatures bit that tells ROOT to not store offsets and get everything from the counts. I don't know if this feature is being used in production because it's not backward-compatible in ROOT, but if it is used, uproot can read it and nobody would notice that it's there, apart from a 30% savings in space.
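As a small sketch of why the counts are sufficient: a cumulative sum of the per-event counts gives offsets that every branch of the same object can share. The array names and values below are illustrative only.

```python
import numpy as np

n_muons = np.array([2, 0, 3])                       # one count per event
muon_pt = np.array([31.2, 12.5, 45.0, 22.1, 9.8])   # flat content, all events
muon_eta = np.array([0.3, -1.2, 2.1, 0.0, -0.7])

# One set of offsets, derived once from the counts
offsets = np.zeros(len(n_muons) + 1, dtype=np.int64)
offsets[1:] = np.cumsum(n_muons)
starts, stops = offsets[:-1], offsets[1:]


def event_view(flat, i):
    """Slice of a flat branch belonging to event i."""
    return flat[starts[i]:stops[i]]
```

The same `starts` and `stops` slice both `muon_pt` and `muon_eta`, so nothing is lost by not storing offsets per branch.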

What I'm talking about in this issue is not about changing any formats or making anything more efficient—just packaging it up in a more intuitive way. Turning NanoAOD's links into IndexedArrays wouldn't change their speed, but it would make them act like pointers without user intervention. (As though you had a subset of the electron objects nested within their jets, rather than having to do some extra indexing by hand.)
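The "extra indexing by hand" looks roughly like this in plain NumPy. The sketch assumes the link is an index within the event, with -1 meaning "no link"; the variable names and values are invented for illustration.

```python
import numpy as np

# Electrons: 2 in event 0, 1 in event 1 (flat content plus offsets)
electron_pt = np.array([40.0, 25.0, 33.0])
electron_offsets = np.array([0, 2, 3])

# Jets: 2 in event 0, 2 in event 1
jet_event = np.array([0, 0, 1, 1])            # which event each jet belongs to
jet_electron_idx = np.array([1, -1, 0, -1])   # per-event index, -1 = no electron

has_link = jet_electron_idx >= 0
# Convert per-event indexes into positions in the flat electron array
global_idx = electron_offsets[jet_event] + jet_electron_idx
safe_idx = np.where(has_link, global_idx, 0)  # placeholder index for missing links
linked_pt = np.where(has_link, electron_pt[safe_idx], np.nan)
```

An IndexedArray view would hide this bookkeeping, so that asking a jet for its electron simply returns the electron.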

guitargeek commented 5 years ago

Thank you for your explanations! As someone relatively new to CMS, I'm always very glad when someone explains to me how things evolved historically. I did not know much of this, so thanks for taking the time to answer, even though my previous comment was not really on topic, as I see now.

jpivarski commented 5 years ago

That's okay—it's good to hear about the level of interest!

The thread here will be replaced with a PR as soon as I start actually working on it anyway.

nsmith- commented 5 years ago

Hey, can we use the recursively defined IndexedArray for the gen particle parents? :)

jpivarski commented 5 years ago

Yes. If the gen particles look something like this:

import awkward

tree = awkward.fromiter([
    {"value": 1.23, "left":    1, "right":    2},     # node 0
    {"value": 3.21, "left":    3, "right":    4},     # node 1
    {"value": 9.99, "left":    5, "right":    6},     # node 2
    {"value": 3.14, "left":    7, "right": None},     # node 3
    {"value": 2.71, "left": None, "right":    8},     # node 4
    {"value": 5.55, "left": None, "right": None},     # node 5
    {"value": 8.00, "left": None, "right": None},     # node 6
    {"value": 9.00, "left": None, "right": None},     # node 7
    {"value": 0.00, "left": None, "right": None},     # node 8
])
left = tree.contents["left"].content
right = tree.contents["right"].content
left[(left < 0) | (left > 8)] = 0         # satisfy overzealous validity checks
right[(right < 0) | (right > 8)] = 0
tree.contents["left"].content = awkward.IndexedArray(left, tree)
tree.contents["right"].content = awkward.IndexedArray(right, tree)

tree[0].tolist()

we can make a tree. (That's what the above does: tree[0] is the tree and all other elements of tree are its subtrees. Try the above code: it prints out a nested dict of dicts.)
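For intuition only, the same index-following can be written in plain Python without awkward. This is a sketch of the idea on a three-node tree, with a hypothetical `expand` helper; it is not a claim about what the library does internally.

```python
nodes = [
    {"value": 1.23, "left": 1, "right": 2},        # node 0
    {"value": 3.21, "left": None, "right": None},  # node 1
    {"value": 9.99, "left": None, "right": None},  # node 2
]


def expand(i):
    """Follow integer links recursively, turning node i into nested dicts."""
    if i is None:
        return None
    node = nodes[i]
    return {
        "value": node["value"],
        "left": expand(node["left"]),
        "right": expand(node["right"]),
    }


nested = expand(0)
```

The difference is that this copies data into Python objects, whereas the IndexedArray version keeps everything as flat arrays and only follows the links when asked.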

nsmith- commented 5 years ago

lol that's a BDT. So the NanoAOD is bottom-up rather than top-down: each entry in the list has a reference to the index of its parent entry. I played a bit with the recursive thing after reading your spec and I'm pretty sure it's possible. Just keeping it on the radar. I can devote some time to an implementation if you like.

jpivarski commented 5 years ago

BDT was a motivating case. Yes, these are top-down arrows, so that you can walk from root to leaf. If gen particle arrows point from leaf to root, then a new calculation would be needed. Since we'd only want to do that on demand, it could be in a VirtualArray.
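One possible shape for that "new calculation", sketched for a single event with NumPy: sort particles by parent index, then look up each parent's block of children. The parent array is a toy example with -1 marking the root, and `children` is a hypothetical helper.

```python
import numpy as np

# Bottom-up links: parent[i] is the index of particle i's parent, -1 for the root
parent = np.array([-1, 0, 1, 0, 1, 3])

# Stable sort groups particles by parent while keeping their original order
order = np.argsort(parent, kind="stable")
sorted_parent = parent[order]

ids = np.arange(len(parent))
starts = np.searchsorted(sorted_parent, ids, side="left")
stops = np.searchsorted(sorted_parent, ids, side="right")


def children(i):
    """Indexes of the particles whose parent is particle i (top-down links)."""
    return order[starts[i]:stops[i]]
```

Since this costs a sort per event, deferring it until someone actually walks the tree downward fits the on-demand idea above.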

There are quite a few good things the CMS NanoAOD extension could have. It's not short-timescale like the awkward/uproot-methods version management, though.

By the way, I lost track of something you said about mocking Methods—I didn't understand and then lost the tab. Could you ping me on that with more explanation?

scikit-hep / uproot3-methods

CMS NanoAOD interface #45