TTree model - Githubissues

jpata commented 8 years ago

What is the best way to access (read/write) a TTree? There are numerous proposed models out there, and we should learn from them. I would adopt something very close to raw ROOT, plus an additional layer with Julian AbstractDataFrame semantics.

default PyROOT (read-only)

via __getattr__:

objs = tree.myBranch1
print(objs)

The object schema is generated on the fly, i.e. if a branch contains a complex class (std::vector, pat::Electron), it will be loaded with Cling.
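The attribute-style access above can be illustrated without ROOT. This is only a toy sketch, not how PyROOT is implemented: `ToyTree`, its `branches` argument and `get_entry` are hypothetical names standing in for a tree whose branches are resolved lazily in `__getattr__`.

```python
class ToyTree:
    """Hypothetical stand-in for a TTree: branch name -> list of per-entry values."""

    def __init__(self, branches):
        self._branches = branches
        self._entry = 0

    def get_entry(self, i):
        # Select the current entry, loosely analogous to TTree::GetEntry.
        self._entry = i

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so branch names are resolved at access time.
        branches = self.__dict__.get("_branches", {})
        if name in branches:
            return branches[name][self.__dict__.get("_entry", 0)]
        raise AttributeError(f"no branch named {name!r}")


tree = ToyTree({"myBranch1": [[1.0, 2.0], [3.0]]})
tree.get_entry(1)
objs = tree.myBranch1
print(objs)
```

Unknown names still raise `AttributeError`, so typos in branch names fail loudly instead of returning `None`.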

rootpy

http://www.rootpy.org/auto_examples/tree/model_simple.html

read: like PyROOT, but buffered, so that accessing tree.myBranch1 loads the branch using TBranch.GetEntry only once
write: create a TTree based on a model, set branch values using tree.__setattr__, fill row-by-row as usual in ROOT
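
The buffered read can be sketched as a per-entry cache. This is a minimal illustration under assumptions, not rootpy's code: `BufferedTree`, `get_entry` and the `loads` counter are hypothetical, and a dictionary lookup stands in for the real TBranch.GetEntry call.

```python
class BufferedTree:
    """Hypothetical sketch: read each branch at most once per entry."""

    def __init__(self, branches):
        self._branches = branches  # branch name -> list of per-entry values
        self._entry = 0
        self._cache = {}           # branch name -> value for the current entry
        self.loads = 0             # counts simulated branch reads

    def get_entry(self, i):
        # Moving to another entry invalidates the buffer; nothing is read yet.
        self._entry = i
        self._cache.clear()

    def __getattr__(self, name):
        if name not in self.__dict__.get("_branches", {}):
            raise AttributeError(f"no branch named {name!r}")
        if name not in self._cache:
            # Stand-in for the expensive per-branch read.
            self.loads += 1
            self._cache[name] = self._branches[name][self._entry]
        return self._cache[name]


tree = BufferedTree({"myBranch1": [10, 20]})
tree.get_entry(0)
a = tree.myBranch1
b = tree.myBranch1  # served from the cache, no second read
print(a, b, tree.loads)
```

Branches that are never touched are never read, which is the main benefit over loading the whole entry up front.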

heppy

https://github.com/cbernet/heppy

read: like in PyROOT via tree.myBranch1
write: schedule an AutoFillTreeProducer, which knows how to translate a complex event model into a "flat ntuple" structure like

event.leptons = [Lepton(pt=120, eta=0.5, phi=0.2, mass=12), ...]
=> 
tree.nleptons # ::Int32, variable per row
tree.leptons_pt # (NTuple{NMAX, Float32}) with some predefined NMAX, each row has values up to tree.nleptons
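
The mapping above can be sketched in plain Python. This is an assumed illustration, not heppy's implementation: the `Lepton` dataclass, the `flatten` helper, the padding value and the choice to drop objects beyond `NMAX` are all hypothetical.

```python
from dataclasses import dataclass

NMAX = 4  # assumed fixed capacity per row


@dataclass
class Lepton:
    pt: float
    eta: float
    phi: float
    mass: float


def flatten(leptons, nmax=NMAX, fill=0.0):
    """Map a list of objects to a count plus fixed-length per-field columns."""
    kept = leptons[:nmax]  # in this sketch, objects beyond nmax are dropped
    row = {"nleptons": len(kept)}
    for field in ("pt", "eta", "phi", "mass"):
        values = [getattr(lep, field) for lep in kept]
        # Pad to nmax so that every row has the same shape.
        row[f"leptons_{field}"] = values + [fill] * (nmax - len(kept))
    return row


row = flatten([Lepton(120.0, 0.5, 0.2, 12.0), Lepton(45.0, -1.1, 2.9, 0.1)])
print(row["nleptons"], row["leptons_pt"])
```

Only the first `nleptons` values of each column are meaningful; the rest is padding.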

Example of scheduling:


#example of how to save an object (with derived characteristics)
leptonTypeVHbb = NTupleObjectType("leptonTypeVHbb", baseObjectTypes = [ leptonType ],
    variables = [
        NTupleVariable("looseIdSusy", lambda x : x.looseIdSusy if hasattr(x, 'looseIdSusy') else -1, int, help="Loose ID for Susy ntuples (always true on selected leptons)"),
        NTupleVariable("looseIdPOG", lambda x : x.muonID("POG_ID_Loose") if abs(x.pdgId()) == 13 else -1, int, help="Loose ID for Susy ntuples (always true on selected leptons)"),
        ...
    ]
)

#putting it all together into a tree
treeProducer = cfg.Analyzer(
    class_object=AutoFillTreeProducer,
    defaultFloatType = "F",
    verbose=False,
    vectorTree = True,
    globalVariables = [
        NTupleVariable("puWeightUp", lambda ev : getattr(ev,"puWeightPlus",1.), help="Pileup up variation",mcOnly=True),
        NTupleVariable("puWeightDown", lambda ev : getattr(ev,"puWeightMinus",1.), help="Pileup down variation",mcOnly=True),
        ...
    ],
    globalObjects = {
        "met" : NTupleObject("met", metType, help="PF E_{T}^{miss}, after default type 1 corrections"),
        ...
    },
    collections = {
        "selectedLeptons" : NTupleCollection("selLeptons", leptonTypeVHbb, 8, help="Leptons after the preselection"),
        ...
    }
)

https://github.com/cbernet/cmssw/blob/heppy_8_0_11_tutorial/PhysicsTools/Heppy/python/analyzers/core/AutoFillTreeProducer.py#L8

ROOTDataFrames.jl

columnar read: tdf[:myBranch1] => Vector{Float32} loads the branch into an in-memory column
read row-by-row: auto-generated immutable row class with correct types based on the TTree branches
write: transform an in-memory DataFrame to an on-disk TTree using writetree(df::DataFrame)
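
The idea of an auto-generated immutable row class can be illustrated in Python (ROOTDataFrames.jl itself does this with Julia metaprogramming). This is a sketch under assumptions: the `schema` dictionary is hypothetical and stands in for branch names and types discovered from a tree.

```python
from typing import NamedTuple

# Hypothetical schema, as it might be discovered from a tree's branches.
schema = {"n__jet": int, "jet__pt": list, "met": float}

# The functional form of NamedTuple builds an immutable class at run time.
Row = NamedTuple("Row", list(schema.items()))

row = Row(n__jet=2, jet__pt=[50.0, 30.0, 0.0], met=12.5)
print(row.jet__pt[:row.n__jet])
```

Because the class is generated from the schema, field access is by name and assigning to a field raises `AttributeError`.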

Example of row-by-row access

df = TreeDataFrame(["file1.root"]; treename="tree")

for i=1:nrow(df)
    load_row(df, i) # load all branches using TTree::GetEntry
    n__jet = df.row.n__jet() # without load_row, only this branch is read, via TBranch::GetEntry(i - 1)
    jet__pt = df.row.jet__pt()[1:n__jet]
end

@oschulz I'm continuing the discussion on gitter here.

oschulz commented 8 years ago

We should certainly offer something that can take advantage of Julia's metaprogramming facilities. I always imagined the branches could be auto-generated from a type (resp. immutable) structure - looks like rootpy is doing something in that direction. For reading, the reverse (auto-detecting branch types and the appropriate Julia equivalents) would certainly be good.

The question is whether to write TTree input based on the TTree API, or via TTreeReader. We'll probably have to experiment.

The auto-flattening of heppy is also something we may want to adopt. I'm familiar with the technique, I used it in daqcore v1 to map nested Scala case classes to branches. The downside is that, when reading (into auto-generated Julia types), it's unclear if an underscore indicates a nested structure or not (considering the file may have been written by a different framework). We could also try to auto-generate C++ classes to match nested Julia structures, then we don't need underscores - but that approach has downsides as well, esp. when trying to read that file from ROOT/C++.

Apart from that, heppy seems far more opinionated about how things should be done than would be suitable for the generic ROOT/Julia interface package. Something like this should probably live in an add-on package, at least longer term (the same probably holds for the current high-level TTree interface in ROOTFramework).

jpata commented 8 years ago

> We should certainly offer something that can take advantage of Julia's metaprogramming facilities.

This is what I implemented for ROOTDataFrames, I think it works quite well. However, as usual, the more layers of complexity, the more points of failure. Maybe something around TTreeReader would be good to try.

> The auto-flattening

Indeed. In my experience, you need to save the "schema" of the dataset at the time you declare how you map your data to a flat structure. Then you don't have to rely on underscores or some implicit structure for the reverse parsing (flat to objects). We should have a way of reading & writing immutables to branches.
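
A minimal sketch of this idea, under assumptions: the `flatten` and `unflatten` helpers and the JSON encoding of the schema are hypothetical, and dictionaries stand in for events and branches. The schema is recorded while flattening, so the reverse step never has to guess from underscores.

```python
import json


def flatten(event, sep="_"):
    """Flatten one level of nesting and record where each column came from."""
    flat, schema = {}, {}
    for key, value in event.items():
        if isinstance(value, dict):
            for sub, v in value.items():
                column = f"{key}{sep}{sub}"
                flat[column] = v
                schema[column] = [key, sub]
        else:
            flat[key] = value
            schema[key] = [key]
    return flat, schema


def unflatten(flat, schema):
    """Rebuild the nested event from the stored schema, not the column names."""
    event = {}
    for column, path in schema.items():
        if len(path) == 1:
            event[path[0]] = flat[column]
        else:
            event.setdefault(path[0], {})[path[1]] = flat[column]
    return event


event = {"met": {"pt": 40.0, "phi": 1.2}, "pu_weight": 0.98}
flat, schema = flatten(event)
stored = json.dumps(schema)  # would be saved alongside the flat data
restored = unflatten(flat, json.loads(stored))
print(restored == event)
```

Here `pu_weight` contains an underscore without being nested, which is exactly the case that name-based parsing would get wrong.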

> Apart from that, heppy seems far more opinionated

Agreed, I'm not a 100% fan of heppy either. It's more high-level, scheduling Analyzers that run in a loop over events. However, people are using it and they've experimented with various methods, so it's worth considering both the good and the bad they've done. I think we mostly care about AutoFillTreeProducer.

oschulz commented 8 years ago

Yes, heppy calls itself an "event processing framework". Sure, we'll need those too, and preferably a limited number of them, but one size will never fit all ... these should be add-on packages, I think. I may move my TTreeWriter out of ROOTFramework at some point too. Though there's a good reason at the moment for having it in there - it acts as a wrapper that takes care of type conversion, and as a place for proxy objects to live (so they don't have to be reallocated all the time).

oschulz commented 8 years ago

As far as DataFrames is concerned - maybe we don't need to actually implement the DataFrames interface. From what I understand, the DataFrames approach has performance limitations (the concept stems from Julia's early days), and there are newer ideas out now.

jpata commented 8 years ago

> As far as DataFrames is concerned - maybe we don't need to actually implement the DataFrames interface.

This has indeed been a long-standing issue, described nicely in this article by John Myles White. What I've done in ROOTDataFrames.jl is exactly what he suggests:

> A second possible solution is to generate custom DataFrame types for every distinct DataFrame object.

It works reasonably, would need some more standardization, but doesn't have to come first in wrapping ROOT. I still think DataFrames, based on the immense usefulness of pandas and R, will be useful to provide (in an external package perhaps).

oschulz commented 8 years ago

We might also want to take a look at TypedTables.jl. A DataFrames-like interface certainly can't hurt, but I think it should definitely be an add-on package.

The other thing is that the basic idea behind DataFrames is really an in-memory data model. TTrees are more of a data stream, really.
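
The difference between the two models can be sketched with a generator. This is only an illustration: `stream_rows` is a hypothetical stand-in for reading entries from disk one at a time.

```python
def stream_rows(n_entries):
    """Yield one row at a time; stands in for reading entries from disk."""
    for i in range(n_entries):
        yield {"entry": i, "pt": 10.0 * i}


# Stream model: one row is alive at a time, memory use stays constant.
total = 0.0
for row in stream_rows(5):
    total += row["pt"]

# In-memory model: the whole column is materialised before it is used.
column = [row["pt"] for row in stream_rows(5)]
print(total, sum(column))
```

Both give the same result; the in-memory form allows random access and repeated passes, at the cost of holding every entry at once.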

jpata commented 8 years ago

> We might also want to take a look at TypedTables.jl. A DataFrames-like interface certainly can't hurt, but I think it should definitely be an add-on package.

I agree (also about streams). So my takeaway would be that df-like interfaces should come later. We should try to isolate the bare minimum that we need to access branches from a TTree so that Julia can take advantage of the type info.

oschulz commented 8 years ago

Yes, I think so, too.

jpata / API

TTree model #3
