EDM4hep usage in RDataFrame (or similar high level libraries)

tmadlener commented 3 years ago

This is really a discussion that touches more parts of key4hep, and potentially also podio. I have mainly decided to put this here, because edm4hep is kind of in the center of it all and we do need a place to start.

@clementhelsens has started to implement quite a bit of functionality to read edm4hep files into an RDataFrame that have been produced with k4SimDelphes. We have had several smaller discussions already about how to achieve certain things most easily and so this is also a bit of a summary of these discussions and the issues that we have discovered. Furhtermore, some of the current problems mentioned here, might appear in a similar fashion when used in an uproot/awkward context, so they are not exclusive to using RDataFrames.

Some of the things work quite nicely "out of the box" when using the edm4hep output files in an RDataFrame:

Accessing POD members that are stored in the Data classes / branches. This works without any additional code, e.g.

import ROOT as r
df = r.RDataFrame('events', 'delphes_edm4hep_output.root')
reco_e = df.Histo1D('ReconstructedParticles.energy')
mc_px = df.Histo1D('Particle.momentum.x')
reco_e.Draw()

Also defining additional functionality to obtain e.g. the pT can be done quite easily, with some additional utility functions, e.g. https://github.com/key4hep/EDM4hep/blob/e17d47ab9f17f3a56fb90550d307275cacd93153/dataframe/edm4hepRDF.cc#L6-L12 which can then be used via: https://github.com/key4hep/EDM4hep/blob/e17d47ab9f17f3a56fb90550d307275cacd93153/test/test_rdf.cc#L15

Usage from the python side in this case can be a bit more cumbersome, but is still possible, and some of the "inconveniences" (see, e.g. here) are actively being worked on by the root developers.

The problem

So as long as one only wants to work with the members that are stored in the PODs, things work pretty seamlessly. However, it starts to get tricky as soon as one wants to start to access VectorMembers, OneToOneRelations or OneToManyRelations, because in that cases one has to

get the correct branch that stores the vector or podio::ObjectIDs of the desired member.
correctly interpret the contents and correlate that with the contents of the original data branch (where the begin and end indices are stored for VectorMembers and OneToManyRelations)
access the desired data for a given object.

It is not impossible to do this, but the necessary functionality becomes a bit unwieldy pretty quickly and composing different functions becomes very hard. For example to get the PDG of an MCParticle related to a ReconstructedParticle something along the lines of the following is necessary

using namespace  edm4hep;
using ROOT::VecOps::RVec;

RVec<int> getPDG(RVec<int> recoInd, RVec<int> mcInd, RVec<MCParticleData> mcParts, RVec<ReconstructedParticleData> recoParts) {
  RVec<int> results;
  results.resize(recoParts.size(), -1);
  for (size_t i = 0; i < recoParts.size(); ++i) {
    results[ recoInd[i] ] = mcParts[ mcInd[i] ].PDG;
  }
  return results;
}

Then to use it, e.g. from python you still have to do:

import ROOT as r
df = r.RDataFrame('events', 'delphes_edm4hep_output.root')
df2 = df.Alias('MCRecoAssocMC', 'MCRecoAssociations#1.index')
        .Alias('MCRecoAssocReco', 'MCRecoAssociations#0.index')
        .Define('reco_pdg', 'getPDG(MCRecoAssocReco, MCRecoAssocMC, ReconstructedParticles, Particle)')

The Alias definitions are necessary because PyROOT does not gracefully handle # in branch names, when used directly in calls toDefine. However, there are several other issues with this approach:

We implicitly rely on the fact that all the MCRecoParticleAssociations that we use here are in the same collection, as we only use the index of the podio::ObjectID.
- I am not sure if it would be possible to support any kind of relation that has objects in different collections.
We have to know how to interpret the branch names that are created for relations and vector members. I.e. we have to check the order in which they are defined in the yaml file to identify the index that we need to use after the # or _ in the branch names (while keeping in mind that the order also depends on whether they are OneToManyRelation or OneToOneRelation).
If we want to access information that is accessible only via a relation from a relation, e.g. TrackerHit from a Track that we get from a ReconstructedParticle, we have to know about all the involved collections, and also about the structure of the involved objects and relations. Additionally, we need all this information ready at a rather high level already and cannot really hide this in some abstraction without rather limiting assumptions on the structure (resp. naming scheme) of the input file.
All the implementations rely on juggling with indices, which is on the one hand pretty error prone when implementing the logic, but on the other hand also very hard to read and interpreted once implemented.

Possible (steps towards a) solution

Obviously solving the above issue can probably not be done with one single approach and will rather be a combination of several different things. @clementhelsens and me have been discussing several possible approaches, and this is just a list of the things that we have considered up until now. It is by far not complete, and there might be better approaches that we have not yet considered, but that can also be discussed here.

Make the framework produce an easier to digest (i.e. flatter) data format from the original edm4hep EDM. This is basically the approach of having different data tiers like the LHC experiments. This would imply that there has to be a dedicated module that produces output from the edm4hep format that is used in the framework, but this would still allow for a "central" production of such an "analysis level" data format, that can be easily used in RDataFrames or uproot/awkward based analysis frameworks. This would allow to handle the complicated relation navigation still with all the facilities of the core datamodel. On the other hand, this would imply to have another set of utilities that then works on this output format.
Have the possibility that podio still handles all the relations before it is passed to RDataFrame. There is the possibility to define RDataSource for RDataFrame that might be able to handle this. This could be a very elegant solution to this problem, with the caveat that it only works for RDataFrame. The advantage in this case could be that podio could potentially also do some additional code generation, that would make maintaining utility code a bit easier.
Another option would be to offer some utility functionality, that allows to operate on the original edm4hep outputs. However, this would then need to make several, possibly limiting assumptions on, e.g. the available collections (and their names). Nevertheless, it might be possible to cover a lot of the use cases with this, even though it is not fully generic.

In the end it is something that I think we need to address somehow, as a lot of the "analysis level" code seems to be using python and it's libraries more and more, even though "framework code" will of course still use edm4hep in its full glory.

vvolkl commented 3 years ago

Good summary. One more (RDataFrame-specific) point: While you can use both the top level collection names and the full branches, like:

df.Define('newcol1', func1, 'Particle')
df.Define('newcol2', func2, 'Particle.momentum.x')

it would be good to be able to also use the 'intermediate' types, like:

df.Define('newcol3', func3, 'Particle.momentum')

which would allow to write funcs that are quite a bit more general.

tmadlener commented 3 years ago

it would be good to be able to also use the 'intermediate' types, like:
df.Define('newcol3', func3, 'Particle.momentum')
which would allow to write funcs that are quite a bit more general.

Yes, I agree that would be very useful. I have just had a quick look with uproot and here I can also only access the components of momentum but not momentum as a whole.

Maybe this is something we have to talk with the ROOT guys about. Maybe, accessing momentum as a whole could be possible by proper settings while writing (branch level, etc.). However, ideally we would like to have both, the possibility to use momentum as a whole, while still being able to access the sub-members.

clementhelsens commented 3 years ago

I had an other thought here. It would be good to try to stay as much as possible with latest ROOT developments. So being able to produce RNtuple from edm4hep events in the framework would be excellent

key4hep / EDM4hep

EDM4hep usage in RDataFrame (or similar high level libraries) #103

The problem

Possible (steps towards a) solution