key4hep / EDM4hep

Generic event data model for HEP collider experiments
https://cern.ch/edm4hep
Apache License 2.0
25 stars 36 forks source link

EDM4hep usage in RDataFrame (or similar high level libraries) #103

Open tmadlener opened 3 years ago

tmadlener commented 3 years ago

This is really a discussion that touches more parts of key4hep, and potentially also podio. I have mainly decided to put this here, because edm4hep is kind of in the center of it all and we do need a place to start.

@clementhelsens has started to implement quite a bit of functionality to read edm4hep files into an RDataFrame that have been produced with k4SimDelphes. We have had several smaller discussions already about how to achieve certain things most easily and so this is also a bit of a summary of these discussions and the issues that we have discovered. Furhtermore, some of the current problems mentioned here, might appear in a similar fashion when used in an uproot/awkward context, so they are not exclusive to using RDataFrames.

Some of the things work quite nicely "out of the box" when using the edm4hep output files in an RDataFrame:

Usage from the python side in this case can be a bit more cumbersome, but is still possible, and some of the "inconveniences" (see, e.g. here) are actively being worked on by the root developers.

The problem

So as long as one only wants to work with the members that are stored in the PODs, things work pretty seamlessly. However, it starts to get tricky as soon as one wants to start to access VectorMembers, OneToOneRelations or OneToManyRelations, because in that cases one has to

  1. get the correct branch that stores the vector or podio::ObjectIDs of the desired member.
  2. correctly interpret the contents and correlate that with the contents of the original data branch (where the begin and end indices are stored for VectorMembers and OneToManyRelations)
  3. access the desired data for a given object.

It is not impossible to do this, but the necessary functionality becomes a bit unwieldy pretty quickly and composing different functions becomes very hard. For example to get the PDG of an MCParticle related to a ReconstructedParticle something along the lines of the following is necessary

using namespace  edm4hep;
using ROOT::VecOps::RVec;

RVec<int> getPDG(RVec<int> recoInd, RVec<int> mcInd, RVec<MCParticleData> mcParts, RVec<ReconstructedParticleData> recoParts) {
  RVec<int> results;
  results.resize(recoParts.size(), -1);
  for (size_t i = 0; i < recoParts.size(); ++i) {
    results[ recoInd[i] ] = mcParts[ mcInd[i] ].PDG;
  }
  return results;
}

Then to use it, e.g. from python you still have to do:

import ROOT as r
df = r.RDataFrame('events', 'delphes_edm4hep_output.root')
df2 = df.Alias('MCRecoAssocMC', 'MCRecoAssociations#1.index')
        .Alias('MCRecoAssocReco', 'MCRecoAssociations#0.index')
        .Define('reco_pdg', 'getPDG(MCRecoAssocReco, MCRecoAssocMC, ReconstructedParticles, Particle)')

The Alias definitions are necessary because PyROOT does not gracefully handle # in branch names, when used directly in calls toDefine. However, there are several other issues with this approach:

Possible (steps towards a) solution

Obviously solving the above issue can probably not be done with one single approach and will rather be a combination of several different things. @clementhelsens and me have been discussing several possible approaches, and this is just a list of the things that we have considered up until now. It is by far not complete, and there might be better approaches that we have not yet considered, but that can also be discussed here.

In the end it is something that I think we need to address somehow, as a lot of the "analysis level" code seems to be using python and it's libraries more and more, even though "framework code" will of course still use edm4hep in its full glory.

vvolkl commented 3 years ago

Good summary. One more (RDataFrame-specific) point: While you can use both the top level collection names and the full branches, like:

df.Define('newcol1', func1, 'Particle')
df.Define('newcol2', func2, 'Particle.momentum.x')

it would be good to be able to also use the 'intermediate' types, like:

df.Define('newcol3', func3, 'Particle.momentum')

which would allow to write funcs that are quite a bit more general.

tmadlener commented 3 years ago

it would be good to be able to also use the 'intermediate' types, like:

df.Define('newcol3', func3, 'Particle.momentum')

which would allow to write funcs that are quite a bit more general.

Yes, I agree that would be very useful. I have just had a quick look with uproot and here I can also only access the components of momentum but not momentum as a whole.

Maybe this is something we have to talk with the ROOT guys about. Maybe, accessing momentum as a whole could be possible by proper settings while writing (branch level, etc.). However, ideally we would like to have both, the possibility to use momentum as a whole, while still being able to access the sub-members.

clementhelsens commented 3 years ago

I had an other thought here. It would be good to try to stay as much as possible with latest ROOT developments. So being able to produce RNtuple from edm4hep events in the framework would be excellent