Open tmadlener opened 3 years ago
Good summary. One more (RDataFrame-specific) point: While you can use both the top level collection names and the full branches, like:
df.Define('newcol1', func1, 'Particle')
df.Define('newcol2', func2, 'Particle.momentum.x')
it would be good to be able to also use the 'intermediate' types, like:
df.Define('newcol3', func3, 'Particle.momentum')
which would allow writing funcs that are quite a bit more general.
> it would be good to be able to also use the 'intermediate' types, like:
> df.Define('newcol3', func3, 'Particle.momentum')
> which would allow writing funcs that are quite a bit more general.
Yes, I agree that would be very useful. I have just had a quick look with uproot and here I can also only access the components of `momentum` but not `momentum` as a whole.
Maybe this is something we have to talk with the ROOT guys about. Accessing `momentum` as a whole might be possible with the proper settings while writing (branch level, etc.). However, ideally we would like to have both: the possibility to use `momentum` as a whole, while still being able to access the sub-members.
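Until the intermediate types can be read directly, one possible workaround is to rebuild them from the leaf branches in user code. The following is only a minimal sketch in plain Python. It assumes the per-event columns have already been read into a dict of lists keyed by branch name (for example with uproot). The `Vec3` type, the `build_intermediate` helper and the sample values are made up for illustration and are not part of edm4hep, ROOT or uproot.

```python
import math
from typing import Dict, List, NamedTuple


class Vec3(NamedTuple):
    """Hypothetical stand-in for an intermediate type such as Particle.momentum."""
    x: float
    y: float
    z: float


def build_intermediate(columns: Dict[str, List[float]], prefix: str) -> List[Vec3]:
    """Zip the leaf columns '<prefix>.x', '.y' and '.z' into a list of Vec3."""
    xs = columns[f"{prefix}.x"]
    ys = columns[f"{prefix}.y"]
    zs = columns[f"{prefix}.z"]
    if not (len(xs) == len(ys) == len(zs)):
        raise ValueError(f"leaf columns of {prefix!r} differ in length")
    return [Vec3(x, y, z) for x, y, z in zip(xs, ys, zs)]


def magnitude(v: Vec3) -> float:
    """A 'general' function: works on any three-vector, not only on momenta."""
    return math.sqrt(v.x ** 2 + v.y ** 2 + v.z ** 2)


# Made-up leaf branches for one event containing two particles.
event = {
    "Particle.momentum.x": [3.0, 0.0],
    "Particle.momentum.y": [4.0, 0.0],
    "Particle.momentum.z": [0.0, 2.0],
}

momenta = build_intermediate(event, "Particle.momentum")
magnitudes = [magnitude(p) for p in momenta]
print(magnitudes)
```

With the intermediate object available, `magnitude` can be reused for any other three-vector member without writing one function per set of leaf names, which is the generality asked for above.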
I had another thought here. It would be good to stay as close as possible to the latest ROOT developments, so being able to produce RNTuple output from edm4hep events in the framework would be excellent.
This is really a discussion that touches more parts of key4hep, and potentially also podio. I have mainly decided to put this here, because edm4hep is kind of in the center of it all and we do need a place to start.
@clementhelsens has started to implement quite a bit of functionality to read edm4hep files that have been produced with k4SimDelphes into an RDataFrame. We have already had several smaller discussions about how to achieve certain things most easily, so this is also a bit of a summary of these discussions and the issues that we have discovered. Furthermore, some of the current problems mentioned here might appear in a similar fashion when used in an `uproot/awkward` context, so they are not exclusive to using RDataFrames.

Some of the things work quite nicely "out of the box" when using the edm4hep output files in an RDataFrame:
- [...] `Data` classes / branches. This works without any additional code, e.g. [...]

Usage from the Python side in this case can be a bit more cumbersome, but is still possible, and some of the "inconveniences" (see, e.g. here) are actively being worked on by the ROOT developers.
The problem
So as long as one only wants to work with the members that are stored in the PODs, things work pretty seamlessly. However, it starts to get tricky as soon as one wants to access `VectorMembers`, `OneToOneRelations` or `OneToManyRelations`, because in those cases one has to
- [...] `podio::ObjectID`s of the desired member.
- [...] (`begin` and `end` indices are stored for `VectorMembers` and `OneToManyRelations`)

It is not impossible to do this, but the necessary functionality becomes a bit unwieldy pretty quickly and composing different functions becomes very hard. For example, to get the PDG of an `MCParticle` related to a `ReconstructedParticle`, something along the lines of the following is necessary

[...]

Then to use it, e.g. from Python, you still have to do:

[...]
The `Alias` definitions are necessary because PyROOT does not gracefully handle `#` in branch names when they are used directly in calls to `Define`. However, there are several other issues with this approach:
- [...] `MCRecoParticleAssociation`s that we use here are in the same collection, as we only use the `index` of the `podio::ObjectID`.
- [...] `#` or `_` in the branch names (while keeping in mind that the order also depends on whether they are `OneToManyRelation` or `OneToOneRelation`).
- [...] `TrackerHit` from a `Track` that we get from a `ReconstructedParticle`, we have to know about all the involved collections, and also about the structure of the involved objects and relations. Additionally, we need all this information ready at a rather high level already and cannot really hide this in some abstraction without rather limiting assumptions on the structure (resp. naming scheme) of the input file.

Possible (steps towards a) solution
Solving the above issues can probably not be done with one single approach and will rather require a combination of several different things. @clementhelsens and I have been discussing several possible approaches, and this is just a list of the things that we have considered up until now. It is by far not complete, and there might be better approaches that we have not yet considered, but those can also be discussed here.
- [...] `uproot/awkward` based analysis frameworks. This would still allow handling the complicated relation navigation with all the facilities of the core datamodel. On the other hand, this would imply having another set of utilities that then works on this output format.
- [...] `podio` still handles all the relations before it is passed to RDataFrame. There is the possibility to define an `RDataSource` for RDataFrame that might be able to handle this. This could be a very elegant solution to this problem, with the caveat that it only works for RDataFrame. The advantage in this case could be that `podio` could potentially also do some additional code generation, which would make maintaining utility code a bit easier.

In the end it is something that I think we need to address somehow, as a lot of the "analysis level" code seems to be using Python and its libraries more and more, even though "framework code" will of course still use edm4hep in its full glory.
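Whichever approach is chosen, the index bookkeeping described under "The problem" is what it would have to hide from analysis code. The following is a rough sketch of that bookkeeping in plain Python, not the code used in the discussion above. The `ObjectID` tuple is a simplified stand-in for `podio::ObjectID`; the helper names, the collection IDs, the toy values and the assumption that `begin`/`end` describe a half-open range are all invented for illustration.

```python
from typing import List, NamedTuple, Sequence


class ObjectID(NamedTuple):
    """Simplified, hypothetical stand-in for podio::ObjectID."""
    index: int          # position of the referenced object in its collection
    collection_id: int  # identifies the collection the object lives in


def resolve_one_to_many(begin: int, end: int,
                        ref_ids: Sequence[ObjectID],
                        targets: Sequence) -> List:
    """Follow a one-to-many relation stored as a [begin, end) range of ObjectIDs.

    Assumption: `end` is one past the last reference that belongs to the object.
    """
    return [targets[ref.index] for ref in ref_ids[begin:end]]


def related_mc_pdgs(n_reco: int,
                    rec_ids: Sequence[ObjectID],
                    sim_ids: Sequence[ObjectID],
                    mc_pdgs: Sequence[int]) -> List[List[int]]:
    """For each reco particle, list the PDG codes of its associated MC particles.

    Only ObjectID.index is used, so this silently assumes that all referenced
    objects live in one collection (the limitation noted in the discussion).
    """
    pdgs: List[List[int]] = [[] for _ in range(n_reco)]
    for rec, sim in zip(rec_ids, sim_ids):
        pdgs[rec.index].append(mc_pdgs[sim.index])
    return pdgs


# Invented toy event: three MC particles, two reco particles, two associations.
mc_pdgs = [11, -11, 22]
rec_ids = [ObjectID(0, 1), ObjectID(1, 1)]  # "rec" side of each association
sim_ids = [ObjectID(2, 2), ObjectID(0, 2)]  # "sim" side of each association
reco_pdgs = related_mc_pdgs(2, rec_ids, sim_ids, mc_pdgs)

# Invented toy track whose hit references occupy entries 1 and 2 of hit_refs.
hits = ["hit0", "hit1", "hit2"]
hit_refs = [ObjectID(0, 7), ObjectID(2, 7), ObjectID(1, 7)]
track_hits = resolve_one_to_many(1, 3, hit_refs, hits)

print(reco_pdgs, track_hits)
```

Even in this toy form, the caller has to pass in every involved collection and reference list by hand, which is the composability problem raised above.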