Automatic conversion of TClonesArray in RDF

pieterdavid commented 3 years ago

Explain what you would like to see improved

Reading Delphes trees (which use TClonesArray) with RDF is a bit complicated (see below) due to the need to cast elements to their concrete type, or convert the TClonesArray to a type-aware container (for reference: first brought up in https://root-forum.cern.ch/t/reading-tclonesarray-in-jitted-rdf/45784, thanks to @eguiraud for explaining why my naive code did not work).

Optional: share how it could be improved

The type information is stored, so an automatic conversion could be done.

To Reproduce

import ROOT as gbl
gbl.gSystem.Load("/afs/cern.ch/user/p/piedavid/public/delphes_bamboo/DelphesIO/libDelphesIO.so")
gbl.Electron
f = gbl.TFile.Open("/afs/cern.ch/user/a/atalierc/public/snowmass/delphes_lhe_file_47_14TeV.lhe.root")
tree = f.Get("Delphes")
df = gbl.ROOT.RDataFrame(tree)
df.Define("leadElPT", "Electron[0]->PT")

where the last statement needs this to work (the typedef to distinguish the column and type name):

gbl.gInterpreter.Declare("using delphes_electron = Electron;")
df.Define("leadElPT1", "static_cast<delphes_electron*>(Electron[0])->PT")

but for integration in an analysis framework, converting the container is more convenient. What I came up with is this:

#include <TObjArray.h>
#include <ROOT/RVec.hxx>

namespace rdfhelpers {
  template<typename OBJ>
  ROOT::VecOps::RVec<OBJ*> objArrayToRVec(const TObjArray& arr) {
    ROOT::VecOps::RVec<OBJ*> out{reinterpret_cast<OBJ**>(arr.GetObjectRef()), static_cast<std::size_t>(arr.GetEntries())};
    auto it = *dynamic_cast<TObjArrayIter*>(arr.MakeIterator());
    std::size_t i = 0;
    while ( ( i != out.size() ) && ( out[i] == *it ) ) {
      it.Next();
      ++i;
    }
    if ( i != out.size() ) { // not equal to contiguous array, copy pointers
      it.Reset();
      const auto n = out.size();
      out = ROOT::VecOps::RVec<OBJ*>();
      out.reserve(n);
      do {
        out.push_back(static_cast<OBJ*>(*it));
      } while ( it.Next() );
    }
    return out;
  }
};

then I can add a set of defines and use those containers instead:

df = df.Define("myElectrons", "rdfhelpers::objArrayToRVec<delphes_electron>(Electron)")
df.Define("leadElPT", "myElectrons[0]->PT")

Setup

ROOT 6.24/00 from LCG_100 (x86_64-centos7-gcc10-opt) on lxplus

CC: @jnsandhya @selvaggi @pavel-demin (CMS upgrade performance study tools experts and Delphes maintainers, they may be interested as well)

Axel-Naumann commented 3 years ago

Side note: we really need to help Delphes move away from TClonesArray... Let's see whether we can cook up a PR over summer, something that works well with today's software world :-)

pavel-demin commented 3 years ago

As a developer and maintainer of Delphes, I am quite happy with TClonesArray. As far as I know, storing TClonesArray in TTree was the fastest and least crash-prone method for all ROOT 5 versions and early ROOT 6 versions.

If there is a newer and better method that meets all of the Delphes requirements, I would be interested in seeing and evaluating a few examples.

Before doing any pull requests, let's start with something more basic. I think a very basic example showing how to write/read a collection of generated particles to/from a file would be a good starting point.

something that works well with today's software world :-)

Newer doesn't always mean better :smiley: Especially in the HEP community where some researchers are still happy with CERNLIB and PAW and do not want to migrate to ROOT :smiley:

At the moment, Delphes maintains compatibility with ROOT 5 and I still receive questions from researches who use Delphes with ROOT 5. So, Delphes is still regularly tested with Ubuntu 14.04, GCC 4.8 and ROOT 5.

If the new IO method only works with recent versions of ROOT 6 and requires the change of file format and programming interface, it will be a major change for Delphes.

Axel-Naumann commented 3 years ago

Understood, and it's mostly outside this issue anyway. Let me first do my homework: dig out the code that we discussed a couple of years ago (?) and remind myself what Delphes does. I'll contact you then so we can discuss a proposal. And I'll make sure it works just fine for ROOT 5, too. And btw we - on the ROOT side - see several "lost souls" due to Delphes and its TClonesArray, so we might have some common motivation to evolve ;-) even if you feel ROOT 5 still has a relevant user community. I'll let contact you, @pavel-demin !

eguiraud commented 3 years ago

Regarding the titular issue, I have been thinking about it for a bit and I don't think we can do much better than what @pieterdavid already implemented.

Solutions considered:

Implicitly converting TClonesArrays to RVecs would be a) backward-incompatible and b) a silent perf degradation, as it requires a copy. Moreover, if we did this I don't see a migration path that doesn't require that users revise all their TClonesArrays-related logic
Adding a TClonesArrays2RVec helper function: it would require a copy (or it would have to return a clunky RVec<T*>). It is also trivial to implement for users that need it and don't mind the downsides.
Adding a RDF toggle to tweak "read TClonesArrays as RVecs" or not: requires adding a bunch of extra logic to the column-reading mechanism for a feature that might end up being mostly unused (modern data models don't use TClonesArrays), and it would require the extra copy (i.e. silent performance degradation)

The best seems to be 3, but feature toggles complicate internals and are typically not easy to discover (might end up mostly unused). It seems simpler to ask users to explicitly do the conversion in a Redefine if they need/want to.

Sorry I could not come up with something nice :confused:

I will close this in a few days unless people have something against it.

Axel-Naumann commented 3 years ago

Another ingredient here is that we have seen mass migration away from TClonesArray, with Delphes being the only relevant + known remaining user.

pieterdavid commented 3 years ago

Thanks a lot for looking into this @eguiraud ! Fine with me to close then. In case it's useful for anyone who finds this in the future: I implemented the conversion to RVec<T*> for bamboo here, but will probably leave that as a standalone example since it now seems more likely that my colleagues will end up using a reduced flat format.

root-project / root