root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

Snapshot duplicates columns when they have an invalid name and get redefined #13846

Open mmaneyro opened 1 year ago

mmaneyro commented 1 year ago

Check duplicate issues.

Description

Behavior: Snapshot warns that an illegally named column will be renamed when writing to file. The column then appears twice in the output, once under the new name and once under the original one. The renamed leaves also appear outside their original branch.

Expected behavior: Only the renamed column appears in the saved tree, respecting the original tree structure.

Reproducer

//Dicts for the file structure
gSystem->Load("$HOME/progs/ExRootAnalysis/libExRootAnalysis.so");

auto df = ROOT::RDataFrame("LHEF", "pp_2j_LO_H_T_35GeV.root");

//redefinition of column with unsupported name

auto add_func_call_int = [](ROOT::VecOps::RVec<int> inputArray1, ROOT::VecOps::RVec<int> inputArray2) {
    auto Array3 = inputArray1 + inputArray2;
    return Array3;
};

auto df2 = df.Redefine("Event.Nparticles",add_func_call_int,{"Event.Nparticles","Event.Nparticles"});

df2.Snapshot("LHEF", "out_snapshot.root");

std::unique_ptr<TFile> file1{TFile::Open("out_snapshot.root")};
TTree *tree1 = file1->Get<TTree>("LHEF"); // Get<TTree> already returns a TTree*, no cast needed
tree1->Print();

//Info in <Snapshot>: Column Event.Nparticles will be saved as Event_Nparticles
//Warning in <TTree::Bronch>: Using split mode on a class: TRootWeight with a custom Streamer

// Print() shows the column Event_Nparticles (renamed), but the original is also written to the file as Event.Nparticles

//Redefining with a jitted expression instead, by doing
auto df2 = df.Redefine("Event.Nparticles", "Event.Nparticles+Event.Nparticles");
//for example, just gives 
//Error in <TRint::HandleTermInput()>: std::runtime_error caught: RDataFrame::Redefine: cannot define variation "Event.Nparticles". Not a valid C++ variable name.

pp_3j_LO_H_T_2_35GeV.root.tar.gz
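
Editor's note: the two messages in the reproducer point at the same root cause, namely that "Event.Nparticles" contains a dot and so is not a valid C++ variable name. The sketch below illustrates that check and the dot-to-underscore rename reported by the Snapshot Info message. Both helper names are hypothetical and this is plain standard C++, not ROOT's actual implementation.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true if `name` consists only of letters, digits and
// underscores and does not start with a digit. C++ keywords are not checked.
bool isValidIdentifier(const std::string &name)
{
    if (name.empty() || std::isdigit(static_cast<unsigned char>(name[0])))
        return false;
    for (unsigned char c : name) {
        if (!std::isalnum(c) && c != '_')
            return false;
    }
    return true;
}

// Hypothetical helper: mimic the rename reported by Snapshot
// ("Event.Nparticles" saved as "Event_Nparticles") by replacing dots.
std::string sanitizedColumnName(std::string name)
{
    for (char &c : name) {
        if (c == '.')
            c = '_';
    }
    return name;
}
```

Under these assumptions, "Event.Nparticles" fails the identifier check while its sanitized form passes it, which matches the two names seen in the output tree.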

ROOT version

ROOT 6.28/00

Installation method

built from source

Operating system

Linux Mint 21.1 Cinnamon

Additional context

No response

vepadulano commented 1 year ago

Dear @mmaneyro ,

Thank you for your report. I take it that Event.Nparticles is a sub-branch of the Event branch. What you describe is not really surprising, as Redefine is meant to substitute the values of the full column of the RDataFrame (column==branch). The difference in behaviour between non-jitted and jitted code is more surprising though. As a fast workaround, you could be more explicit about the columns you want to save in your output TTree by adding the list of column names to the Snapshot call:

auto snap = df2.Snapshot("LHEF", "out_snapshot.root", {"Event.Nparticles"});

In order for me to better reproduce your problem though, I believe I would also need some instructions on how to generate the dictionaries for the classes in your file. Meanwhile, I can try to come up with a simpler reproducer, but having also your scenario would help.

Cheers, Vincenzo

mmaneyro commented 1 year ago

Dear Vincenzo,

I have already managed to work with the redefined trees I need, just with a number of workarounds.

The tree files in this case are generated from Les Houches event files using the ExRootLHEFConverter from ExRootAnalysis. As such, the branches are custom classes, which can be found in the ExRootAnalysis source files. I can't actually snapshot individual columns without getting an error, as there are TClonesArray column headers which specify the structure of the branches. The obvious fix would be to snapshot the column plus the header, but that also gives me an error.

I understand that Redefine is ideally used for whole columns; however, I need to be able to apply different redefinitions to different leaves within a branch. Do RDataFrames just not support rewriting leaves/nested columns? The columns seem to actually be doing what I'd like before snapshotting.

It seems like there's not a simple solution where I get to benefit from using RDataFrame and keep the tree structure untouched. I need to be able to add rows of data to each entry within a leaf (I'm actually concatenating multiple trees), and TTrees don't allow this as far as I can see. I guess I could define a new TTree by hand, set up the branches and fill new arrays from my original trees with the redefinitions I need (just by iterating over every entry and data value). But then I'm still changing my TTree structure, as with Snapshot. Maybe next time I'll just start by rewriting ExRootLHEFConverter to take the data from two .lhe files, or just stick to TTrees, but to be fair this project has been my first attempt at using ROOT/C++. You code, you learn!
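
Editor's note: the per-entry concatenation described above ("add rows of data to each entry within a leaf") can be sketched independently of ROOT. The block below is a minimal illustration in standard C++ only. `LeafColumn` and `concatenatePerEntry` are hypothetical names, the values are assumed to be doubles, and both inputs are assumed to hold the same number of entries. It does not read or write TTrees.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for one leaf: one array of values per entry (event).
using LeafColumn = std::vector<std::vector<double>>;

// Append the values of `second` to those of `first`, entry by entry.
// Assumes both inputs describe the same number of entries.
LeafColumn concatenatePerEntry(const LeafColumn &first, const LeafColumn &second)
{
    if (first.size() != second.size())
        throw std::invalid_argument("both inputs must have the same number of entries");

    LeafColumn merged;
    merged.reserve(first.size());
    for (std::size_t i = 0; i < first.size(); ++i) {
        std::vector<double> values = first[i]; // copy the entry from the first input
        values.insert(values.end(), second[i].begin(), second[i].end());
        merged.push_back(std::move(values));
    }
    return merged;
}
```

In a real workflow the merged arrays would still have to be written back into branches, which is the step where the tree structure question discussed in this thread comes up.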

What I am trying to do may be a bit of a niche use case, but I hope some of what I wrote is useful to you.

Regards, Marina