root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern
Other
2.71k stars 1.28k forks source link

[DF] Distributed RDataFrame doesn't handle friend trees correctly #7584

Closed vepadulano closed 3 years ago

vepadulano commented 3 years ago

Describe the bug

Distributed RDataFrame has support for friend trees, but it seems something is missing. See the gist at https://gist.github.com/vepadulano/b42343bff7297958c46675577bce46a9 :

  1. Two RDF are created and one column is filled
  2. They are both snapshotted to disk and merged into a single file through hadd
  3. A TChain is created with one column from the merged file and a friend TChain is attached to it with the second column from the merged file
  4. A distributed RDataFrame with spark is created using the TChain as input, then one histogram per column is booked and drawn to a canvas
  5. The operation fails with
    TypeError: Template method resolution failed:
      none of the 4 overloaded methods succeeded. Full details:
      ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RRange<ROOT::Detail::RDF::RLoopManager>,void>::Histo1D(experimental::basic_string_view<char,char_traits<char> > vName) =>
        runtime_error: Unknown column: myfriend.rnd

Expected behavior

The program should not fail, in fact substituting the distributed rdataframe object with a plain rdataframe gives the correct output image

To Reproduce

  1. Source an environment with ROOT master
  2. download the linked gist
  3. python friendtrees_spark.py

Setup

Fedora 32 ROOT version: master Built from source

Additional context

Thanks to @Zeguivert for originally reporting this issue

eguiraud commented 3 years ago

Is the error message coming from the worker? I.e. the RDataFrame object built by the map function does not pick up the friend?

vepadulano commented 3 years ago

Yes this happens inside the mapper function, need to take a deeper look into it

eguiraud commented 3 years ago

The title mentions Spark, but I think it's more likely that the issue is independent of the backend, right? Can you point me to the code reponsible for detecting friendships and re-creating them in the workers? Maybe I can help

vepadulano commented 3 years ago

Actually you're right it's independent of Spark. The part where the tchain is reconstructed together with its friends on the workers is https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/bindings/experimental/distrdf/python/DistRDF/Backends/Base.py#L166-L185 . Whereas the information about friend trees is retrieved from the user provided TTree or TChain through this function https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/bindings/experimental/distrdf/python/DistRDF/Node.py#L766

eguiraud commented 3 years ago

tree.GetListOfFriends() should actually be tree.GetTree().GetListOfFriends() (see fa87d3c4b8328c), but that's probably not the cause of this issue (unless the friend is automatically loaded from file rather than added via AddFriend).

I don't see anything obviously wrong, adding a bunch of printouts in those two places should clarify what's going on (or breaking in with a debugger if it's possible through spark).

vepadulano commented 3 years ago

When passing from the TFriendElement returned by GetListOfFriends() to the underlying TChain we miss the name so the FriendInfo object that it's passed to the mapper function has the filename but not the tree name:

vpadulan@fedorathinkpad-T550 [~/Projects/rootcode/distrdf-friendtrees]: python friendtrees_spark.py
[...]
Retrieved tree Name:  Title: 
Retrieved friend trees { @0x7ffd020bad70 }
Here's the <class cppyy.gbl.TFriendElement at 0x55d6a9404200> --> Name: myfriend Title: 
Here's its name myfriend
Here's the <class cppyy.gbl.TChain at 0x55d6a8dc7340> --> Name:  Title: 
Here's its name 
Returning FriendInfo [''],[['friendtrees_spark.root']]
vepadulano commented 3 years ago

Now for this specific case the patch would be to use TFriendElement::GetName rather than TFriendElement::GetTreeName. But how many times do people create TChain without a name and then call TChain::AddFriend with an alias as in this case? Probably I need to add a couple more checks in the function that builds the friend info

eguiraud commented 3 years ago

Mmmh _get_friend_info might be a bit too simple, looks how it's done here: https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/tree/treeplayer/src/TTreeProcessorMT.cxx#L374

vepadulano commented 3 years ago

Yep, I'll definitely take a look at that :smile: Thanks!

eguiraud commented 3 years ago

how many times do people create TChain without a name and then call TChain::AddFriend with an alias as in this case?

both cases are common enough that we have to handle them

vepadulano commented 3 years ago

While working on this, I found another bug in the mapper code: https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/bindings/experimental/distrdf/python/DistRDF/Backends/Base.py#L156-L158

Here treename through https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/bindings/experimental/distrdf/python/DistRDF/Node.py#L703 , that only returns a single string that it's supposed to be the treename. But the reproducer in the linked gist uses the default constructor for TChain and then adds the filename/treename string in the following Add

chain = ROOT.TChain()
chainFriend = ROOT.TChain()

chain.Add("friendtrees_spark.root/randomNumbers")
chainFriend.Add("friendtrees_spark.root/randomNumbersBis")

So with the current code in distrdf treename is None. The more correct way to retrieve the treename would look probably like https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/tree/treeplayer/src/TTreeProcessorMT.cxx#L249 So that's another point in favor of having a common set of functions for retrieving information from the trees useful for range creation for multiple threads/nodes .

Furthermore, I'm thinking that a more proper logic for distrdf would be parse the arguments to DistRDF.RDataFrame to recognize which ROOT.RDataFrame constructor the user is trying to replicate. This way, in the mapper function it would be easier and more precise to make the single rdf objects for each task