Closed vepadulano closed 3 years ago
Is the error message coming from the worker? I.e. the RDataFrame object built by the map function does not pick up the friend?
Yes this happens inside the mapper function, need to take a deeper look into it
The title mentions Spark, but I think it's more likely that the issue is independent of the backend, right? Can you point me to the code reponsible for detecting friendships and re-creating them in the workers? Maybe I can help
Actually you're right it's independent of Spark. The part where the tchain is reconstructed together with its friends on the workers is https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/bindings/experimental/distrdf/python/DistRDF/Backends/Base.py#L166-L185 . Whereas the information about friend trees is retrieved from the user provided TTree or TChain through this function https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/bindings/experimental/distrdf/python/DistRDF/Node.py#L766
tree.GetListOfFriends()
should actually be tree.GetTree().GetListOfFriends()
(see fa87d3c4b8328c), but that's probably not the cause of this issue (unless the friend is automatically loaded from file rather than added via AddFriend
).
I don't see anything obviously wrong, adding a bunch of printouts in those two places should clarify what's going on (or breaking in with a debugger if it's possible through spark).
When passing from the TFriendElement
returned by GetListOfFriends()
to the underlying TChain
we miss the name so the FriendInfo
object that it's passed to the mapper function has the filename but not the tree name:
vpadulan@fedorathinkpad-T550 [~/Projects/rootcode/distrdf-friendtrees]: python friendtrees_spark.py
[...]
Retrieved tree Name: Title:
Retrieved friend trees { @0x7ffd020bad70 }
Here's the <class cppyy.gbl.TFriendElement at 0x55d6a9404200> --> Name: myfriend Title:
Here's its name myfriend
Here's the <class cppyy.gbl.TChain at 0x55d6a8dc7340> --> Name: Title:
Here's its name
Returning FriendInfo [''],[['friendtrees_spark.root']]
Now for this specific case the patch would be to use TFriendElement::GetName
rather than TFriendElement::GetTreeName
. But how many times do people create TChain
without a name and then call TChain::AddFriend
with an alias as in this case? Probably I need to add a couple more checks in the function that builds the friend info
Mmmh _get_friend_info might be a bit too simple, looks how it's done here: https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/tree/treeplayer/src/TTreeProcessorMT.cxx#L374
Yep, I'll definitely take a look at that :smile: Thanks!
how many times do people create TChain without a name and then call TChain::AddFriend with an alias as in this case?
both cases are common enough that we have to handle them
While working on this, I found another bug in the mapper code: https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/bindings/experimental/distrdf/python/DistRDF/Backends/Base.py#L156-L158
Here treename
through https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/bindings/experimental/distrdf/python/DistRDF/Node.py#L703 , that only returns a single string that it's supposed to be the treename. But the reproducer in the linked gist uses the default constructor for TChain and then adds the filename/treename string in the following Add
chain = ROOT.TChain()
chainFriend = ROOT.TChain()
chain.Add("friendtrees_spark.root/randomNumbers")
chainFriend.Add("friendtrees_spark.root/randomNumbersBis")
So with the current code in distrdf treename
is None
. The more correct way to retrieve the treename would look probably like https://github.com/root-project/root/blob/eb5bcf0f0d79b0011ad4f2b8a38df0b6be4ee9a0/tree/treeplayer/src/TTreeProcessorMT.cxx#L249
So that's another point in favor of having a common set of functions for retrieving information from the trees useful for range creation for multiple threads/nodes .
Furthermore, I'm thinking that a more proper logic for distrdf would be parse the arguments to DistRDF.RDataFrame
to recognize which ROOT.RDataFrame
constructor the user is trying to replicate. This way, in the mapper function it would be easier and more precise to make the single rdf objects for each task
Describe the bug
Distributed RDataFrame has support for friend trees, but it seems something is missing. See the gist at https://gist.github.com/vepadulano/b42343bff7297958c46675577bce46a9 :
hadd
Expected behavior
The program should not fail, in fact substituting the distributed rdataframe object with a plain rdataframe gives the correct output image
To Reproduce
python friendtrees_spark.py
Setup
Fedora 32 ROOT version: master Built from source
Additional context
Thanks to @Zeguivert for originally reporting this issue