spark-root / laurelin

Allows reading ROOT TTrees into Apache Spark as DataFrames
BSD 3-Clause "New" or "Revised" License

Notes from getting things working at FNAL #58

Open lgray opened 5 years ago

lgray commented 5 years ago

Finding the direction of a solution required fixing up #51 and #52, making TBranches much lazier objects, and moving around when TFiles are opened, among other things.
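For readers who have not looked at the diff, the "lazier objects" idea can be sketched roughly as follows. This is only an illustration of deferred loading, not the code in the branch: `Lazy`, `Branch`, `basketEntryOffsets` and `isMaterialized` are hypothetical names, and the supplier stands in for whatever actually opens the file and reads the branch metadata.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of lazy loading; none of these names come from laurelin. */
final class LazyBranchSketch {

    /** Memoizing wrapper: runs the supplier at most once, on the first get(). */
    static final class Lazy<T> {
        private Supplier<T> loader;
        private T value;
        private boolean loaded;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        synchronized T get() {
            if (!loaded) {
                value = loader.get();
                loaded = true;
                loader = null; // drop the supplier so captured state can be collected
            }
            return value;
        }

        synchronized boolean isLoaded() {
            return loaded;
        }
    }

    /** Stand-in for a branch whose metadata is expensive to read. */
    static final class Branch {
        final String name;
        private final Lazy<long[]> basketEntryOffsets;

        Branch(String name, Supplier<long[]> readOffsets) {
            this.name = name;
            this.basketEntryOffsets = new Lazy<>(readOffsets);
        }

        /** Triggers the (assumed expensive) read only when first asked. */
        long[] basketEntryOffsets() {
            return basketEntryOffsets.get();
        }

        boolean isMaterialized() {
            return basketEntryOffsets.isLoaded();
        }
    }

    public static void main(String[] args) {
        int[] reads = {0}; // counts how often the fake "file read" runs
        Branch b = new Branch("muon_pt", () -> {
            reads[0]++;
            return new long[] {0, 100, 200};
        });
        System.out.println(b.isMaterialized()); // false: nothing read yet
        b.basketEntryOffsets();
        b.basketEntryOffsets();
        System.out.println(reads[0]); // 1: the read happened only once
    }
}
```

The point of this shape is that constructing a branch object costs nothing until a task actually asks for its data, so branches that are never selected never touch the file.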

All the code edits (not so many in the end) are here, based on laurelin 0.3.0: https://github.com/spark-root/laurelin/compare/master...lgray:topic_scaleout_and_laziness?expand=1

I think there's still more to gain, but this yielded a nice 2x improvement when processing a 24 GB flat ntuple in our analysis, and it reduces "thread-joins" in the higher-level Spark processing workflow. Performance still needs to be compared against the Vandy cluster on root to establish some sort of baseline. I'm pretty sure more than 2x is possible.

Please note this (my) code is horrible and not at all well optimized, but is meant as an attempt to get things in the right places/shapes.

~There's one more exception I need to follow up on: somehow it's finding non-monotonic basket entries, but I thought I got that threaded through OK.~ This last exception has been fixed; the code wasn't properly dealing with the empty baskets you see in some files.
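To illustrate the empty-basket problem, here is a minimal sketch. It assumes an empty basket shows up as two consecutive equal entry offsets, which is my reading of the symptom rather than something confirmed from the laurelin code. `BasketEntrySketch` and its methods are hypothetical names, and the offsets array is assumed to hold one start offset per basket plus a final end offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not laurelin's actual basket handling. */
final class BasketEntrySketch {

    /** A strict check like this rejects files that contain empty baskets. */
    static boolean isStrictlyIncreasing(long[] offsets) {
        for (int i = 0; i + 1 < offsets.length; i++) {
            if (offsets[i + 1] <= offsets[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the indices of baskets holding at least one entry.
     * Equal neighbouring offsets (an empty basket, by assumption) are skipped;
     * a decreasing pair is still treated as a corrupt file.
     */
    static int[] nonEmptyBaskets(long[] offsets) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i + 1 < offsets.length; i++) {
            long count = offsets[i + 1] - offsets[i];
            if (count < 0) {
                throw new IllegalStateException("basket " + i + " has decreasing entry offsets");
            }
            if (count > 0) {
                keep.add(i);
            }
        }
        int[] out = new int[keep.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = keep.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 100, 100, 250}; // basket 1 is empty
        System.out.println(isStrictlyIncreasing(offsets)); // false
        System.out.println(Arrays.toString(nonEmptyBaskets(offsets))); // [0, 2]
    }
}
```

Filtering the empty baskets before the monotonicity check keeps the check useful for genuinely corrupt offsets while no longer tripping on files that merely contain zero-entry baskets.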

Tomorrow or Monday I'll post some notes on laurelin master not working. That one seems to be truncated arrays or something.

lgray commented 5 years ago

Found the big one: a factor of ten improvement, so 20x total now. I think we are in business.