spark-root / laurelin

Allows reading ROOT TTrees into Apache Spark as DataFrames
BSD 3-Clause "New" or "Revised" License

Simplify/Fix basketentryoffset processing #76

Closed PerilousApricot closed 4 years ago

PerilousApricot commented 4 years ago

A significant amount of the on-the-wire space used to define partitions is consumed by the basketEntryOffsets array. Previous commits attempted to lower the ser/de overhead by deduplicating/interning common basketEntryOffsets arrays, on the assumption that many baskets will share the same offsets if they are stored in the same ROOT cluster.
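For illustration, "interning" here means mapping every offsets array with the same contents to one canonical instance, so duplicates are stored and serialized only once. A minimal sketch of that idea, with hypothetical class and method names that are not the actual Laurelin API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of interning long[] basket-entry-offset arrays so that partitions
// sharing the same offsets reference a single canonical copy.
// OffsetInterner and its methods are illustrative, not Laurelin code.
public class OffsetInterner {
    // Wrapper giving long[] value-based equals/hashCode, needed for map keys
    // (bare arrays compare by identity, which would defeat deduplication)
    private static final class Key {
        final long[] arr;
        Key(long[] arr) { this.arr = arr; }
        @Override public boolean equals(Object o) {
            return (o instanceof Key) && Arrays.equals(arr, ((Key) o).arr);
        }
        @Override public int hashCode() { return Arrays.hashCode(arr); }
    }

    private final Map<Key, long[]> pool = new HashMap<>();

    // Return the canonical instance for this offsets array
    public synchronized long[] intern(long[] offsets) {
        return pool.computeIfAbsent(new Key(offsets), k -> offsets);
    }

    public static void main(String[] args) {
        OffsetInterner interner = new OffsetInterner();
        long[] a = {0, 100, 200};
        long[] b = {0, 100, 200};   // equal contents, distinct object
        System.out.println(interner.intern(a) == interner.intern(b)); // true
    }
}
```

The savings come from the canonical instance being written once per serialization graph instead of once per branch.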

Unfortunately, the first pass was flawed and led to the following NPE, reported by @lgray on v0.5.1:

[I 17:04:40.609 NotebookApp] Saving file at /coffeandbacon/analysis/baconbits-spark.ipynb
19/11/15 17:05:08 WARN TaskSetManager: Lost task 166.0 in stage 213.0 (TID 5465, 10.130.30.80, executor 44): java.lang.NullPointerException
        at edu.vanderbilt.accre.laurelin.spark_ttree.SlimTBranch.getBasketEntryOffsets(SlimTBranch.java:158)
        at edu.vanderbilt.accre.laurelin.spark_ttree.TTreeColumnVector.<init>(TTreeColumnVector.java:35)
        at edu.vanderbilt.accre.laurelin.Root$TTreeDataSourceV2PartitionReader.getBatchRecursive(Root.java:209)
        at edu.vanderbilt.accre.laurelin.Root$TTreeDataSourceV2PartitionReader.get(Root.java:182)
        at edu.vanderbilt.accre.laurelin.Root$TTreeDataSourceV2PartitionReader.get(Root.java:104)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.next(DataSourceRDD.scala:59)

Fix this by using a prebuilt interning implementation for the global basketEntryOffsets array, and a modified interning implementation for SlimTBranch.SerializeStorage, which is used by the JVM to transmit partitions from the driver to the executors.
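One standard way to hook interning into JVM serialization of partition state is the readResolve hook, which canonicalizes the deserialized object before it is handed back to the caller. A hedged sketch of that pattern; the class name, field name, and intern pool below are illustrative, not the actual SlimTBranch.SerializeStorage implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: re-intern the offsets array on the receiving JVM after
// deserialization, so identical offsets from many branches collapse to one
// instance per executor instead of one per partition.
public class SerializeStorageSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // Per-JVM intern pool, keyed by a List<Long> view with value equality
    private static final ConcurrentHashMap<List<Long>, long[]> POOL =
            new ConcurrentHashMap<>();

    long[] basketEntryOffsets;

    public SerializeStorageSketch(long[] offsets) {
        this.basketEntryOffsets = offsets;
    }

    private static long[] intern(long[] arr) {
        List<Long> key = new ArrayList<>();
        for (long v : arr) key.add(v);
        return POOL.computeIfAbsent(key, k -> arr);
    }

    // JVM serialization hook: runs after readObject, canonicalizes offsets
    private Object readResolve() {
        basketEntryOffsets = intern(basketEntryOffsets);
        return this;
    }

    // Helper for demonstration: serialize and deserialize one instance
    static SerializeStorageSketch roundTrip(SerializeStorageSketch s) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(s);
        oos.flush();
        ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        return (SerializeStorageSketch) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        SerializeStorageSketch a = roundTrip(new SerializeStorageSketch(new long[]{0, 50, 100}));
        SerializeStorageSketch b = roundTrip(new SerializeStorageSketch(new long[]{0, 50, 100}));
        // Both deserialized copies share one canonical array
        System.out.println(a.basketEntryOffsets == b.basketEntryOffsets); // true
    }
}
```

The key property is that interning happens on the receiving side: the wire still carries a copy per serialized object, but memory on the executor holds only the canonical instance, and a null can never be substituted for a valid array.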