The basketentryoffsets array accounts for a significant share of the
on-the-wire size of a serialized partition. Previous commits attempted
to lower the ser/de overhead by deduplicating/interning common
basketentryoffsets arrays, on the assumption that many baskets will
have the same offsets if they're stored in the same ROOT cluster.
Unfortunately, the first pass was buggy and led to the following NPE,
reported by @lgray against v0.5.1:
[I 17:04:40.609 NotebookApp] Saving file at /coffeandbacon/analysis/baconbits-spark.ipynb
19/11/15 17:05:08 WARN TaskSetManager: Lost task 166.0 in stage 213.0 (TID 5465, 10.130.30.80, executor 44): java.lang.NullPointerException
at edu.vanderbilt.accre.laurelin.spark_ttree.SlimTBranch.getBasketEntryOffsets(SlimTBranch.java:158)
at edu.vanderbilt.accre.laurelin.spark_ttree.TTreeColumnVector.<init>(TTreeColumnVector.java:35)
at edu.vanderbilt.accre.laurelin.Root$TTreeDataSourceV2PartitionReader.getBatchRecursive(Root.java:209)
at edu.vanderbilt.accre.laurelin.Root$TTreeDataSourceV2PartitionReader.get(Root.java:182)
at edu.vanderbilt.accre.laurelin.Root$TTreeDataSourceV2PartitionReader.get(Root.java:104)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.next(DataSourceRDD.scala:59)
Fix this by using a prebuilt interning implementation for the global
basketentryoffsets array, and a modified interning implementation for
SlimTBranch.SerializeStorage, which is used by the JVM to transmit
partitions from the driver to the executors.
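For readers unfamiliar with the technique, here is a minimal sketch of
content-based interning for offset arrays. It is not the code in this
commit: OffsetInterner is a hypothetical name, the long[] element type is
an assumption, and it uses only the JDK rather than whichever prebuilt
interner the commit adopts. The point it illustrates is that intern()
always hands back a usable array for a non-null input, so a lookup can
never come back null.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper, not Laurelin's actual class: deduplicates offset
// arrays by content so that equal arrays share one canonical instance.
final class OffsetInterner {
    // Java arrays use identity-based equals/hashCode, so wrap each array
    // in a key that compares by content instead.
    private static final class Key {
        private final long[] offsets;
        private final int hash;

        Key(long[] offsets) {
            this.offsets = offsets;
            this.hash = Arrays.hashCode(offsets);
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Key
                && Arrays.equals(offsets, ((Key) other).offsets);
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    private final Map<Key, long[]> canonical = new HashMap<>();

    // Returns the canonical array with the same contents as the argument.
    // The first array seen with given contents becomes the canonical one.
    public synchronized long[] intern(long[] offsets) {
        Objects.requireNonNull(offsets, "offsets must not be null");
        long[] existing = canonical.putIfAbsent(new Key(offsets), offsets);
        return existing != null ? existing : offsets;
    }

    // Number of distinct arrays currently held.
    public synchronized int size() {
        return canonical.size();
    }

    public static void main(String[] args) {
        OffsetInterner interner = new OffsetInterner();
        long[] first = interner.intern(new long[] {0, 100, 200});
        long[] second = interner.intern(new long[] {0, 100, 200});
        System.out.println(first == second); // true: one shared instance
        System.out.println(interner.size()); // 1
    }
}
```

Two caveats on this sketch: callers must treat interned arrays as
read-only, since mutating one would change it for every holder and
invalidate the cached hash; and the map holds strong references, so a
long-lived global interner would likely want weak references instead.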