Open · danking opened this issue 1 year ago
Something along these lines might work. The trouble comes when you get to replacing the functions in LocusFunctions.scala: there you need both set containment and interval containment, and set containment is currently implemented in terms of IR (see SetFunctions.contains). I'm not exactly sure of the easiest way to fix that. We can't reference Code values from IR, and I don't know how to compile the IR in-line like that.
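For reference, the check itself is simple once the data is in hand. Here is a plain-Scala sketch of the comparison the generated code would ultimately need to perform, assuming the set is stored as a sorted array; the name and signature are illustrative, not Hail's actual Code/IR API:

```scala
// Illustrative stand-in for SetFunctions.contains expressed directly as a
// loop rather than IR: a binary search over a sorted array of elements.
// Plain Scala; none of this is Hail's Code machinery.
def sortedSetContains(elems: IndexedSeq[String], query: String): Boolean = {
  var lo = 0
  var hi = elems.length - 1
  while (lo <= hi) {
    val mid = lo + (hi - lo) / 2
    val c = elems(mid).compareTo(query)
    if (c == 0) return true
    if (c < 0) lo = mid + 1 else hi = mid - 1
  }
  false
}
```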
(base) dking@wm28c-761 hail % g diff
diff --git a/hail/src/main/scala/is/hail/expr/ir/EmitClassBuilder.scala b/hail/src/main/scala/is/hail/expr/ir/EmitClassBuilder.scala
index 115df824b3..6e5ee81e6a 100644
--- a/hail/src/main/scala/is/hail/expr/ir/EmitClassBuilder.scala
+++ b/hail/src/main/scala/is/hail/expr/ir/EmitClassBuilder.scala
@@ -59,6 +59,37 @@ class EmitModuleBuilder(val ctx: ExecuteContext, val modb: ModuleBuilder) {
new StaticFieldRef(rgField)
}
+  // fields of one reference genome, lowered to module-level literals
+  class LoweredReferenceGenome(
+    val name: SStringPointerValue,
+    val contigs: SIndexablePointerValue,
+    val lengths: SIndexablePointerValue,
+    val xContigs: SIndexablePointerValue,
+    val yContigs: SIndexablePointerValue,
+    val mtContigs: SIndexablePointerValue,
+    val parInterval: SIntervalPointerValue
+  )
+
+  // cache: lower each reference genome's fields to literals at most once per module
+  private val loweredReferences: mutable.Map[String, LoweredReferenceGenome] = mutable.Map.empty
+
+ def getLoweredReferenceGenome(cb: EmitCodeBuilder, name: String): LoweredReferenceGenome = {
+ loweredReferences.getOrElseUpdate(name, {
+ val ecb = genEmitClass[Unit](s"lowered_reference_${name}")
+ val rg = ctx.getReference(name)
+ assert(rg.name == name)
+ new LoweredReferenceGenome(
+ ecb.addLiteral(cb, rg.name, VirtualTypeWithReq.fullyRequired(TString)).asInstanceOf[SStringPointerValue],
+ ecb.addLiteral(cb, rg.contigs, VirtualTypeWithReq.fullyRequired(TArray(TString))).asInstanceOf[SIndexablePointerValue],
+ ecb.addLiteral(cb, rg.lengths, VirtualTypeWithReq.fullyRequired(TArray(TInt32))).asInstanceOf[SIndexablePointerValue],
+ ecb.addLiteral(cb, rg.xContigs, VirtualTypeWithReq.fullyRequired(TSet(TString))).asInstanceOf[SIndexablePointerValue],
+ ecb.addLiteral(cb, rg.yContigs, VirtualTypeWithReq.fullyRequired(TSet(TString))).asInstanceOf[SIndexablePointerValue],
+ ecb.addLiteral(cb, rg.mtContigs, VirtualTypeWithReq.fullyRequired(TSet(TString))).asInstanceOf[SIndexablePointerValue],
+ ecb.addLiteral(cb, Interval(rg.parInput._1, rg.parInput._2), VirtualTypeWithReq.fullyRequired(TInterval(TLocus(rg.name)))).asInstanceOf[SIntervalPointerValue]
+ )
+    })
+  }
+
def referenceGenomes(): IndexedSeq[ReferenceGenome] = rgContainers.keys.map(ctx.getReference(_)).toIndexedSeq.sortBy(_.name)
def referenceGenomeFields(): IndexedSeq[StaticField[ReferenceGenome]] = rgContainers.toFastSeq.sortBy(_._1).map(_._2)
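To make the intended use concrete, here is a plain-Scala model of what a call site in LocusFunctions.scala would consume once the genome's fields are cached as literals. The builder machinery (EmitCodeBuilder, the S*PointerValue types) is replaced with ordinary Scala values, and all names below are illustrative:

```scala
// Simplified stand-in for the diff's LoweredReferenceGenome: same fields,
// ordinary Scala types instead of S*PointerValue wrappers.
final case class LoweredRG(
  name: String,
  contigs: IndexedSeq[String],
  lengths: IndexedSeq[Int],
  xContigs: Set[String],
  yContigs: Set[String],
  mtContigs: Set[String],
  parInterval: (String, Int, Int) // (contig, start, end); simplified to mirror the diff
)

// A check like isAutosomal then consults only the cached literals, with no
// per-genotype lookup of the runtime ReferenceGenome object.
def isAutosomal(rg: LoweredRG, contig: String): Boolean =
  !(rg.xContigs(contig) || rg.yContigs(contig) || rg.mtContigs(contig))
```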
FWIW, this pipeline was performing these checks perhaps as many as 10 times per genotype, which is obviously unreasonable. Nonetheless, sending the RG along as a literal should improve the speed of these operations.
What happened?
Consider this code:
A single partition is taking a very long time to compute. Manual sampling of stack traces via jstack or the Spark UI reveals we spend a lot of time computing the inPar predicates.

A few things:

- It seems like the right fix is for the ReferenceGenome's intervals to be shipped as literals so that we can perform inXPar or isAutosomal checks without allocating contig strings or locus objects (see the sketch after this list).
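A minimal sketch of that fast path, in plain Scala with illustrative names: represent a locus as a (contig index, position) pair and ship the PAR as literal integer triples, so the per-genotype check reduces to a few integer comparisons with no allocation (interval bounds are assumed half-open here):

```scala
// Hypothetical allocation-free inXPar: the X contig's index and the PAR
// intervals are shipped as literal integers; a locus is (contigIdx, pos).
final case class ParInterval(contigIdx: Int, start: Int, end: Int)

def inXPar(xContigIdx: Int, pars: Array[ParInterval], contigIdx: Int, pos: Int): Boolean = {
  if (contigIdx != xContigIdx) return false
  var i = 0
  while (i < pars.length) { // the PAR is only a handful of intervals
    val p = pars(i)
    if (p.contigIdx == contigIdx && p.start <= pos && pos < p.end) return true
    i += 1
  }
  false
}
```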
Version

0.2.124
Relevant log output
No response