hammerlab / guacamole

Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
Apache License 2.0
84 stars 21 forks source link

Multiple reference bases found in sample #302

Closed danvk closed 9 years ago

danvk commented 9 years ago

When I try to run germline-threshold on brody_307-bcell-tumor-bwa-mem-gap11-gep4-dedup-indelrealigned-bqsr.bam, I eventually get the following error:

java.lang.IllegalArgumentException: Multiple reference bases found in sample = SM at (chr, pos) = (17, 4455004)

Full command line:

spark-submit --master yarn --deploy-mode cluster --driver-java-options -Dlog4j.configuration=scripts/log4j.properties --executor-memory 4g --driver-memory 10g --num-executors 1000 --executor-cores 1 --class org.hammerlab.guacamole.Guacamole --verbose target/guacamole-with-dependencies-0.0.1-SNAPSHOT.jar germline-threshold --reads hdfs:///user/ahujaa01/brody-lymphoma/brody_307/brody_307-bcell-tumor-bwa-mem-gap11-gep4-dedup-indelrealigned-bqsr.bam --out hdfs:///user/vanded03/guac-test.vcf

See http://demeter-mgmt1.demeter.hpc.mssm.edu:8088/cluster/app/application_1432740718700_2196

@arahuja says that this shouldn't be happening.

arahuja commented 9 years ago

Looks like a read at that position has an N-base, AGGTGTGCAGAGCCAAAGGAGCAGATGAGCGAGGAGGTGAGNAGGAGCAN so we say the reference is possibly N or whatever the other reads say are the possible reference bases.

arahuja commented 9 years ago

Full stack trace:

User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 31 in stage 4.0 failed 4 times, most recent failure: Lost task 31.3 in stage 4.0 (TID 517, demeter-csmau08-13.demeter.hpc.mssm.edu): java.lang.IllegalArgumentException: Multiple reference bases found in sample = SM at (chr, pos) = (17, 4455004)
at org.hammerlab.guacamole.commands.GermlineThreshold$Caller$$anonfun$callVariantsAtLocus$1.apply(GermlineThresholdCaller.scala:165)
at org.hammerlab.guacamole.commands.GermlineThreshold$Caller$$anonfun$callVariantsAtLocus$1.apply(GermlineThresholdCaller.scala:98)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.hammerlab.guacamole.commands.GermlineThreshold$Caller$.callVariantsAtLocus(GermlineThresholdCaller.scala:98)
at org.hammerlab.guacamole.commands.GermlineThreshold$Caller$$anonfun$1.apply(GermlineThresholdCaller.scala:76)
at org.hammerlab.guacamole.commands.GermlineThreshold$Caller$$anonfun$1.apply(GermlineThresholdCaller.scala:75)
at org.hammerlab.guacamole.DistributedUtil$$anonfun$pileupFlatMap$1.apply(DistributedUtil.scala:292)
at org.hammerlab.guacamole.DistributedUtil$$anonfun$pileupFlatMap$1.apply(DistributedUtil.scala:289)
at org.hammerlab.guacamole.DistributedUtil$$anonfun$windowFlatMapWithState$1$$anonfun$apply$5.apply(DistributedUtil.scala:368)
at org.hammerlab.guacamole.DistributedUtil$$anonfun$windowFlatMapWithState$1$$anonfun$apply$5.apply(DistributedUtil.scala:363)
at org.hammerlab.guacamole.DistributedUtil$$anonfun$collectByContig$1.apply(DistributedUtil.scala:443)
at org.hammerlab.guacamole.DistributedUtil$$anonfun$collectByContig$1.apply(DistributedUtil.scala:440)
arahuja commented 9 years ago

This is specific to the GermlineThreshold caller, which is specifically looking for specific sets of alternates and references.

              } else {
                throw new IllegalArgumentException(
                  "Multiple reference bases found in sample = %s at (chr, pos) = (%s, %d)"
                    .format(sampleName, samplePileup.referenceName, samplePileup.locus)
                )
              }

What's curious is we are not hitting the previous condition in it

case (allele1, count1) :: (allele2, count2) :: rest => {
              if (allele1.refBases == Seq(Bases.N) || allele2.refBases == Seq(Bases.N)) {
                log.warn("Reference base N found and ignored in sample = %s at (chr, pos) = (%s, %d)"
                  .format(sampleName, samplePileup.referenceName, samplePileup.locus))
                // Find the non-N reference
                val properReference = if (allele1.refBases == Seq(Bases.N)) allele2.refBases else allele1.refBases
                variant(
                  Allele(properReference, Bases.ALT),
                  Ref :: Ref :: Nil
                ) :: Nil
              }
danvk commented 9 years ago

I get a similar error for /hdfs/datasets/dream/data/training/normal.chr20.withMDTags.bam, this time at 20:60823211. I don't see any Ns there.

danvk commented 9 years ago

I think the root cause is a deletion. When I print out the two alleles for a toy BAM, I see this:

(Allele(A,A),18)
(Allele(,),4)

and this in IGV:

screen shot 2015-06-23 at 3 52 29 pm