hammerlab / guacamole

Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
Apache License 2.0
83 stars 21 forks source link

Ensure test BAM files are valid #370

Open hammer opened 8 years ago

timodonnell commented 8 years ago

One file we need to fix is:

https://github.com/hammerlab/guacamole/blob/continue-somatic-joint/src/test/resources/illumina-platinum-na12878-extract/NA12878.10k_variants.plus_chr1_3M-3.1M.chr_fixed.bam

When running from that branch (continue-somatic-joint), we're getting:

(analysis-venv-2.7)[tim@Tims-MacBook ~/sinai/git/guacamole]$ scripts/guacamole somatic-joint /Users/tim/sinai/git/guacamole/target/scala-2.10.3/test-classes/illumina-platinum-na12878-extract/NA12878.10k_variants.plus_chr1_3M-3.1M.chr_fixed.bam --reference-fasta /Users/tim/sinai/data/ucsc.hg19.fasta.gz --loci chr1:0-6700000
Using most recently modified jar: target/guacamole-with-dependencies-0.0.1-SNAPSHOT.jar
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
--> [Fri Jan 08 12:07:08 EST 2016]: Guacamole starting.
2016-01-08 12:07:08 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Running on 1 inputs:
<Input '/Users/tim/sinai/git/guacamole/target/scala-2.10.3/test-classes/illumina-platinum-na12878-extract/NA12878.10k_variants.plus_chr1_3M-3.1M.chr_fixed.bam' of normal dna at /Users/tim/sinai/git/guacamole/target/scala-2.10.3/test-classes/illumina-platinum-na12878-extract/NA12878.10k_variants.plus_chr1_3M-3.1M.chr_fixed.bam >
--> [110.58 sec. later]: Using samtools without BAM index to read: /Users/tim/sinai/git/guacamole/target/scala-2.10.3/test-classes/illumina-platinum-na12878-extract/NA12878.10k_variants.plus_chr1_3M-3.1M.chr_fixed.bam
--> [0.27 sec. later]: Splitting loci by region depth among 1 tasks using 250 micro partitions.
--> [0.00 sec. later]: Splitting loci evenly among 250 tasks = ~26,800 loci per task
--> [0.02 sec. later]: Done calculating micro partitions.
--> [0.02 sec. later]: Collecting region counts for RDD 1 of 1.
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 80, Read name HSQ1004:134:C0D8DACXX:3:1302:16380:148845, Mate Alignment start should be 0 because reference name = *.
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:439)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:644)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:629)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:599)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:42)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$class.toStream(Iterator.scala:1143)
    at scala.collection.AbstractIterator.toStream(Iterator.scala:1157)
    at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
    at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)

@e5c any interest in taking a look at how we might fix this BAM?

timodonnell commented 8 years ago

Looks like we can get around these errors with:

SamReaderFactory.setDefaultValidationStringency(ValidationStringency.LENIENT)

So this ticket is much lower priority. We should focus on getting a good somatic test case (both small for local testing and large for the cluster) put together and getting the germline calling component of somatic-join running on the cluster. Updated my branch with this fix. The tests now run for me in that branch, but you'll need to change the path to the reference fasta (too big to put it in the repo).