samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

CRAM: better testing of unmapped reads #1236

Open jmthibault79 opened 5 years ago

jmthibault79 commented 5 years ago

Subject of the issue

PR #1233 fixes a bug in HTSJDK 2.18.0 which was observed when running GATK on an input CRAM with unmapped reads. We were able to narrow down the issue to the unmapped reads by observing that the problem still occurred by specifying -L UNMAPPED.

Add a test file to HTSJDK with these characteristics:

I have a large CRAM which demonstrates the problem but I have been unable to reduce it to its unmapped reads. This should be trivial, so we should also investigate why both of these attempts failed:

  1. Use a version of GATK (with the fix) to run PrintReads on a CRAM with -L UNMAPPED and produce CRAM output. Run GATK PrintReads on the result CRAM with -L UNMAPPED. Surprisingly, no reads are found. When re-running without -L UNMAPPED, all reads are seen. The Current Locus for these reads displays as "unmapped" in the ProgressMeter.

  2. Run samtools view -@4 -f 4 -C on a CRAM to produce CRAM output consisting of only those reads which have bit flag 0x04 set (unmapped). Run GATK PrintReads on the result CRAM. Unfortunately, this crashes with a stack trace.

    java.lang.ArrayIndexOutOfBoundsException
        at sun.security.provider.DigestBase.engineUpdate(DigestBase.java:105)
        at java.security.MessageDigest$Delegate.engineUpdate(MessageDigest.java:584)
        at java.security.MessageDigest.update(MessageDigest.java:325)
        at htsjdk.samtools.util.SequenceUtil.calculateMD5(SequenceUtil.java:917)
        at htsjdk.samtools.cram.structure.Slice.setRefMD5(Slice.java:162)
        at htsjdk.samtools.CRAMContainerStreamWriter.flushContainer(CRAMContainerStreamWriter.java:455)
        at htsjdk.samtools.CRAMContainerStreamWriter.finish(CRAMContainerStreamWriter.java:128)
        at htsjdk.samtools.CRAMFileWriter.finish(CRAMFileWriter.java:129)
        at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:215)
        at org.broadinstitute.hellbender.utils.read.SAMFileGATKReadWriter.close(SAMFileGATKReadWriter.java:26)
        at org.broadinstitute.hellbender.tools.PrintReads.closeTool(PrintReads.java:107)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:970)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)