samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

htsjdk doesn't seem to honour CRAM "RR" (reference required) header. #1216

Closed jkbonfield closed 2 years ago

jkbonfield commented 5 years ago

Verify

This is related to #721, but not quite the same bug.

That was failing due to MD5 mismatches in the slice. Adding a "R=$HUMAN_REF" option to the picard line turns this error into #721 instead. However the reference is not required and should not be needed to be specified. It's not for SAM reading, so it isn't a globally required option. It should not be required for all CRAM files either.

Subject of the issue

Reading a CRAM file that was produced using "scramble -x" (likely samtools view --output-fmt-option no_ref would do the same thing) cannot be read by picard. I haven't written my own code against htsjdk, but I'm assuming this is a library thing rather than an application issue. The error is:

Exception in thread "main" htsjdk.samtools.cram.CRAMException: Contig 1 not found in the reference file.
    at htsjdk.samtools.CRAMIterator.nextContainer(CRAMIterator.java:171)
    at htsjdk.samtools.CRAMIterator.hasNext(CRAMIterator.java:261)
    at htsjdk.samtools.SamReader$AssertingIterator.hasNext(SamReader.java:601)
    at picard.sam.SamFormatConverter.doWork(SamFormatConverter.java:85)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

See attached file: a.zip

This file has no M5 tags, as they are not required when using non-ref mode. The CRAM file has the RR field set to 0 in the container header. From cram_dump:

    Container_header block pos 265
      Preservation map:
        RR => 0 (0x0)
        RN => 1 (0x1)
        SM => 21549531 (0x148d1db)
        TD => 21549527 (0x148d1d7)
        AP => 1 (0x1)

Your environment

Steps to reproduce

My comand line is:

java -jar ~/work/picard-2.18.15.jar SamFormatConverter I=a.cram O=a.sam

Expected behaviour

The expected output is that it works, comparable to samtools view or scramble a.cram a.sam.

Actual behaviour

Failure!

cmnbroad commented 5 years ago

@jkbonfield I think this is the same underlying issue as #721, which is I always thought of as "htsjdk doesn't support embedded reference". But now I'm wondering, is "RR=0" the same thing as "reference embedded in a block", or is it something else ("no embedded reference, reads are stored with the full sequence and not reference-compressed") ? Not sure I can tell this from the spec, at least in the past - is this described there ?

jkbonfield commented 5 years ago

RR is "reference required". It is set to zero either for embedded reference or for referenceless mode, which was my use case above.

It may turn out to be the same underlying issue though. If so, then feel free to close this with a link to the other ticket.

cmnbroad commented 2 years ago

This was fixed as part of the 2020 refactoring branch, and is covered by tests in CRAMReferencelessTest class, among others.