samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
649 stars 174 forks source link

CRAM header #201

Closed jkbonfield closed 5 years ago

jkbonfield commented 7 years ago

Section 8.3 of the specification has several errors, but also things that the C and Java implementations have drifted apart on. We need to tighten this up.

  1. The SQ:MD5 checksum is required unless the reference sequence has been embedded into the file. Maybe. If I use scramble -x to produce a referenceless CRAM file then it also should operate without MD5sums surely?

This then segways into a totally different incompatibility. I wasn't able to validate whether C/Java interop works without M5 tags for referenceless encoding because of my use of MD5 zero in slice headers. The C code knows the slice reference is unused, therefore does not need validating. It does this by setting MD5 to zero. The Java code however checks this and fails:

Exception in thread "main" htsjdk.samtools.cram.CRAMException: Reference sequence MD5 mismatch for slice: sequence id 0, start 10001, span 33954, expected MD5 00000000000000000000000000000000
    at htsjdk.samtools.CRAMIterator.nextContainer(CRAMIterator.java:187)
    at htsjdk.samtools.CRAMIterator.hasNext(CRAMIterator.java:261)
    at htsjdk.samtools.SamReader$AssertingIterator.hasNext(SamReader.java:592)
    at picard.sam.SamFormatConverter.doWork(SamFormatConverter.java:83)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

This is perhaps a bug in my code. I'm way out beyond the spec here (although I thought we discussed and changed it, but apparently that was just on my to-do list rather than a reality!) because there is no mention of zero in that situation. We so have a specification that 0 is used for unmapped or unsorted data, so this is a minor tweak. Is it a useful addition though?

What does picard.jar do when it creates referenceless CRAM files?

  1. "At least one RG record is required."

This is not true. C and Java implementations can read and write CRAM without RG tags. (It was true for version 1.0 I believe.)

  1. "The HD:SO sort order is always POS."

Firstly, POS is not a valid sort order; it is "coordinate" instead. Secondly we all support unsorted and name sorted data too, so the statement is incorrect.

Although name sorted data in picard is "interesting". With SO:queryname it strictly enforces one interpretation of name sorting, so A100 is before A99. This is not the same name sort order than samtools uses, so the two are incompatible. Perhaps the SAM spec needs tightening here as it has no definition of what sort actually means and whether it is an intelligent sort or not.

jmarshall commented 7 years ago

What SO:queryname is supposed to mean is #5.

jkbonfield commented 7 years ago

Thanks for the link. It's a bit of an aside anyway and I probably shouldn't have gone down that particular rabbit hole. :-) The particular issue here is that CRAM doesn't have a requirement of position sorted data so the text is invalid.