Modify SAM spec document to better represent restrictions in the SEQ field

Psy-Fer commented 4 days ago

Currently, the sam spec document (https://samtools.github.io/hts-specs/SAMv1.pdf) has the following for the mandatory SEQ field

10 SEQ String *|[A-Za-z=.]+ segment SEQuence

This doesn't really match up with the IUPAC base restriction within the BAM spec, and so anything that is not IUPAC is converted to N.

This isn't clear in the document.

I propose this should be updated to avoid confusion.

Cheers, James

jkbonfield commented 4 days ago

Agreed. I think it was rather lazy to avoid spelling out the exact list of accepted characters. It would be interesting to test what other SAM parsers do when it comes to non-IUPAC codes.

I should also note that IUPAC was explicitly designed with U as one of the possible base types. You'll note there are common themes in some of the codes. Weak vs Strong (W/S), Purine and Pyridmine (R, Y), etc. One of themes is not-A (C,G,T), not-C, not-G, not-T. These use the next letter along in the alphabet, so not-A is B, not-C is D, not-G is H and not-T is... V! It could have been U, but that's reserved already as T and U are synonymous (and note there is no IUPAC code for not-U). So right back when IUPAC was being designed it was already accepted that the two were essentially interchangeable with the meta-data on the sequence dictating which is it. This makes sense as both DNA and RNA still only have an alphabet of 4 symbols.

So... we should probably also expand the text in BAM to state that T (or U) is 8.

jmarshall commented 4 days ago

Context is this thread. I also suspect this was a simple regex chosen for its brevity rather than its accuracy. If someone does want to claim that all of A-Z must be taken as valid in SAM SEQ because that's what the spec says, I think it's incumbent on them to explain what Z should mean… :smile:

jkbonfield commented 4 days ago

Agreed. I think it was rather lazy to avoid spelling out the exact list of accepted characters. It would be interesting to test what other SAM parsers do when it comes to non-IUPAC codes.

So picard 2.26 (I can't use v3 as I don't have a modern enough Java available, but I doubt it's changed) with U gives me this:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Invalid character in read bases; File _.sam; Line 2

Setting the validation stringency to silent allows ViewSam to process the file and it's reported back as-is, but then it also allows #, X, and other non-IUPAC characters.

With U in there, the conversion to BAM fails regardless of validation stringency.

java -jar ~/work/picard-2.26.11.jar SamFormatConverter O=_.bam I=_.sam VALIDATION_STRINGENCY=SILENT
Exception in thread "main" java.lang.IllegalStateException: Bad base passed to charToCompressedBaseLow: #(35) in read: r1

In Htslib all non IUPAC codes are silently converted to N. (Prior to my recent change that included U, which IMO was a mistake as IUPAC does acknowledge U as a real base type.) I haven't checked Noodles on the command line, but I see the BAM encoder doesn't permit U.

What this is suggesting to me is that while the spec may well say one thing, the implementations behave very differently. None (until very recently) would store U in bam AS t and at least two dislike U in SAM. I made the change in htslib because ONT are apparently already producing SAM in the wild with U in it, so something needed to be done to cope with real-world data, but I was remiss in not bringing this up here for spec clarification. Thanks for raising this issue.

My take on it is the spec should be clearer and define precisely the list of characters to something that is compatible with BAM. We should also add a recommendation that U is treated as T to aid parsing of data now being produced, but it may be problematic to make it a rule unless we bump the SAM minor version number (which is also not satisfactory as it's the BAM part of the SAM spec that needs fixing, and that doesn't have anything to do with the SAM VN header field). The issue of how to track the meta-data (ie source material type) is best dealt with over in #801.

jkbonfield commented 4 days ago

Context is this thread. I also suspect this was a simple regex chosen for its brevity rather than its accuracy. If someone does want to claim that all of A-Z must be taken as valid in SAM SEQ because that's what the spec says, I think it's incumbent on them to explain what Z should mean… 😄

Ironically it was ONT's use of Z in fastq that spurred me on to come up with a formal way of representing methylation in SAM/BAM! They were initially using Z as 5mC. Ghastly and never going to fly beyond a few custom tools.

Anyway, regarding limiting of the regexp I think we're all pretty much on the same page.

samtools / hts-specs

Modify SAM spec document to better represent restrictions in the SEQ field #800