samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
632 stars 174 forks source link

Add SAM recommendation of MAPQ=0 for unmapped reads #753

Open jkbonfield opened 5 months ago

jkbonfield commented 5 months ago

We already have a recommendation that MAPQ of 255 should not be used. Expand on this to recommend zero is used when unmapped.

This is purely a recommendation, made for maximum compatibility, and not a specification requirement.

Fixes #727

github-actions[bot] commented 5 months ago

Changed PDFs as of 79f23800d90ab64eef060b09d9edd61a14e612b2: SAMv1 (diff).

zaeleus commented 4 months ago

Should this be further constrained to unmapped records with missing mapping qualities? That is, recommend using 0 instead of 255. Otherwise, is this recommending that all unmapped records have their mapping qualities reset to 0?

jkbonfield commented 4 months ago

Is there any value in another mapping quality? By definition, if it's not mapped, the mapping quality doesn't make sense.

d-cameron commented 1 week ago

Is there any value in another mapping quality? By definition, if it's not mapped, the mapping quality doesn't make sense.

The use case for non-zero mapq is for an aligner that can confidently map to a sequence that is not included in the reference genome.

The primary use case for this is graph genome aligners. Graph genome aligners can have a mapq w.r.t. to graph but when projected onto a linear reference, that that alignment position is fully contained within an insertion. How should these be represented? The two approaches I've seen in the wild are:

1) Map with an insertion CIGAR to the insertion location (Seven Bridges aligner approach)

2) Unmapped read placed at the insertion location (DRAGEN aligner approach)

Making this mapq change will cause an existing implementation in widespread usage to break/become non-compliant. Ok to do, but it would be good to have a) some time to update the implementation, and b) spec-defined guideance on how non-reference alignments should be handled (both those with INS liftovers and those that do not map to reference contig (e.g. graph decoy sequence). For example, DRAGEN retains information about originating alignment via a SA-style custom SAM tag. This will be a lossy operation if mapq is forced to 0.

jkbonfield commented 4 days ago

I disagree. The specification already states "If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800" so adding a recommendation to use zero values for such cases cannot break existing implementations any more than they are already broken. Spec compliant software should ignore those fields. All this PR is doing is trying to encourage people to stop using them in such situations.

My recommendation is people should use additional aux fields if they need a way to represent something which is currently outside of the specification, or to raise a PR here requesting things are changed.

Mind you, that sentence above also rules out the well established (since day one) strategy of "placed" reads that are unmapped by POSitioned along side their mate pair, which is rather unfortunate.

Edit: sorry I'm being forgetful. There is a recommendations section which does define meaning to RNAME and POS for unmapped data (ie the mate-pair strategy above). It's rather poor form though for the same document to be essentially saying "these fields have no meaning" followed by "this is the meaning of these fields"! I accept that it's "no assumptions" rather than no meaning, but we're leading people up the garden path somewhat. The phrasing in section 1.4 ought to be amended to indicate this. Eg:

"Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800. However see the Recommended Practices in section 2 for suggested interpretations."

It's far from ideal and I don't believe it should ever have been written this way, but it is what it is.

As for the DRAGEN strategy of placing the read where the graph insertion is, I'm fully in agreement with it from a logical view point. Remember why we "place" unmapped reads along side their mate pair. It's because the sample may not be the same as the reference and possibly we have a large insertion. By putting reads together, we can pull out all data for such a region via a single SAM query and then do denovo assembly to construct a sample-specific set of alleles for that locus. Indeed this is exactly what a lot of variant callers do. If we have a better placement for that read via a graph genome, then it is sensible that we place the unmapped data closer to the exact site of insertion. Previously this wouldn't have made much difference with classic small insert sized templates, but we now have longer range library types and also the possibility of very long reads which may be broken up into supplementary reads and placed at different locations.

Also, why MAPQ and only MAPQ? If we have a path through a graph and we've flattened those coordinates back to their nearest place on a linear reference, then we have the alignment of this sequence against the graph, including the path in the graph, the cigar string against that path, it's mapping quality and maybe NM / MD tags against that path too. Basically everything we see in GAF format. I'm not arguing for adding meaning to those other fields, but questioning why only one field was chosen (mapq). This is all best done with additional tags rather than changing to meaning of the mandatory columns, especially given our assertion that we cannot make any assumptions on their meaning.

The spec really needs a refresh to accommodate such ideas, but IMO it should be driven by the people generating the data and doing the graph alignments. (Ie GA4GH "Driver projects" and similar.) It's also a much larger issue than this one minor tweak.

jkbonfield commented 4 days ago

On the related issue of POS having no meaning for reads with no aligned bases, I hit this many years ago when implementing Gap5. It's worse than it looks too as it's still problematic even when you do have aligned bases.

Insertions basically appear before the next base, so a read that starts with an insertion could have CIGAR of 99I10M and a POS for that M. As we get shorter and shorter reads all with the same insertion but less mapped data at the end, POS doesn't change, and 99I1M is the same location. Although it's undefined, logically, so is 99I (and 0M). So POS being the base after does make some sort of sense here.

Fundamentally though this breaks indexing. Even the normal case of 90I10M means our insertions are only known about after they've been and gone. Consider:

POS      12345      6789                                                        
Ref      AGCTA      AGCT                                                        
Sample   AGCTAGGTTGGAGCT                                                        
Seq0     AGCTAGGTTGGAGCT   Pos 1, CIGAR 5M6I4M
Seq1     AGCTAGGTT         Pos 1, CIGAR 5M4I                                    
Seq2            TTGGAGCT   Pos 6, CIGAR (2P)4I4M                                

A query for chr:5-5 wouldn't give us Seq2. (And a query for chr:6-6 wouldn't give us Seq1.) We need more nuance here.

Processing each read iteratively in a pileup orientation also doesn't work and would give us a set of reads with an insertion at position 5 that doesn't include Seq2, as it hasn't yet started. Then when we get to position 6 we're told there was additional data we should have been using for computing the insertion consensus. Doh!

In Gap5 I replaced every CIGAR with 1D and decremented POS by 1, and then during processing removed the initial deleted base. VERY ugly, but it made the pileup iterator work as our position coordinates are now the base before an insertion rather than after the insertion. It's a miss-design of the spec IMO, but we are where we are and I don't really know how to retrofit it without resorting to such hacks.