lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.81k stars 414 forks source link

NM and AS score with --splice option #546

Closed aroelo closed 4 years ago

aroelo commented 4 years ago

Hi, I have used minimap2 with the following settings: minimap2 -ax splice -t 20 --split-prefix tmp_test 4081.ref.fa Tomato_polyA_reads.fastq -o Tomato_polyA.sam -y

And I have some questions about the tags in the output file, see for example the following two alignments:

00000d96-a420-440e-9e0b-5eae4e66480b    0       Solanum_lycopersicum_chromosome_ch01_complete_genome     32989653        0       6S25M1D7M1D44M2D40M2S   *       0       0       CTTTTTAGGGATGCAACTTGAGGACTTCCCAGGGGTCACCATCCTAGTACTACTCTCGCCAAAGCACGCTTAACTTCGGAGTTGATGGGATCCGGTGCGTTGGTGCTGGTATGATCGCATCCCA    -===:5:??J?63.*+--*>=<32#%%%((&0/.789)0/944:656:68:+'9=>;))-<==)%&&56:=9:>@6>21,7-/10=8>B?33C<CC925?<-.<7@DB>;845/$:%,5++'%&    NM:i:8  ms:i:94 AS:i:94 nn:i:0  tp:Z:P  cm:i:9  s1:i:73 s2:i:77 de:f:0.0588     rl:i:0
00000d96-a420-440e-9e0b-5eae4e66480b    256     Solanum_lycopersicum_whole_genome_shotgun_sequence  32848247        0       6S25M1D7M1D44M2D20M4067N1I19M2S *       0       0       *       *       NM:i:8  ms:i:93 AS:i:61 nn:i:0  ts:Z:+  tp:Z:S  cm:i:9  s1:i:64 de:f:0.0588     rl:i:0 

In the splice mode, 1) long deletions are taken as introns and represented as the ‘N’ CIGAR operator; 2) long insertions are disabled; 3) deletion and insertion gap costs are different during chaining; 4) the computation of the ‘ms’ tag ignores introns to demote hits to pseudogenes.

Point 4) mentions explicitly that the 'ms' tag ignores introns. Does that mean that the 'AS' tag doesn't ignore introns? Is this why the AS is lower for the secondary alignment, because the only difference (based on the CIGAR string) is an intron of length 4067 and one extra insertion.

Total number of mismatches and gaps in the alignment

Based on the CIGAR strings and this definition I would think they are either 4 and 5 (without soft clipping at start/end) or 12 and 13 (with soft clipping at start/end)

Thanks!

lh3 commented 4 years ago

Does that mean that the 'AS' tag doesn't ignore introns?

That is correct: AS penalizes introns.

NM counts indels as well.

aroelo commented 4 years ago

Thank you for the clarification.

If this is the case, what would you suggest as a good approach to prioritize alignments that have the same identity percentage, but a higher coverage?

I thought of the 'AS' as a score more representative of the complete alignment and the 'ms' as a score for the best partial alignment. If minimap2 is used without the '--splice' option, I would just use 'AS' instead of 'ms' to achieve the above.

Also, what is the reasoning for penalizing introns in the AS with the '--splice' option?

lh3 commented 4 years ago

AS is similar to how Smith-Waterman works. You pay for gap/intron opens.

The manpage has explained why ms is preferred: to reduce the effect of pseudogenes.

aroelo commented 4 years ago

I want to thoroughly understand the score calculations and re-calculated some of the scores for my alignments, but it seems like I'm missing an aspect that I can't find in the manpage.

According to the manpage, in splice mode the gap penalties are calculated like this: lambda k: min(O1 + k * E1, O2 + k * E2)

Where k = length of the gap and

O1 = 2
O2 = 32
E1 = 1
E2 = 0

In the alignment below, there is a deletion of length 36. There are some smaller deletions as well, one of length 1 and one of length 2.

If you would use the above calculation, you would get a deletion penalty of:
32+3+4=39

However, it seems like only the first part of the equation is used:
38+3+4=45

This is the only way to get to the final AS of 92 that we can see in the alignment below.

scores matches = 164
penalty insertions = 17
penalty mismatches = 10
164-17-10 = 137

The only thing left is to apply the penalty for deletions.

137-39 = 98
137-45 = 92 <---

It seems like the penalty for this 36 bp long deletion was 38 instead of 32 and I don't understand why this is the case. I have seen this happen in another handful cases, but most of the time the penalty is according to the calculation from the manpage.

Is there anything that I'm missing why in some cases the gap penalty is calculated differently?

Minimap cmd used:
minimap2 -ax splice -y --eqx -t 20 --split-prefix tmp_test 4081.ref.fa Tomato_polyA_reads_tma.fastq -o Tomato_polyA.ntwgs.sam

Alignment:
02e175d4-33be-4b07-958d-496d5188048e 256 C663681072|tid|4081|HG975518.1|Solanum_lycopersicum_chromosome_ch06_complete_genome 41648805 0 40S18=1I1X4=1X23=36D36=1I1X6=1I12=1D24=2D12=1I5=1X13=3I3=1X8=4S * 0 0 * * NM:i:51 ms:i:92 AS:i:92 nn:i:0 tp:A:S cm:i:15 s1:i:87 de:f:0.0734 rl:i:0 RG:Z:248c77c670830a2239e7074a4ec1c68a35baeacd_barcode10_Tomato_polyA

lh3 commented 4 years ago

According to the manpage, in splice mode the gap penalties are calculated like this:

No. That formula is applied to genomics alignment, not to spliced alignment.

aroelo commented 4 years ago

I see, I think I was a bit quick to use the formula from the manpage, so I had another look at the 'Aligning spliced sequences' section from the paper.
image

Here I see that for deletions the formula is:
q + l * e

Which would translate to:
2 + 36 * 1 = 38 in my example above if I'm correct.

So this does correspond with the ms/AS I see in the output.

However, reading the last few lines I wonder why my deletion is not seen as an intron? I believe that q̃ - q / e = 32 - 2 / 1 in this case, so my deletion of 36 is longer than this.

It also seems like q + l * e is sometimes used when calculating the penalty for insertions. And for spliced alignment this should always be min{q + |l| · e, q̃} I thought.

Example: I have an alignment (see below) with one insertion of length 39, five insertions of length 1 and one insertion of length 2. Using the formula from the paper (min{q + |l| · e, q̃}): 3 + 3 + 32 + 4 + 3 + 3 + 3 = 51

Using q + l * e: 3 + 3 + 41 + 4 + 3 + 3 + 3 = 60

scores matches = 399
penalty deletions = 17
penalty mismatches = 28
399 - 17 - 28 = 354

The only thing left is to apply the penalty for insertions. If we use a penalty of 60 we get to the reported ms/AS of 294.

354 - 51 = 303
354 - 60 = 294 <---

Alignment:
0b3e0884-bd62-4d3f-bb72-e233866d0eb9 272 C170763669|tid|4081|AC215459.2|Solanum_lycopersicum_chromosome_2_clone_C02SLe0089P21_complete_sequence 3782 031S6=1I22=2D2=1X15=1I1X10=39I5=2I29=2D2X33=4D17=1I1X11=1D33=2X1=2X17=1I42=1X19=1X1=1X1=1I23=1X20=1X92=31S * 0 0 * * NM:i:69 ms:i:294 AS:i:294 nn:i:0 tp:Z:S cm:i:65 s1:i:306 de:f:0.059 rl:i:0 RG:Z:248c77c670830a2239e7074a4ec1c68a35baeacd_barcode10_Tomato_polyA AS:i:294 ti:Z:4081 er:Z:NA