lh3 / miniprot

Align proteins to genomes with splicing and frameshift
https://lh3.github.io/miniprot/
MIT License
310 stars 16 forks source link

Inserted internal stop codons are not penalized #58

Closed Percud closed 4 months ago

Percud commented 4 months ago

Hi,

I am using miniprot to identify genes and pseudogenes in primate genomes. It appears that miniprot (0.12-r237) does not penalize (nor acknowledge) an inserted stop codon.

For instance with these example sequences:

$ cat genomic.fa 
>example
CAAGTCTATGTGGAAGAAATCCCTTGAAAGTGGCTTGAAAAGgttaacacttatttgtgctctcagtttaattaatgaaattggctttatggcaagtactgaatttttcctagaaataatttgtttccctaattttcagAATGGAGTTAAGCATGTCCATGCATTTATTCACACT
$ cat pep.fa
>example_pep
VYVEEIPWKRLEKNGVKHVHAFIHT

The genomic sequence is a pseudogene with a W->* mutation. I can retrieve the correct alignment by lowering the penalty for frameshift/stop (-F):

$ miniprot -F11 --aln --gff genomic.fa pep.fa 
##gff-version 3
##PAF   example_pep     25      0       25      +       example 175     3       175     69      75      0       AS:i:72 ms:i:109        np:i:23 fs:i:0  st:i:1  da:i:-1 do:i:-1 cg:Z:13M97N12M cs:Z::7*tgaW:1*tggR:3~gt97ag:12
##ATN   GTCTATGTGGAAGAAATCCCTTGAAAGTGGCTTGAAAAGgttaacacttatttgtgctctcagtttaattaatgaaattggctttatggcaagtactgaatttttcctagaaataatttgtttccctaattttcagAATGGAGTTAAGCATGTCCATGCATTTATTCACACT
##ATA   V..Y..V..E..E..I..P..*..K..W..L..E..K..                                                                                                 N..G..V..K..H..V..H..A..F..I..H..T..
##AAS   |  |  |  |  |  |  |     |     |  |  |                                                                                                   |  |  |  |  |  |  |  |  |  |  |  |  
##AQA   V  Y  V  E  E  I  P  W  K  R  L  E  K                                                                                                   N  G  V  K  H  V  H  A  F  I  H  T  
example miniprot        mRNA    4       175     109     +       .       ID=MP000001;Rank=1;Identity=0.9200;Positive=0.9200;StopCodon=1;Target=example_pep 1 25
example miniprot        CDS     4       42      41      +       0       Parent=MP000001;Rank=1;Identity=0.8462;StopCodon=1;Target=example_pep 1 13
example miniprot        CDS     140     175     68      +       0       Parent=MP000001;Rank=1;Identity=1.0000;Target=example_pep 14 25

However, with the default parameter, miniprot prefers to insert a stop codon and open two gaps. This inserted stop codon is not acknowledged in the gff.

$ miniprot --aln --gff genomic.fa pep.fa 
##gff-version 3
##PAF   example_pep     25      0       25      +       example 175     3       175     69      81      0       AS:i:66 ms:i:103        np:i:23 fs:i:0  st:i:0  da:i:-1 do:i:-1 cg:Z:7M2D1M2I3M97N12M   cs:Z::7-tgaaag:1+KR:3~gt97ag:12
##ATN   GTCTATGTGGAAGAAATCCCTTGAAAGTGG------CTTGAAAAGgttaacacttatttgtgctctcagtttaattaatgaaattggctttatggcaagtactgaatttttcctagaaataatttgtttccctaattttcagAATGGAGTTAAGCATGTCCATGCATTTATTCACACT
##ATA   V..Y..V..E..E..I..P..*..K..W..-..-..L..E..K..                                                                                                 N..G..V..K..H..V..H..A..F..I..H..T..
##AAS   |  |  |  |  |  |  |        |        |  |  |                                                                                                   |  |  |  |  |  |  |  |  |  |  |  |  
##AQA   V  Y  V  E  E  I  P  -  -  W  K  R  L  E  K                                                                                                   N  G  V  K  H  V  H  A  F  I  H  T  
example miniprot        mRNA    4       175     103     +       .       ID=MP000001;Rank=1;Identity=0.8519;Positive=0.8519;Target=example_pep 1 25
example miniprot        CDS     4       42      35      +       0       Parent=MP000001;Rank=1;Identity=0.7333;Target=example_pep 1 13
example miniprot        CDS     140     175     68      +       0       Parent=MP000001;Rank=1;Identity=1.0000;Target=example_pep 14 25

Best regards, Riccardo

lh3 commented 4 months ago

Yes, you are correct. However, a fix is non-trivial. I will keep this in mind and come back to this problem later. Thanks for the test case.

lh3 commented 4 months ago

Actually simpler than I thought. Let me know if you still see this issue.