lh3 / miniprot

Align proteins to genomes with splicing and frameshift
https://lh3.github.io/miniprot/
MIT License

duplicate and close variants of the same alignment in the output #37

Open azat-badretdin opened 1 year ago

azat-badretdin commented 1 year ago

When I use these parameters:

 ./miniprot -G 100 -O 10 -J 34 -F 30 --gff -ut32 nucleotide.fasta proteins.fasta

I get very close variants of the same alignment:

gpipedev21:issue-34$ grep WP_004242317 miniprot.gff  | grep PAF
##PAF   gi|490362554|ref|WP_004242317.1|        343     149     343     +       gi|545778205|gb|U00096.3|       4641652 3221864 3222446 402     582     0       AS:i:680        ms:i:680      np:i:159 da:i:-1 do:i:0  cg:Z:194M       cs:Z::2*accC*gacS*aatA*atcV:2*atcV:2*cacS*gaaD*cccR*ggcQ:1*ggtD:9*cgcY:1*agtA*aaaQ*gaaS*atcV*atcT:2*tatF:1*aacA:2*gttY*aatD:7*gaaQ:1*gagS:1*ggcA*aagA:8*gcgT:3*cgaS:1*aaaR*caaG:3*gaaG:3*tggY:2*ggtD:3*tcgA:3*gaaA:7*cggG:1*gacS:19*attL:2*cgaQ*ggcH*ctgI*aacA:2*cagE:2*tcgA:10*cgaK:2*tttI:1*ccgS:9*atgV:8*gtgL*tatF:1*aaaR*gccL:2*ggtE:1*gcgQ*ctgE:2*ttaQ*gtcI:1*gttA*cccA:1*aaaR:1*aaaI:5*cgtK
##PAF   gi|490362554|ref|WP_004242317.1|        343     154     343     +       gi|545778205|gb|U00096.3|       4641652 3221879 3222446 396     567     0       AS:i:675        ms:i:675      np:i:157 da:i:-1 do:i:0  cg:Z:189M       cs:Z:*atcV:2*atcV:2*cacS*gaaD*cccR*ggcQ:1*ggtD:9*cgcY:1*agtA*aaaQ*gaaS*atcV*atcT:2*tatF:1*aacA:2*gttY*aatD:7*gaaQ:1*gagS:1*ggcA*aagA:8*gcgT:3*cgaS:1*aaaR*caaG:3*gaaG:3*tggY:2*ggtD:3*tcgA:3*gaaA:7*cggG:1*gacS:19*attL:2*cgaQ*ggcH*ctgI*aacA:2*cagE:2*tcgA:10*cgaK:2*tttI:1*ccgS:9*atgV:8*gtgL*tatF:1*aaaR*gccL:2*ggtE:1*gcgQ*ctgE:2*ttaQ*gtcI:1*gttA*cccA:1*aaaR:1*aaaI:5*cgtK

The same problem may also show up as exact duplicates of some alignment records in the output. For example:

gi|545778205|gb|U00096.3|       miniprot        CDS     729583  733323  6547    +       0       Parent=MP001848;Rank=18;Identity=0.9719;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        mRNA    729583  733323  6547    +       .       ID=MP001849;Rank=19;Identity=0.9719;Positive=0.9783;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        CDS     729583  733323  6547    +       0       Parent=MP001849;Rank=19;Identity=0.9719;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        mRNA    729583  733323  6547    +       .       ID=MP001850;Rank=20;Identity=0.9719;Positive=0.9783;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        CDS     729583  733323  6547    +       0       Parent=MP001850;Rank=20;Identity=0.9719;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        mRNA    729583  733323  6547    +       .       ID=MP001851;Rank=21;Identity=0.9719;Positive=0.9783;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        CDS     729583  733323  6547    +       0       Parent=MP001851;Rank=21;Identity=0.9719;Target=gi|15829983|ref|NP_308756.1| 1 1247

The alignments are the same, but the Rank=x value is different in each case.
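A minimal sketch of collapsing such records, assuming the GFF is tab-separated as miniprot writes it and that two mRNA records are duplicates when they share sequence, start, end, strand, and Target. The function name `dedup_gff` is hypothetical and not part of miniprot. Comment lines, including `##PAF` lines, pass through unchanged, so they may need separate handling.

```python
def dedup_gff(lines):
    """Keep the first mRNA per (seqid, start, end, strand, Target) and its children.

    Assumption: records that differ only in ID/Parent/Rank are duplicates.
    """
    seen = set()       # coordinate/Target keys already emitted
    kept_ids = set()   # IDs of the mRNA records that were kept
    out = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            out.append(line)  # comments (including ##PAF) pass through
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            out.append(line)  # not a standard 9-column record; leave as is
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        if fields[2] == "mRNA":
            key = (fields[0], fields[3], fields[4], fields[6], attrs.get("Target"))
            if key in seen:
                continue  # same location and target as an earlier mRNA
            seen.add(key)
            kept_ids.add(attrs.get("ID"))
            out.append(line)
        else:
            parent = attrs.get("Parent")
            # Keep children of kept mRNAs; drop children of dropped ones.
            if parent is None or parent in kept_ids:
                out.append(line)
    return out
```

Applied to the records above, this would keep only the lowest-ranked copy of the NP_308756.1 alignment at 729583-733323 together with its CDS line.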

lh3 commented 1 year ago

These are two different hits. For now, you have to filter them out yourself.
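As a rough illustration of such filtering (a sketch, not something miniprot provides), the near-duplicate `##PAF` hits can be reduced by keeping the highest-scoring hit among those that overlap heavily on the same target and strand for the same protein. The function names and the `min_overlap` threshold of 0.9 are assumptions chosen for the example; the column layout and the `AS:i:` tag are taken from the output shown above.

```python
def parse_paf(line):
    """Parse one PAF line (optionally prefixed with ##PAF) into a dict."""
    f = line.split()
    if f[0] == "##PAF":
        f = f[1:]
    # AS:i: is the alignment score tag seen in the output above; default to 0.
    score = next((int(t[5:]) for t in f[12:] if t.startswith("AS:i:")), 0)
    return {
        "query": f[0], "strand": f[4], "target": f[5],
        "tstart": int(f[7]), "tend": int(f[8]),
        "score": score, "line": line,
    }


def filter_near_duplicates(paf_lines, min_overlap=0.9):
    """Greedy filter: visit hits by descending score and drop a hit if it
    overlaps an already kept hit (same query, target, strand) by at least
    min_overlap of the shorter interval. min_overlap is an assumed cutoff."""
    hits = [parse_paf(l) for l in paf_lines if l.strip()]
    hits.sort(key=lambda h: -h["score"])  # stable sort keeps input order on ties
    kept = []
    for h in hits:
        duplicate = False
        for k in kept:
            if (h["query"], h["target"], h["strand"]) != (k["query"], k["target"], k["strand"]):
                continue
            overlap = min(h["tend"], k["tend"]) - max(h["tstart"], k["tstart"])
            shorter = min(h["tend"] - h["tstart"], k["tend"] - k["tstart"])
            if shorter > 0 and overlap / shorter >= min_overlap:
                duplicate = True
                break
        if not duplicate:
            kept.append(h)
    return kept
```

For the two WP_004242317.1 hits in the report, the second interval (3221879-3222446) lies entirely inside the first (3221864-3222446), so only the hit with `AS:i:680` would be kept. Exact duplicates are a special case of this, since their overlap fraction is 1.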

azat-badretdin commented 1 year ago

Thanks. Which example are you talking about? Or both?

lh3 commented 1 year ago

Both

azat-badretdin commented 1 year ago

"For now"

Does this mean there is hope that, in the future, there will be only one hit per region?