lh3 / miniprot

Align proteins to genomes with splicing and frameshift
https://lh3.github.io/miniprot/
MIT License
310 stars 16 forks source link

limiting max intron size leads to removal of intronless alignment #36

Closed azat-badretdin closed 1 year ago

azat-badretdin commented 1 year ago

alignment of WP_001436857.1 and U00096.3 behaves differently in case of default parameters and addition of -G 100: it produces an alignment for the former, and does not - for the latter.

base command: ./miniprot -ut32 --gff

./miniprot --version
0.7-r215-dirty

Alignment is very straightforward, no gaps, 73.21% identity:

gpipedev21:case-WP_001436857.73.21$ ./miniprot -ut32 --aln nucleotide.fasta proteins.fasta 
[M::mp_ntseq_read@0.054*1.02] read 4641652 bases in 1 contigs
[M::mp_idx_build@0.054*1.02] 36264 blocks
[M::mp_idx_build@0.370*1.84] collected syncmers
[M::mp_idx_build@0.572*1.55] 2808389 kmer-block pairs
[M::mp_idx_print_stat] 1043256 distinct k-mers; mean occ of infrequent k-mers: 2.69; 0 frequent k-mers accounting for 0 occurrences
gi|485817732|ref|WP_001436857.1|        56      0       56      -       gi|545778205|gb|U00096.3|       4641652 1417840 1418008 123     168     0       AS:i:231        ms:i:231        np:i:4da:i:0   do:i:0  cg:Z:56M        cs:Z::12*aaaN:1*gacE*gttI:4*gctI:1*gaaT:1*atgP:4*cacR:1*attV*ttaI*tcaP*aaaT*gacT:1*ccgM*agaK:5*gctS:11
##ATN   ATGACTAAGAAAATAAAATGTGCTTACCACCTTTGCAAAAAAGACGTTGAAGAAAGCAAAGCTATTGAAAGAATGCTTCACTTCATGCACGGGATTTTATCAAAAGACGAACCGAGAAAATATTGCAGTGAAGCTTGTGCCGAAAAAGACCAGATGGCACATGAACTTTAA
##ATA   M..T..K..K..I..K..C..A..Y..H..L..C..K..K..D..V..E..E..S..K..A..I..E..R..M..L..H..F..M..H..G..I..L..S..K..D..E..P..R..K..Y..C..S..E..A..C..A..E..K..D..Q..M..A..H..E..L..*..
##AAS   |  |  |  |  |  |  |  |  |  |  |  |     |  +  +  |  |  |  |     |     |     |  |  |  |     |  +  +           |     +  |  |  |  |  |  +  |  |  |  |  |  |  |  |  |  |  |     
##AQA   M  T  K  K  I  K  C  A  Y  H  L  C  N  K  E  I  E  E  S  K  I  I  T  R  P  L  H  F  M  R  G  V  I  P  T  T  E  M  K  K  Y  C  S  E  S  C  A  E  K  D  Q  M  A  H  E  L     
[M::worker_pipeline::0.616*1.51] mapped 1 sequences
[M::main] Version: 0.7-r215-dirty
[M::main] CMD: ./miniprot -ut32 --aln nucleotide.fasta proteins.fasta
[M::main] Real time: 0.632 sec; CPU: 0.946 sec; Peak RSS: 0.128 GB
azat-badretdin commented 1 year ago

Similar example: same nucleotde, protein WP_000241145

Here is the list of 30 proteins that have this problem:

YP_009518766.1  1       14      U00096.3        1212304 1212263 100.00  42      42      0
WP_001436857.1  1       56      U00096.3        1418008 1417841 73.21   168     123     0
WP_000241145.1  1       58      U00096.3        1418014 1417841 70.69   174     123     0
WP_001594002.1  1       58      U00096.3        1418014 1417841 70.69   174     123     0
WP_015570047.1  1       147     U00096.3        1044676 1044236 68.71   441     303     0
WP_000928496.1  1       290     U00096.3        1924591 1923722 68.62   870     597     0
WP_000928497.1  1       290     U00096.3        1924591 1923722 67.59   870     588     0
WP_003845048.1  1       104     U00096.3        3656740 3656411 66.36   330     219     18
WP_006818529.1  1       68      U00096.3        2868563 2868766 66.18   204     135     0
WP_016153611.1  1       73      U00096.3        2167520 2167305 65.75   219     144     3
WP_015965066.1  19      169     U00096.3        1003499 1003951 60.26   453     273     0
WP_013575098.1  1       275     U00096.3        2474032 2474853 59.64   825     492     3
WP_001683185.1  1       288     U00096.3        2758519 2757656 59.03   864     510     0
WP_009874494.1  1       192     U00096.3        3545624 3546196 55.15   582     321     15
WP_061405576.1  1       287     U00096.3        1936649 1937509 55.05   861     474     0
WP_003227776.1  1       325     U00096.3        3401066 3400077 53.33   990     528     15
WP_002435259.1  1       672     U00096.3        1089857 1087842 53.12   2016    1071    0
WP_011283110.1  1       196     U00096.3        3545624 3546196 53.06   588     312     15
WP_003223915.1  1       154     U00096.3        434647  435111  52.90   465     246     3
WP_000880566.1  32      95      U00096.3        2561413 2561610 51.52   198     102     6
WP_001060729.1  159     315     U00096.3        739022  739492  49.68   471     234     0
WP_000880088.1  1       252     U00096.3        2095844 2096614 46.51   774     360     21
WP_011170650.1  12      390     U00096.3        3971275 3972486 44.30   1185    525     66
WP_012908414.1  1       315     U00096.3        3462924 3463895 44.14   972     429     27
WP_003405263.1  11      323     U00096.3        1261869 1260934 43.49   945     411     15
WP_003246242.1  7       477     U00096.3        3903698 3902316 40.66   1446    588     96
WP_001567702.1  5       334     U00096.3        2104503 2103514 40.48   1008    408     36
WP_016150595.1  1       344     U00096.3        2104509 2103493 39.49   1056    417     63
WP_000139169.1  11      261     U00096.3        1150673 1151392 38.65   753     291     33
WP_003243088.1  1       272     U00096.3        3906561 3905743 36.00   825     297     15
azat-badretdin commented 1 year ago

I am going to try to reproduce this with latest the greatest clone I just made.

azat-badretdin commented 1 year ago

Nope :-(

$ ./miniprot.0.9 -G 100 -ut32 --gff nucleotide.fasta proteins.fasta
[M::mp_ntseq_read@1.314*0.06] read 4641652 bases in 1 contigs
[M::mp_idx_build@1.314*0.06] 36264 blocks
[M::mp_idx_build@1.626*0.43] collected syncmers
[M::mp_idx_build@1.832*0.49] 2808389 kmer-block pairs
[M::mp_idx_print_stat] 1043256 distinct k-mers; mean occ of infrequent k-mers: 2.69; 0 frequent k-mers accounting for 0 occurrences
##gff-version 3
##PAF   gi|485817732|ref|WP_001436857.1|        56      0       0       *       *       0       0       0       0       0       0
[M::worker_pipeline::1.873*0.51] mapped 1 sequences
[M::main] Version: 0.9-r223
[M::main] CMD: ./miniprot.0.9 -G 100 -ut32 --gff nucleotide.fasta proteins.fasta
[M::main] Real time: 1.890 sec; CPU: 0.965 sec; Peak RSS: 0.128 GB
lh3 commented 1 year ago

Thank you. This is a problem. I will have a closer look when I have time. A bit busy recently.

azat-badretdin commented 1 year ago

Thank you for your fast reply, Heng!

Worth noting that I narrowed down the region of U00096 that still produces this problem to 1393484..4641649 smaller regions do not produce the problem, but expanding this region does.

lh3 commented 1 year ago

With the default -G, miniprot identified a wrong chain spanning a long distance. The SW step kicks in and fixes the wrong chain. With a small -G, no chains are found and thus no alignment. This is an unfortunate side effect and difficult to fix.

For small genomes, you may add -M0 to increase the density of seeds. This may improve the sensitivity at the cost of performance. -G100 -M0 finds the alignment.

azat-badretdin commented 1 year ago

Thanks. This seems to be increasing the density of the seeds mentioned in the article (every other seed) to "every seed". Are you anticipating combinatorial decrease in performance of the application because of that?

Also: can the original default setting M=1 lead to the instability of the results to odd-number shifts in nucleotide positions?

Just clarifying: giving the focus of the article on exclusively eurakyota in testing, you probably consider all prokaryotic genomes "small", correct?

lh3 commented 1 year ago

can the original default setting M=1 lead to the instability of the results to odd-number shifts in nucleotide positions?

no

giving the focus of the article on exclusively eurakyota in testing, you probably consider all prokaryotic genomes "small", correct?

yes