Closed azat-badretdin closed 1 year ago
Similar example: same nucleotde, protein WP_000241145
Here is the list of 30 proteins that have this problem:
YP_009518766.1 1 14 U00096.3 1212304 1212263 100.00 42 42 0
WP_001436857.1 1 56 U00096.3 1418008 1417841 73.21 168 123 0
WP_000241145.1 1 58 U00096.3 1418014 1417841 70.69 174 123 0
WP_001594002.1 1 58 U00096.3 1418014 1417841 70.69 174 123 0
WP_015570047.1 1 147 U00096.3 1044676 1044236 68.71 441 303 0
WP_000928496.1 1 290 U00096.3 1924591 1923722 68.62 870 597 0
WP_000928497.1 1 290 U00096.3 1924591 1923722 67.59 870 588 0
WP_003845048.1 1 104 U00096.3 3656740 3656411 66.36 330 219 18
WP_006818529.1 1 68 U00096.3 2868563 2868766 66.18 204 135 0
WP_016153611.1 1 73 U00096.3 2167520 2167305 65.75 219 144 3
WP_015965066.1 19 169 U00096.3 1003499 1003951 60.26 453 273 0
WP_013575098.1 1 275 U00096.3 2474032 2474853 59.64 825 492 3
WP_001683185.1 1 288 U00096.3 2758519 2757656 59.03 864 510 0
WP_009874494.1 1 192 U00096.3 3545624 3546196 55.15 582 321 15
WP_061405576.1 1 287 U00096.3 1936649 1937509 55.05 861 474 0
WP_003227776.1 1 325 U00096.3 3401066 3400077 53.33 990 528 15
WP_002435259.1 1 672 U00096.3 1089857 1087842 53.12 2016 1071 0
WP_011283110.1 1 196 U00096.3 3545624 3546196 53.06 588 312 15
WP_003223915.1 1 154 U00096.3 434647 435111 52.90 465 246 3
WP_000880566.1 32 95 U00096.3 2561413 2561610 51.52 198 102 6
WP_001060729.1 159 315 U00096.3 739022 739492 49.68 471 234 0
WP_000880088.1 1 252 U00096.3 2095844 2096614 46.51 774 360 21
WP_011170650.1 12 390 U00096.3 3971275 3972486 44.30 1185 525 66
WP_012908414.1 1 315 U00096.3 3462924 3463895 44.14 972 429 27
WP_003405263.1 11 323 U00096.3 1261869 1260934 43.49 945 411 15
WP_003246242.1 7 477 U00096.3 3903698 3902316 40.66 1446 588 96
WP_001567702.1 5 334 U00096.3 2104503 2103514 40.48 1008 408 36
WP_016150595.1 1 344 U00096.3 2104509 2103493 39.49 1056 417 63
WP_000139169.1 11 261 U00096.3 1150673 1151392 38.65 753 291 33
WP_003243088.1 1 272 U00096.3 3906561 3905743 36.00 825 297 15
I am going to try to reproduce this with latest the greatest clone I just made.
Nope :-(
$ ./miniprot.0.9 -G 100 -ut32 --gff nucleotide.fasta proteins.fasta
[M::mp_ntseq_read@1.314*0.06] read 4641652 bases in 1 contigs
[M::mp_idx_build@1.314*0.06] 36264 blocks
[M::mp_idx_build@1.626*0.43] collected syncmers
[M::mp_idx_build@1.832*0.49] 2808389 kmer-block pairs
[M::mp_idx_print_stat] 1043256 distinct k-mers; mean occ of infrequent k-mers: 2.69; 0 frequent k-mers accounting for 0 occurrences
##gff-version 3
##PAF gi|485817732|ref|WP_001436857.1| 56 0 0 * * 0 0 0 0 0 0
[M::worker_pipeline::1.873*0.51] mapped 1 sequences
[M::main] Version: 0.9-r223
[M::main] CMD: ./miniprot.0.9 -G 100 -ut32 --gff nucleotide.fasta proteins.fasta
[M::main] Real time: 1.890 sec; CPU: 0.965 sec; Peak RSS: 0.128 GB
Thank you. This is a problem. I will have a closer look when I have time. A bit busy recently.
Thank you for your fast reply, Heng!
Worth noting that I narrowed down the region of U00096
that still produces this problem to 1393484..4641649
smaller regions do not produce the problem, but expanding this region does.
With the default -G, miniprot identified a wrong chain spanning a long distance. The SW step kicks in and fixes the wrong chain. With a small -G, no chains are found and thus no alignment. This is an unfortunate side effect and difficult to fix.
For small genomes, you may add -M0
to increase the density of seeds. This may improve the sensitivity at the cost of performance. -G100 -M0
finds the alignment.
Thanks. This seems to be increasing the density of the seeds mentioned in the article (every other seed) to "every seed". Are you anticipating combinatorial decrease in performance of the application because of that?
Also: can the original default setting M=1
lead to the instability of the results to odd-number shifts in nucleotide positions?
Just clarifying: giving the focus of the article on exclusively eurakyota in testing, you probably consider all prokaryotic genomes "small", correct?
can the original default setting M=1 lead to the instability of the results to odd-number shifts in nucleotide positions?
no
giving the focus of the article on exclusively eurakyota in testing, you probably consider all prokaryotic genomes "small", correct?
yes
alignment of
WP_001436857.1
andU00096.3
behaves differently in case of default parameters and addition of-G 100
: it produces an alignment for the former, and does not - for the latter.base command:
./miniprot -ut32 --gff
Alignment is very straightforward, no gaps, 73.21% identity: