Closed SilasK closed 5 years ago
Without -c
or -a
, col 10 is greatly underestimated. Tag dv:f
gives a much better estimate of sequence divergence than col10/col11. I would recommend to use -c --secondary=no
.
I don't get dv:f
tags when aligning two fastas.
with minimap2 v2.17-r941
Log:
[M::mm_idx_gen::0.104*1.02] collected minimizers
[M::mm_idx_gen::0.121*1.29] sorted minimizers
[M::main::0.121*1.29] loaded/built the index for 28 target sequence(s)
[M::mm_mapopt_update::0.128*1.27] mid_occ = 4
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 28
[M::mm_idx_stat::0.134*1.25] distinct minimizers: 514621 (98.47% are singletons); average occurrences: 1.016; average spacing: 5.353
[M::worker_pipeline::0.403*2.33] mapped 22 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -c --secondary=no filtered_bins/ERR675627_maxbin_21.fasta filtered_bins/ERR675634_metabat_01.fasta
[M::main] Real time: 0.419 sec; CPU: 0.953 sec; Peak RSS: 0.060 GB
output:
ERR675634_0 425144 35 423465 + ERR675627_3 470019 46589 470019 423430 423430 60 NM:i:0 ms:i:846860 AS:i:846860 nn:i:0 tp:A:P cm:i:79017 s1:i:423423 s2:i:76 de:f:0 rl:i:60 cg:Z:423430M
ERR675634_0 425144 425089 425144 - ERR675627_44 161963 3585 3640 55 55 17 NM:i:0 ms:i:110 AS:i:110 nn:i:0 tp:A:P cm:i:6 s1:i:51 s2:i:0 de:f:0 rl:i:60 cg:Z:55M
ERR675634_0 425144 0 55 - ERR675627_3 470019 46334 46389 55 55 9 NM:i:0 ms:i:110 AS:i:110 nn:i:0 tp:A:P cm:i:6 s1:i:46 s2:i:0 de:f:0 rl:i:60 cg:Z:55M
ERR675634_24 201665 37 201665 + ERR675627_19 201665 37 201665 201628 201628 60 NM:i:0 ms:i:403256 AS:i:403256 nn:i:0 tp:A:P cm:i:37805 s1:i:201615 s2:i:138 de:f:0 rl:i:0 cg:Z:201628M
Others seem to have the same problem.
With -c
or -a
, you should look at the de
tag. dv
is approximate. de
is calculated from the alignment and is more accurate.
Great thank you!
I get negative values for the de tag. Makes this sense?
Should I just clip them to zero or take the absolute value?
I used asm20 preset to align bacterial genomes, do you think this is reasonable?
Negative de is a bug. Fixed in github HEAD. You can turn them to 0.
Perfect, thanks for the quick reply.
I encountered another small problem. In addition to the ANI I like to calculate the coverage of the alignment between two prokaryote genomes.
I calculate the coverage as the aligned fraction divided by the genome length of the smaller genome. I take care that the larger genome is the ref and the smaller the query, but even then I get coverage values > 100%. Sometimes >110%.
I use the -c --secondary=no
so I assume it's not due to secondary mapping.
Maybe gaps might be the reason.
Any idea?
Hey, I'm fascinated by how fast minimap2 is. I would like to use it to calculate Average nucleotide identity for metagenome assembled genomes or normal genomes to group them into species (95% identity).
I used the
-x asm10
option. And then I sum the matches (col 10) and divide it by the sum of the alignment length (col 11). This should give me the average identity or ANI, isn't it?I realized that this ANI calculated using minimap is much stricter than ANI based on mash or mummer.
E.g. for a comparison of a genome and a MAG, mash, and mummer gives my 99% identity and minimap 80% when running without the -x asm10 preset I get 90% which is still much less. Any Idea?