lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.77k stars 404 forks source link

minimap2 for ANI calculation #437

Closed SilasK closed 5 years ago

SilasK commented 5 years ago

Hey, I'm fascinated by how fast minimap2 is. I would like to use it to calculate Average nucleotide identity for metagenome assembled genomes or normal genomes to group them into species (95% identity).

I used the -x asm10 option. And then I sum the matches (col 10) and divide it by the sum of the alignment length (col 11). This should give me the average identity or ANI, isn't it?

I realized that this ANI calculated using minimap is much stricter than ANI based on mash or mummer.

E.g. for a comparison of a genome and a MAG, mash, and mummer gives my 99% identity and minimap 80% when running without the -x asm10 preset I get 90% which is still much less. Any Idea?

lh3 commented 5 years ago

Without -c or -a, col 10 is greatly underestimated. Tag dv:f gives a much better estimate of sequence divergence than col10/col11. I would recommend to use -c --secondary=no.

SilasK commented 5 years ago

I don't get dv:f tags when aligning two fastas.

with minimap2 v2.17-r941

Log:


[M::mm_idx_gen::0.104*1.02] collected minimizers
[M::mm_idx_gen::0.121*1.29] sorted minimizers
[M::main::0.121*1.29] loaded/built the index for 28 target sequence(s)
[M::mm_mapopt_update::0.128*1.27] mid_occ = 4
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 28
[M::mm_idx_stat::0.134*1.25] distinct minimizers: 514621 (98.47% are singletons); average occurrences: 1.016; average spacing: 5.353
[M::worker_pipeline::0.403*2.33] mapped 22 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -c --secondary=no filtered_bins/ERR675627_maxbin_21.fasta filtered_bins/ERR675634_metabat_01.fasta
[M::main] Real time: 0.419 sec; CPU: 0.953 sec; Peak RSS: 0.060 GB

output:

ERR675634_0 425144  35  423465  +   ERR675627_3 470019  46589   470019  423430  423430  60  NM:i:0  ms:i:846860 AS:i:846860 nn:i:0  tp:A:P  cm:i:79017  s1:i:423423 s2:i:76 de:f:0  rl:i:60 cg:Z:423430M
ERR675634_0 425144  425089  425144  -   ERR675627_44    161963  3585    3640    55  55  17  NM:i:0  ms:i:110    AS:i:110    nn:i:0  tp:A:P  cm:i:6  s1:i:51 s2:i:0  de:f:0  rl:i:60 cg:Z:55M
ERR675634_0 425144  0   55  -   ERR675627_3 470019  46334   46389   55  55  9   NM:i:0  ms:i:110    AS:i:110    nn:i:0  tp:A:P  cm:i:6  s1:i:46 s2:i:0  de:f:0  rl:i:60 cg:Z:55M
ERR675634_24    201665  37  201665  +   ERR675627_19    201665  37  201665  201628  201628  60  NM:i:0  ms:i:403256 AS:i:403256 nn:i:0  tp:A:P  cm:i:37805  s1:i:201615 s2:i:138    de:f:0  rl:i:0  cg:Z:201628M

Others seem to have the same problem.

lh3 commented 5 years ago

With -c or -a, you should look at the de tag. dv is approximate. de is calculated from the alignment and is more accurate.

lh3 commented 5 years ago

See also:

https://lh3.github.io/minimap2/minimap2.html#10

SilasK commented 5 years ago

Great thank you!

SilasK commented 4 years ago

I get negative values for the de tag. Makes this sense?

Should I just clip them to zero or take the absolute value?

I used asm20 preset to align bacterial genomes, do you think this is reasonable?

lh3 commented 4 years ago

Negative de is a bug. Fixed in github HEAD. You can turn them to 0.

SilasK commented 4 years ago

Perfect, thanks for the quick reply.

SilasK commented 4 years ago

I encountered another small problem. In addition to the ANI I like to calculate the coverage of the alignment between two prokaryote genomes.

I calculate the coverage as the aligned fraction divided by the genome length of the smaller genome. I take care that the larger genome is the ref and the smaller the query, but even then I get coverage values > 100%. Sometimes >110%.

I use the -c --secondary=no so I assume it's not due to secondary mapping. Maybe gaps might be the reason.

Any idea?