lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.77k stars 404 forks source link

Inconsistent mapping against full reference and a single chromosome #107

Open alexeigurevich opened 6 years ago

alexeigurevich commented 6 years ago

Hi, Heng! We experimented with Drosophila melanogaster genome and found several strange things in minimap2 behaviour (Version: 2.8-r686-dirty). I've attached two our files for reproducing the issues (full chrX extracted from the reference genome and its fragment from our simulated assembly which should perfectly map to it). We also used the full reference Drosophila_melanogaster.BDGP6.dna.toplevel.fa which could be downloaded e.g. here.

The issues are:

  1. ./minimap2 Drosophila_melanogaster.chrX.fa chrX_fragment.fa gives empty output while we expect to see several perfect mappings. Note that ./minimap2 chrX_fragment.fa Drosophila_melanogaster.chrX.fa results in many alignments.
  2. ./minimap2 Drosophila_melanogaster.BDGP6.dna.toplevel.fa chrX_fragment.fa produces expected matches and all of them are from chrX, so (1) looks strange even without taking into account the issue with target/query order.
  3. ./minimap2 -x asm10 Drosophila_melanogaster.BDGP6.dna.toplevel.fa chrX_fragment.fa results in no alignments again. Note that all alignments from (2) are perfect, so should not be excluded because of -x asm10 in theory.

Drosophila_melanogaster.chrX.fa.gz chrX_fragment.fa.gz

lh3 commented 6 years ago

This is all because chrX_fragment.fa has 65 copies on chrX.

  1. If you look at the minimap2 log, you will see this line: mid_occ = 37. This means minimizers occurring 37 times or more will be ignored. This fragment has 65 copies, more than the threshold.

  2. When you map to the whole genome, the threshold is changed to 115, probably because other regions have more copies. At 115, the fragment become mappable.

  3. -x asm10 uses k-mer of different lengths. The threshold will change.

You can choose a fixed threshold with -f 200. You will see consistent results from these three runs.

I wish minimap2 could output a random mapping even if a query has thousands of copies. However, I could not find an efficient way to achieve that. This is one of the things suffix array/BWT based algorithms are better at.

lh3 commented 6 years ago

PS: a high -f will make minimap2 slower. I have not tested how much slower this option may cause.

lh3 commented 6 years ago

PSS: there are not as many as 65 copies. Some of the inexact copies are duplicates.

alexeigurevich commented 6 years ago

Thanks for the explanation! Will try to use -f in our experiments.