lh3/minimap - Githubissues

Introduction

Minimap is an experimental tool to efficiently find multiple approximate mapping positions between two sets of long sequences, such as between reads and reference genomes, between genomes and between long noisy reads. By default, it is tuned to have high sensitivity to 2kb matches around 20% divergence but with low specificity. Minimap does not generate alignments as of now and because of this, it is usually tens of times faster than mainstream aligners. With four CPU cores, minimap can map 1.6Gbp PacBio reads to human in 2.5 minutes, 1Gbp PacBio E. coli reads to pre-indexed 9.6Gbp bacterial genomes in 3 minutes, to pre-indexed >100Gbp nt database in ~1 hour (of which ~20 minutes are spent on loading index from the network filesystem; peak RAM: 10GB), map 2800 bacteria to themselves in 1 hour, and map 1Gbp E. coli reads against themselves in a couple of minutes.

Minimap does not replace mainstream aligners, but it can be useful when you want to quickly identify long approximate matches at moderate divergence among a huge collection of sequences. For this task, it is much faster than most existing tools.

Usage

Map two sets of long sequences:
```
minimap target.fa.gz query.fa.gz > out.mini
```
The output is TAB-delimited with each line consisting of query name, length, 0-based start, end, strand, target name, length, start, end, the number of matching bases, the number of co-linear minimizers in the match and the fraction of matching bases.

All-vs-all PacBio read self-mapping for miniasm:

minimap -Sw5 -L100 -m0 reads.fa reads.fa | gzip -1 > reads.paf.gz

Prebuild index and then map:
```
minimap -d target.mmi target.fa.gz
minimap -l target.mmi query.fa.gz > out.mini
```
Minimap indexing is very fast (1 minute for human genome; 50 minutes for >100Gbp nt database retrieved on 2015-09-30), but for huge repeatedly used databases, prebuilding index is still preferred.
Map sequences against themselve without diagnal matches:
```
minimap -S sequences.fa sequences.fa > self-match.mini
```
The output may still contain overlapping matches in repetitive regions.

Algorithm Overview

Indexing. Collect all (w,k)-minimizers in a batch (-I=4 billion bp) of target sequences and store them in a hash table. Mark top -f=0.1% of most frequent minimizers as repeats. Minimap uses invertible hash function to avoid taking ploy-A as minimizers.
For each query, collect all (w,k)-minimizers and look up the hash table for matches (q_i,t_i,s_i), where q_i is the query position, t_i the target position and s_i indicates whether the minimizer match is on the same strand.
For matches on the same strand, sort by {q_i-t_i} and then cluster matches within a -r=500bp window. Minimap merges two windows if -m=50% of minimizer matches overlap. For matches on different strands, sort {q_i+t_i} and apply a similar clustering procedure. This is inspired by the Hough transformation.
For each cluster, sort (q_i,t_i) by q_i and solve a longest increasing sequence problem for t_i. This finds the longest co-linear matching chain. Break the chain whenever there is a gap longer than -g=10000.
Output the start and end of the chain if it contains -c=4 or more minimizer matches and the matching length is no less than -L=40.
Go to 1 and rewind to the first record of query if there are more target sequences; otherwise stop.

To increase sensitivity, we may decrease -w to index more minimizers; we may also decrease -k, though this may greatly impact performance for mammalian genomes.

Also note that by default, if the total length of target sequences is less than 4Gbp (1G=1 billion; controlled by -I), minimap creates one index and stream all the query sequences in one go. The multiple hits of a query sequence is adjacent to each other in the output. If the total length is greater than 4Gbp, minimap needs to read query sequences multiple times. The multiple hits of a query may not be adjacent.

lh3 / minimap

readme

Introduction

Usage

Algorithm Overview