lh3 / minimap

This repo is DEPRECATED. Please use minimap2, the successor of minimap.
https://github.com/lh3/minimap2
MIT License
106 stars 29 forks source link

Introduction

Minimap is an experimental tool to efficiently find multiple approximate mapping positions between two sets of long sequences, such as between reads and reference genomes, between genomes and between long noisy reads. By default, it is tuned to have high sensitivity to 2kb matches around 20% divergence but with low specificity. Minimap does not generate alignments as of now and because of this, it is usually tens of times faster than mainstream aligners. With four CPU cores, minimap can map 1.6Gbp PacBio reads to human in 2.5 minutes, 1Gbp PacBio E. coli reads to pre-indexed 9.6Gbp bacterial genomes in 3 minutes, to pre-indexed >100Gbp nt database in ~1 hour (of which ~20 minutes are spent on loading index from the network filesystem; peak RAM: 10GB), map 2800 bacteria to themselves in 1 hour, and map 1Gbp E. coli reads against themselves in a couple of minutes.

Minimap does not replace mainstream aligners, but it can be useful when you want to quickly identify long approximate matches at moderate divergence among a huge collection of sequences. For this task, it is much faster than most existing tools.

Usage

Algorithm Overview

  1. Indexing. Collect all (w,k)-minimizers in a batch (-I=4 billion bp) of target sequences and store them in a hash table. Mark top -f=0.1% of most frequent minimizers as repeats. Minimap uses invertible hash function to avoid taking ploy-A as minimizers.

  2. For each query, collect all (w,k)-minimizers and look up the hash table for matches (qi,ti,si), where qi is the query position, ti the target position and si indicates whether the minimizer match is on the same strand.

  3. For matches on the same strand, sort by {qi-ti} and then cluster matches within a -r=500bp window. Minimap merges two windows if -m=50% of minimizer matches overlap. For matches on different strands, sort {qi+ti} and apply a similar clustering procedure. This is inspired by the Hough transformation.

  4. For each cluster, sort (qi,ti) by qi and solve a longest increasing sequence problem for ti. This finds the longest co-linear matching chain. Break the chain whenever there is a gap longer than -g=10000.

  5. Output the start and end of the chain if it contains -c=4 or more minimizer matches and the matching length is no less than -L=40.

  6. Go to 1 and rewind to the first record of query if there are more target sequences; otherwise stop.

To increase sensitivity, we may decrease -w to index more minimizers; we may also decrease -k, though this may greatly impact performance for mammalian genomes.

Also note that by default, if the total length of target sequences is less than 4Gbp (1G=1 billion; controlled by -I), minimap creates one index and stream all the query sequences in one go. The multiple hits of a query sequence is adjacent to each other in the output. If the total length is greater than 4Gbp, minimap needs to read query sequences multiple times. The multiple hits of a query may not be adjacent.