tszalay / poreseq

Error correction and variant calling algorithm for nanopore sequencing
25 stars 4 forks source link

PoreSeq

The PoreSeq utility is an open source program and Python library for de novo sequencing, consensus and variant calling on data from Oxford Nanopore Technologies' MinION platform. Features include:

Update

As Jared Simpson explains in this post, there are many new tools available for assembling ONT data without having to do a cumbersome error-correction step using PoreSeq or nanocorrect. Users are encouraged to first assemble with those tools, followed with a post-assembly error correction/variant calling/etc using tools such as this one or nanopolish.

Install

The recommended way to install poreseq is from the source using git and pip:

  1. Checkout the source: git clone git://github.com/tszalay/poreseq.git
  2. In the main directory, run pip install -e . --user or sudo pip install -e .

Note that for a userspace install, ~/.local/bin needs to be on your PATH.

Or, using virtualenv:

  1. Checkout the source: git clone git://github.com/tszalay/poreseq.git
  2. Create virtual environment: virtualenv VE-PORESEQ
  3. Activate it: . VE-PORESEQ/bin/activate
  4. Install numpy if needed: pip install numpy
  5. And Cython: pip install cython
  6. And poreseq into the virtual environment: cd poreseq
  7. pip install -e .

The Python requirements are handled by setuptools, while some of the utility scripts depend on the LAST aligner and the Celera/PBcR assembler. If you would like to change these requirements, edit the scripts in the scripts/ folder (and reinstall). Note that the poreseq_assemble script, in addition to needing PBcR, also needs the correct path to poreseq.spec and should be updated as such.

Examples

The basic aspects of the usage are described below. For any command, the -h flag displays help.

Sequence extraction from fast5 files:

Overlap alignment via LAST (outputs to allaligns.bam):

Split fasta into 10kb regions in 16 files for processing:

Error correction (correct the regions in the 15th file as defined by the splitting above):

Error correction (manually correct a specific region of a read from allreads):

Merge previously split and corrected sequences:

Assemble sequences using Celera assembler (with 4 threads):

Train skip and stay parameters on bases 46kb-48kb of a known reference (with 16 threads):

Score mutations listed in mutations.txt against reference:

Example mutations.txt (start, original, mutated):

11      AC    G
1994    .     C
2301    A     .

(note that start index is 0-based)

Example of Python package usage:

import poreseq

params = poreseq.LoadParams('./params.conf')
reginfo = poreseq.RegionInfo('10000:20000')  # poreseq.RegionInfo() for whole strand
pa = poreseq.LoadAlignedEvents('reference.fasta','alignment.bam','/media/run-data',reginfo,params)
# pa is now a PSAlign class, pa.sequence is the consensus/ref sequence
# pa.events are the loaded events and pa.params is the same params as loaded above
# get likelihood score of each event, also realign events to ref.
pa.ScoreEvents()
# display which reference bases are aligned to by current levels 2000:3000 for event 2
# note that these are 1-based for internal reasons
print pa.events[2].ref_align[2000:3000]

At the time of writing, “poreseq consensus” takes around 2 minutes to error-correct a 1 kb region at 10X coverage, meaning that it can take tens of hours on a single CPU to error-correct the fragments before assembling λ DNA, or thousands of CPU-hours to correct bacterial genomes. If the coverage is sufficient, it is more efficient to assemble without doing poreseq error correction, and then using poreseq to align and refine the resulting assembly:

Additionally, if only a specific region of the genome is desired, poreseq can simply error correct that region, greatly reducing the time required.