matthiasblanke / App-SpaM

Alignment-free Phylogenetic Placement using filtered SPAced word Matches
GNU General Public License v3.0

Alignment-free phylogenetic placement algorithm based on Spaced-word Matches

Alignment-free phylogenetic placement algorithm based on SPAced-word Matches (App-SpaM) is a tool for phylogenetic placement. Phylogenetic placement is the task of placing (usually short) query sequences of unknown taxonomic origin into an existing phylogeny of reference sequences. The input normally consists of three files: the reference sequences (FASTA), the reference phylogeny (Newick), and the query sequences (FASTA).

App-SpaM places each query sequence into the reference phylogeny at a phylogenetically appropriate position. The placement is based on the concept of filtered spaced-word matches (FSWM) [1]. Depending on the chosen placement heuristic, App-SpaM uses either the number of spaced-word matches between query and references, or the estimated number of nucleotide substitutions per sequence position between query and references. The output is a JPlace [2] file containing all query placements.
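To make the notion of a spaced word concrete, here is a minimal sketch in shell. It is only an illustration and not App-SpaM's implementation: the helper name `spaced_words`, the toy sequence, and the short pattern `1101` are assumptions chosen for readability. A pattern is a string of 1s (match positions) and 0s (don't-care positions); the spaced word at a given position keeps the nucleotides under the 1s and ignores those under the 0s (printed as `*` here).

```shell
# Hypothetical helper (not part of App-SpaM): print all spaced words
# of a sequence under a binary pattern of match (1) and don't-care (0) positions.
spaced_words() {
    seq=$1
    pattern=$2
    awk -v s="$seq" -v p="$pattern" 'BEGIN {
        n = length(s); k = length(p)
        # Slide the pattern over the sequence, one position at a time.
        for (i = 1; i + k - 1 <= n; i++) {
            w = ""
            for (j = 1; j <= k; j++)
                w = w ((substr(p, j, 1) == "1") ? substr(s, i + j - 1, 1) : "*")
            print w
        }
    }'
}

# Toy example: pattern 1101 has weight 3 and one don't-care position.
spaced_words ACGTAC 1101
# prints:
# AC*T
# CG*A
# GT*C
```

Two sequences share a spaced-word match when the same spaced word occurs in both, regardless of what is found at the don't-care positions.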

We plan to make App-SpaM available in PEWO, the Placement Evaluation WOrkflows [3], a tool developed to rapidly test and compare tools for phylogenetic placement.

When using App-SpaM, please cite: doi

Installation

Prerequisites

If Git and CMake are not already installed on your system, install them; on Ubuntu, for example:

sudo apt-get install git
sudo apt-get install cmake

At the moment, App-SpaM is parallelized with OpenMP. Most modern compilers support OpenMP, but it is advisable to update your compiler to the newest version. If you experience problems during compilation, do not hesitate to contact us.
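Before building, it can help to confirm that the required tools are actually on your PATH. The following is a small sketch; the function name `check_prereqs` is hypothetical and not part of App-SpaM, and the list of tools to check is up to you.

```shell
# Hypothetical helper: report which of the given tools are missing from PATH.
# Returns 0 if all tools were found, 1 otherwise.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Check the tools named in this section.
check_prereqs git cmake || echo "install the missing tools before building"
```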

Installing App-SpaM

On the command line, download the newest version of App-SpaM (alternatively, download and unpack the .zip archive from this GitHub page):

git clone https://github.com/matthiasblanke/App-SpaM

Navigate into the App-SpaM directory, create a build folder, and navigate into it:

cd App-SpaM
mkdir build
cd build

Build the program with CMake:

cmake ..
make

Running App-SpaM

From within the build directory you can just run App-SpaM like so:

./appspam -h

If you want to run it from anywhere, add the directory containing the appspam executable to your PATH:

export PATH=$PATH:~/path/to/appspam

If you want this change to be permanent, add this line to your ~/.profile or ~/.bash_profile.

Placing query reads

When running App-SpaM, you need to specify at least the three input files: the reference sequences (-s), the reference phylogeny (-t), and the query sequences (-q):

./appspam -s path/to/references.fasta -t path/to/referencetree.nwk -q path/to/query.fasta

The paths can be either absolute or relative to your current working directory. All other parameters are set to their default values, and all output files are written to your current working directory. You can specify the output location and file name with the flag -o, e.g.:

./appspam -s references.fasta -t referencetree.nwk -q query.fasta -o path/to/output.jplace

If other output files are produced (see below) they will be placed in the same folder as the JPlace file.
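The invocation above can be wrapped in a small function that checks the three input files before starting a run. This is only a sketch: the function name `place_queries` is hypothetical, it assumes the appspam executable is on your PATH, and it uses only the flags described in this section (-s, -t, -q, -o).

```shell
# Hypothetical wrapper around appspam: validate the inputs, then place the queries.
# Usage: place_queries references.fasta referencetree.nwk query.fasta [output.jplace]
place_queries() {
    refs=$1
    tree=$2
    queries=$3
    out=${4:-appspam.jplace}   # same default name as appspam itself uses

    # Fail early with a clear message if any input file is unreadable.
    for f in "$refs" "$tree" "$queries"; do
        if [ ! -r "$f" ]; then
            echo "error: cannot read input file: $f" >&2
            return 1
        fi
    done

    # Assumption: appspam is on PATH (see "Running App-SpaM").
    appspam -s "$refs" -t "$tree" -q "$queries" -o "$out"
}
```

Checking the inputs first keeps a typo in a path from surfacing only after the program has started.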

Using unassembled references

App-SpaM can perform phylogenetic placement based on unassembled reference sequences. To enable this, use the -u or --unassembled flag.

In this mode you still supply only one FASTA file for the reference sequences. All sequences within this file that share the same prefix before the specified separator are regarded as originating from the same reference sequence. For example, the following FASTA file is interpreted as having only two references named Seq1 and Seq2, each consisting of two sequences:

>Seq1-1
AAAA
>Seq1-2
CCCC
>Seq2-a
GGGG
>Seq2-b
TTTT

The sequence names before the separator (Seq1 and Seq2) must be identical to the sequence names in the reference tree. The default separator is set to - but can be changed to any other string using the --delimiter argument.
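To check how a reference file will be grouped before running App-SpaM, the prefix rule can be mimicked in shell. This is a sketch under stated assumptions, not App-SpaM's own parsing code: the helper name `list_reference_groups` is hypothetical, the reference name is taken to be everything before the first occurrence of the delimiter, and each header line is assumed to contain only the sequence name.

```shell
# Recreate the example FASTA file from above in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/example_refs.fasta" <<'EOF'
>Seq1-1
AAAA
>Seq1-2
CCCC
>Seq2-a
GGGG
>Seq2-b
TTTT
EOF

# Hypothetical helper: print each inferred reference name and its number of sequences.
# Assumption: the name is the part of the header before the FIRST delimiter.
list_reference_groups() {
    fasta=$1
    delim=${2:--}   # default delimiter "-"
    awk -v d="$delim" '
        /^>/ {
            header = substr($0, 2)          # drop the leading ">"
            pos = index(header, d)          # position of the delimiter, 0 if absent
            name = (pos > 0) ? substr(header, 1, pos - 1) : header
            count[name]++
        }
        END { for (name in count) print name, count[name] }
    ' "$fasta" | sort
}

list_reference_groups "$tmpdir/example_refs.fasta" -
# prints:
# Seq1 2
# Seq2 2
```

The names in the first column are the ones that must also appear in the reference tree.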

Parameters

There are several other parameters that can influence the accuracy, speed, and output of App-SpaM:

Performance

| Parameter | Full name | Default | Info |
|---|---|---|---|
| -w | --weight | 12 | Weight of pattern (number of match positions (1s)). Higher weight generally leads to faster computation, but on small datasets it may result in too few spaced words, resulting in low accuracy. |
| -d | --dontCare | 32 | Number of don't care positions in pattern (number of 0s). |
| -p | --pattern | 10 | Number of patterns used. For every pattern, spaced words are extracted from the sequences. Use fewer patterns for faster running speeds. |
| -o | --out_jplace | appspam.jplace | Path and name of output JPlace file. |
| -g | --mode | LCACOUNT | Assignment mode; determines how a placement position is chosen from the calculated reference-query distances. For more information see the paper. Possible values are: MINDIST, SPAMCOUNT, LCADIST, LCACOUNT, APPLES... |
| -u | --unassembled | | Enables support for unassembled references, see above. |
| | --delimiter | "-" | Specifies the delimiter in reference names when unassembled mode is used. All reads from the same reference should have this delimiter in their name. They are then regarded as one reference sequence. |
| -h | --help | | Show help and exit. |

Further

| Parameter | Full name | Default | Info |
|---|---|---|---|
| -v | --verbose | | Outputs additional information about the current run on the standard output. |
| | --threads | 1 | Specify number of threads to use. |
| | --write-histogram | | Write a histogram of all spaced word matches to file histogram.txt. |
| | --write-scoring | | Write file with all pairwise distances between references and queries to file scoring_table.txt. |
| | --threshold | 0 | Specifies filtering threshold of spaced word filtering procedure. |
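As an illustration of how these parameters combine, the sketch below runs the same input once per placement mode and writes one JPlace file per mode. The function name `compare_modes`, the `DRY_RUN` variable, and the output file naming are assumptions made for this example; only the flags -s, -t, -q, -g, and -o and the four mode names come from the table. With `DRY_RUN` set, the commands are printed instead of executed.

```shell
# Hypothetical helper: run (or, with DRY_RUN set, only print) one appspam call per mode.
# Usage: compare_modes references.fasta referencetree.nwk query.fasta
compare_modes() {
    refs=$1
    tree=$2
    queries=$3
    for mode in MINDIST SPAMCOUNT LCADIST LCACOUNT; do
        # ${DRY_RUN:+echo} expands to "echo" when DRY_RUN is non-empty,
        # and to nothing otherwise, in which case appspam is executed.
        ${DRY_RUN:+echo} appspam -s "$refs" -t "$tree" -q "$queries" \
            -g "$mode" -o "placements_${mode}.jplace"
    done
}

# Preview the commands without running appspam.
DRY_RUN=1 compare_modes references.fasta referencetree.nwk query.fasta
```

Unset `DRY_RUN` to execute the runs; this assumes appspam is on your PATH.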

Further help

Write to matthias.blanke@biologie.uni-goettingen.de

Publication

Associated publication

doi

Links

1: C.-A. Leimeister, S. Sohrabi-Jahromi, B. Morgenstern (2017) Fast and Accurate Phylogeny Reconstruction using Filtered Spaced-Word Matches. Bioinformatics 33, 971-979. https://doi.org/10.1093/bioinformatics/btw776 https://github.com/burkhard-morgenstern/FSWM

2: Matsen FA, Hoffman NG, Gallagher A, Stamatakis A (2012) A Format for Phylogenetic Placements. PLoS ONE 7(2): e31009. https://doi.org/10.1371/journal.pone.0031009

3: Benjamin Linard, Nikolai Romashchenko, Fabio Pardi, Eric Rivals. PEWO: a collection of workflows to benchmark phylogenetic placement. Bioinformatics, btaa657. https://doi.org/10.1093/bioinformatics/btaa657 https://github.com/phylo42/PEWO