rdpstaff / Framebot

Dynamic programming based frame shift detection and correction tool with nearest neighbor classification.
GNU General Public License v3.0
7 stars 7 forks source link

FrameBot version 1.2 Last Update Date: 03/28/2015 Contact: Qiong Wang at wangqion@msu.edu, RDP Staff at rdpstaff@msu.edu


Introduction

RDP FrameBot is a frameshift correction and nearest neighbor classification tool for use with high-throughput amplicon sequencing. It uses a dynamic programming algorithm to align each query DNA sequence against a set of target protein sequences, produces frameshift-corrected protein and DNA sequences and an optimal global or local protein alignment. It also helps filter out non-target reads. The online version of FrameBot is available on http://fungene.cme.msu.edu/FunGenePipeline. Read the quick tutorial at http://rdp.cme.msu.edu/tutorials/framebot/RDPtutorial_FRAMEBOT.html before you start.

CITATION: Wang, Q., Quensen, J. F., Fish, J. A., Lee, T. K., Sun, Y., Tiedje, J. M., and Cole, J. R. 2013. Ecological patterns of nifH genes in four terrestrial climatic zones explored with targeted metagenomics using FrameBot, a new informatics tool. mBio 4:e00592-13; doi: 10.1128/mBio.00592-13.

Copyright 2012 Ribosomal Database Project This project is distributed under the terms of the GNU GPLv3


How to Build FrameBot

This project depends on https://github.com/rdpstaff/ReadSeq, https://github.com/rdpstaff/SeqFilters and https://github.com/rdpstaff/AlignmentTools. See RDPTools (https://github.com/rdpstaff/RDPTools) to install.

ant jar to build jar files.


How to Run FrameBot

  1. Reference set preparation The refset directory is preloaded with reference sets for a list of genes. If not found, provide your own set of reference sequences representative of the gene of interest (http://fungene.cme.msu.edu is a good resource). The reference set contains protein or DNA representative sequences of the gene target and should be compiled to have a good coverage of diversity of the gene family. FrameBot is significantly more accurate when the nearest target protein sequence (from the reference set) is at least 50% identical to the query read. Running FrameBot is computationally intensive in no-metric-search mode because it performs all-against-all comparisons between query DNA and the target protein sequences. Therefore we recommend limiting your reference set to 200 protein sequences for no-metric-search mode. The index metic-search mode gains more than 10-fold speedup by reducing the number of comparisons (see FrameBot citation). A larger DNA reference set can be used.

  2. Frameshift correction and nearest neighbor assignment The is the main step to run frameshift correction. It produces frameshift-corrected protein and DNA sequences and an optimal global or local protein alignment. The nearest matching sequences in the pairwise alignment can be used as the nearest neighbor assignment. Any query sequence that does not satisfy the minimum length and protein identity is considered as non-target read and put in the failed output files.

usage: FramebotMain [options]

[Required] One protein reference fasta file or index file, and one DNA query fasta file are required.

[Options] -a,--alignment-mode Alignment mode: glocal, local or global (default = glocal) -b,--denovo-abund-cutoff minimum abundance for de-novo mode. default = 10 -d,--denovo-id-cutoff maxmimum aa identity cutoff for de-novo mode. default = 0.7 -e,--gap-ext-penalty gap extension penalty for no-metric-search ONLY. Default is -1 -f,--frameshift-penalty frameshift penalty for no-metric-search ONLY. Default is -10 -g,--gap-open-penalty gap opening penalty for no-metric-search ONLY. Default is -10 -i,--identity-cutoff Percent identity cutoff [0-1] (default = .4) -k,--knn The top k closest protein targets. (default = 10) -l,--length-cutoff Length cutoff in number of amino acids (default = 80) -m,--max-radius maximum radius for metric-search ONLY, range [1-2147483647], default uses the maxRadius specified in the index file -N,--no-metric-search Disable metric search (provide fasta file of seeds instead of index file) -o,--result-stem Result file name stem (default=stem of query nucl file) -P,--no-prefilter Disable the pre-filtering step for non-metric search. -q,--quality-file Sequence quality data -t,--transl-table Protein translation table to use (integer based on ncbi's translation tables, default=11 bacteria/archaea) -w,--word-size The word size used to find closest protein targets. (default = 4, recommended range [3 - 6]) -x,--scoring-matrix the protein scoring matrix for no-metric-search ONLY. Default is Blosum62 -z,--de-novo Enable de novo mode to add abundant query seqs to refset

The framebot step produces six output files: _framebot.txt - the alignment to the nearest match satisfying the minimum length and protein identity cutoff. _nucl_corr.fasta and all_seqs_derep_prot_corr.fasta - the frameshift-corrected nucleotide and protein sequences satisfying the minimum length and protein identity cutoff. _failed_framebot.txt - the alignment to the nearest match that failed the minimum length and protein identity cutoff. _nucl_failed.fasta - fasta file containing the nucleotide sequences that failed the minimum length and protein identity cutoff.

[Example command from a terminal] java -jar /PATH/FrameBot.jar framebot -o nifH_test nifH_test.index example/nifH_test_query.fa

[Note] FrameBot uses a kmer pre-filtering heuristic for no-metric-search. This pre-filtering may increase the speed by one to two orders of magnitude. Using this heuristic may cause FrameBot to return a different equally high-scoring (or occasionally almost as high) reference sequence. Pre-filtering is the default setting. Use option "-P" to disable the pre-filtering if necessary.

The option "Add de novo References" may help with genes with high diversity or lack of closely related reference sequences in the reference set (such as biphenyl dioxygenase). The de novo strategy was designed by Michal Strejcek from Dr. Ondrej Uhlík group at Institute of Chemical Technology Prague. This is based on the assumption that abundant sequences are more likely to be correct. The experimental sequences are dereplicated and sorted by abundance in descending order first. Each query is tested against the reference set. If a query doesn't have a close reference with above 70% aa identity, the corresponding protein sequence of the query will be added to the reference set if the following criteria are met:

  1. Length Cutoff and Identity Cutoff.
  2. The abundance is above certain cutoff, default is 10
  3. No frameshifts or stop codon present.

[To run FrameBot using the de novo strategy, use the following commands from a terminal] java -jar /PATH/Clustering.jar derep --sorted -o all_seqs_derep.fasta all_seqs.ids all_seqs.samples query_nucl.fasta java -jar /PATH/FrameBot.jar framebot -N -o bpha --de-novo bpha_ref.fasta all_seqs_derep.fasta mkdir filtered_mapping java -jar /PATH/Clustering.jar refresh-mappings bpha_corr_prot.fasta all_seqs.ids all_seqs.samples filtered_mapping/filtered_ids.txt filtered_mapping/filtered_samples.txt mkdir filtered_sequences java -jar /PATH/Clustering.jar explode-mappings -w -o filtered_sequences filtered_mapping/filtered_ids.txt filtered_mapping/filtered_samples.txt bpha_corr_prot.fasta

  1. Building index for metric indexed search This step builds an index file based on the input DNA sequences using global pairwise alignment mode. The parameters and metric scoring matrix are also stored in the index file. The reference DNA sequences should cover the exact same protein-coding region.

usage: FramebotIndex [options] [Required] One DNA reference fasta file, and output index file are required.

[Options] -e,--gap-ext-penalty gap extension penalty. Default is -4 -f,--frameshift-penalty frameshift penalty. Default is -10 -g,--gap-open-penalty gap opening penalty. Default is -13 -m,--max-radius maximum radius for metric-search ONLY, range [1-2147483647]>, default is Integer.MAX_VALUE: 2147483647 -t,--transl-table Protein translation table to use (integer based on ncbi's translation tables, default=11 bacteria/archaea) -x,--scoring-matrix the metric protein scoring matrix. Default is blosum62_metric.txt from Weijia Xu's thesis: On Integrating Biological Sequence Analysis with Metric Distance

[Example command from a terminal] java -jar /PATH/dist/FrameBot.jar index example/nifH_test_refseq_nucl.fa nifH_test.index

  1. Convert FrameBot outputs to different output formats The output _framebot.txt file contains the detailed pairwise protein alignment for each sequence. This tool generates a few type of statistics outputs that can be useful as input for third party tools, or spotting problems in the query or reference set. For example, the matrix output is equivalent to a data matrix that can be used in R ordination analysis (see FrameBot publication). IF the hist output showes large portion of query sequences shared less than 50% to the reference sequences, it indicates a broader reference set is needed.

usage: GetFrameBotStatMain [options]

[Required] One FrameBot pairwise alignment file or a directory and an output file are required.

[Options] -d,--subject-description the description of the reference seq, tab-delimited file -i,--id-mapping Id mapping file. Output from Dereplicator (http://fungene.cme.msu.edu/FunGenePipeline/derep/form.spr). -s,--sample-mapping Sample mapping file. Output from Dereplicator (http://fungene.cme.msu.edu/FunGenePipeline/derep/form.spr). -t,--stat-type stat | hist | summary | matrix | subset stat ouptuts the # of seqs passed FrameBot, # of frameshifts for each sample hist outputs a nearest match refseq, description and # of seqs close to the refseq at different identity% ranges summary outputs a list of subject(refseq), description and # of seqs close to the subject matrix outputs the number of sequences to the nearest match. The format is similar to a data matrix used for R analysis subset outputs the number of sequences to the nearest match for only sequence IDs in sample mapping file

[Example command from a terminal] java -jar /PATH/dist/FrameBot.jar stat -t matrix -i example/ids.txt -s example/samples.txt nifH_test_framebot.txt nifH_test_stat.txt

  1. Miscellaneous help tool -- random sequence selection This tool randomly selects a subset of sequence IDs from the sample Mapping file, same number of sequences for each sample. Can specify the number of sequences to be selected, or by default select the smallest size of all the samples. The resulting subset sampleMapping file can then be used by GetFrameBotStatMain to generate a data matrix for R analysis.

usage: RdmSelectSampleMapping [options]

[Required] A sampleMapping file and out file are required.

[Options] -n,--num-selection number of sequence IDs for each sample. Default is the smallest sample size -x,--exclude-samples list of sample names to be excluded from selection

[Example command from a terminal] java -jar /PATH/dist/FrameBot.jar rdmselect rdmselect example/samples.txt subsetSamples.txt java -jar /PATH/dist/FrameBot.jar stat -t matrix -s subsetSamples.txt nifH_test_framebot.txt nifH_test_stat_subset.txt