NAME psytrans.py Parasite & Symbiont Transcriptome Separation
SYNOPSIS python psytrans.py [QUERIES] [-H FILE] [-S FILE] [OPTIONS] python psytrans.py [QUERIES] [-b BLASTRESULTSFILE] [OPTIONS]
DESCRIPTION psytrans.py separates the sequences of a host species from those of its main symbiont(s) or parasite(s) based on Support Vector Machine classification. The program takes as input a file in fasta format with the sequences to be classified. The program also requires a file with sequences of a species related to the host, and a file with sequences related to the symbiont (or parasite). The queries will be compared to these two files using BLASTX. Alternatively, the user can provide the output of pre-computed BLASTX searches (in tabular format: -outfmt 6 or 7). The classification is then carried out using the command line tools from libsvm.
DEPENDENCIES psytrans requires makeblastdb and blastx from the NCBI blast+ distribution, unless the user provides pre-computed blast results. psytrans also requires a few command line utilities from libsvm: svm-scale, svm-train and svm-predict
OPTIONS Generic Program Information
-h, --help
Print a usage message briefly summarizing the command-line options.
Global options
-R, --restart
Restart the script from the last checkpoint.
-p, --nbThreads
Number of threads to use for the blast searches and for the SVM training.
-V, --verbosemode
Runs the script in verbose mode.
-t, --tempDir
Specify the name of the temporary directory.
-X, --clearTemp
Clears all temporary data in the temporary directory upon completion.
-z, --stopAfter
Choices:['db','runBlast','parseBlast','kmers','SVM']
This option allows the user to choose whether the process should stop, once the process has completed a specific stage.
db refers to the database creation stage;
runBlast refers to the BLAST search stage;
parseBlast refers to the separation of unambiguous and ambiguous sequence stage;
kmers refers to the preparation of SVM input stage;
SVM refers to the SVM training and testing stage.
Preparation of training set options
-e, --maxBestEvalue
Set the maximum value for the best e-value to be used to classify unambiguous sequences.
-n, --numberOfSeq
Set the maximum number of training & testing sequences.
Kmer parameters
-c, --minWordSize
Set the minimum value of DNA word length.
-k, --maxWordSize
Set the maximum value of DNA word length.
EXAMPLE You start with an assembly containing a mixture of sequences from a host A and a symbiont (or parasite) B: host_and_symb.fasta You also provide a file with proteins from a species related to the host: related_host_proteins.fasta and a file with proteins from a species related to the symbiont: related_symb_proteins.fasta You can then start the Psytrans process as follows:
python psytrans.py host_and_symb.fasta -H related_host_proteins.fasta -S related_symb_proteins.fasta
Wait a few hours (depending on the number of sequences), and in the current directory you will find a new file starting with the prefix `host_' and another file starting with the prefix `symb_', corresponding to the sequences classified as host or symbiont (or parasite) respectively.
To run the program using 8 threads and using /home/user/tmp as a temporary directory, use the following command:
python psytrans.py host_and_symb.fasta -H related_host_proteins.fasta -S related_symb_proteins.fasta -p 8 -t /home/user/tmp
AUTHOR Written by Sylvain Forêt and Jue-Sheng Ong.
REPORTING BUGS Report bugs at sylvain.foret@anu.edu.au Psytrans repository https://github.com/sylvainforet/psytrans
COPYRIGHT Copyright © 2014 Sylvain Forêt & Jue-Sheng Ong.
psytrans is a free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under the terms of the GNU General Public License
versions 3 or later. For more information about these matters see http://www.gnu.org/licenses/.