jueshengong / psytrans

Parasite Symbiont Transcriptome Separation
1 stars 3 forks source link

NAME psytrans.py Parasite & Symbiont Transcriptome Separation

SYNOPSIS python psytrans.py [QUERIES] [-H FILE] [-S FILE] [OPTIONS] python psytrans.py [QUERIES] [-b BLASTRESULTSFILE] [OPTIONS]

DESCRIPTION psytrans.py separates the sequences of a host species from those of its main symbiont(s) or parasite(s) based on Support Vector Machine classification. The program takes as input a file in fasta format with the sequences to be classified. The program also requires a file with sequences of a species related to the host, and a file with sequences related to the symbiont (or parasite). The queries will be compared to these two files using BLASTX. Alternatively, the user can provide the output of pre-computed BLASTX searches (in tabular format: -outfmt 6 or 7). The classification is then carried out using the command line tools from libsvm.

DEPENDENCIES psytrans requires makeblastdb and blastx from the NCBI blast+ distribution, unless the user provides pre-computed blast results. psytrans also requires a few command line utilities from libsvm: svm-scale, svm-train and svm-predict

OPTIONS Generic Program Information

   -h, --help
          Print a usage message briefly summarizing the command-line options.

Global options

   -R, --restart
          Restart the script from the last checkpoint.

   -p, --nbThreads
          Number of threads to use for the blast searches and for the SVM training.

   -V, --verbosemode
          Runs the script in verbose mode.

   -t, --tempDir
          Specify the name of the temporary directory.

   -X, --clearTemp
          Clears all temporary data in the temporary directory upon completion.

   -z, --stopAfter
          Choices:['db','runBlast','parseBlast','kmers','SVM']
          This option allows the user to choose whether the process should stop, once the process has completed a specific stage.
          db refers to the database creation stage;
          runBlast refers to the BLAST search stage;
          parseBlast refers to the separation of unambiguous and ambiguous sequence stage;
          kmers refers to the preparation of SVM input stage;
          SVM refers to the SVM training and testing stage.

Preparation of training set options

   -e, --maxBestEvalue
          Set the maximum value for the best e-value to be used to classify unambiguous sequences.

   -n, --numberOfSeq
          Set the maximum number of training & testing sequences.

Kmer parameters

   -c, --minWordSize
          Set the minimum value of DNA word length.

   -k, --maxWordSize
          Set the maximum value of DNA word length.

EXAMPLE You start with an assembly containing a mixture of sequences from a host A and a symbiont (or parasite) B: host_and_symb.fasta You also provide a file with proteins from a species related to the host: related_host_proteins.fasta and a file with proteins from a species related to the symbiont: related_symb_proteins.fasta You can then start the Psytrans process as follows:

          python psytrans.py host_and_symb.fasta  -H related_host_proteins.fasta -S related_symb_proteins.fasta

   Wait a few hours (depending on the number of sequences), and in the current directory you will find a new file starting with the prefix `host_' and another file starting with the prefix `symb_', corresponding to the sequences classified as host or symbiont (or parasite) respectively.

   To run the program using 8 threads and using /home/user/tmp as a temporary directory, use the following command:

          python psytrans.py host_and_symb.fasta  -H related_host_proteins.fasta -S related_symb_proteins.fasta -p 8 -t /home/user/tmp

AUTHOR Written by Sylvain Forêt and Jue-Sheng Ong.

REPORTING BUGS Report bugs at sylvain.foret@anu.edu.au Psytrans repository https://github.com/sylvainforet/psytrans

COPYRIGHT Copyright © 2014 Sylvain Forêt & Jue-Sheng Ong.

   psytrans  is a free  software and comes with ABSOLUTELY NO WARRANTY.  You are welcome to redistribute it under the terms of the GNU General Public License
   versions 3 or later.  For more information about these matters see http://www.gnu.org/licenses/.