TRUP is a pipeline designed for analyzing RNA-seq data from Tumor samples.
Cancer cells express many rearranged transcripts posing increased complexity to transcriptome analysis. As an unified pipeline, TRUP is designed to sensitively and accurately dissect the complexity of the cancer transcriptome by analyzing RNA-seq data obtained from tumour tissues. The current functionalities of TRUP include: 1) identification of fusion transcripts; 3) RNA-seq quality assesment; 2) Gene-read counting. The fusion detection module in TRUP combines split-read/read-pair mapping with regional de-novo assembly to achieve a balance between sensitivity and precision.
The hg19 and mm10 annoatation is downloadable via the link below:
hg19
https://drive.google.com/file/d/0B8NcFh4TSjvOQVN1anhrSVdLVjA/view?usp=sharing
https://drive.google.com/file/d/0B8NcFh4TSjvOWVlvOVVFWjRxZDQ/view?usp=sharing
mm10
https://drive.google.com/file/d/0B8NcFh4TSjvOaFNCM2FmLVU1eFE/view?usp=sharing
https://drive.google.com/file/d/0B8NcFh4TSjvOOFNXWDRKVUtUVzg/view?usp=sharing
All the annotation files and directories should be put under an central annotation directory whose name will be used for argument "--anno" when running TRUP. The structure of the annotation directory and sub-directories are illustrated as the following example:
Example:
TRUP_ANNOTATION / hg19 / hg19.SelfChain_UCSC.txt
TRUP_ANNOTATION / hg19 / hg19.bowtie2_index / *
| | |
central_directory / version / annotation files and directories
If users choose to prepare annotation directories and files by themselves, it can be achieved in the following way (using hg19 as an example): I am devising a script to generate the annotations automatically and also make them more consistent in terms of naming.
TRUP_ANNOTATION/hg19/
hg19.genome_UCSC.2bit (for running BLAT: runlevel 4) How to get: Use faToTwoBit to generate a 2.bit file from a genome fasta file (http://genome.ucsc.edu/goldenPath/help/blatSpec.html#faToTwoBitUsage)
hg19.transcriptome_length.txt (for generating html report: runlevel 2) How to get: Sum up the length of all exons for each chromosome. Each line is formatted as "chromosome sun_of_exons_length" (tab delimited).
hg19.genes_Ensembl.bed12 (for gene read counting: runlevel 2)
hg19.gencode/gencode.v14.annotation.gene.bed12 (not available for organism other than hg19) How to get: These 12 column bed files can be generated by my script "exon_union_gtf2bed.pl" under the TRUP directory src/ by using normal .gtf files downloadable from Ensembl ftp site for a specific organism.
hg19.transcripts_Ensembl.gtf (for tophat2 mapping and cufflinks: runlevel 2 and 5) How to get: This file can be downloaded from Ensembl ftp. use this to generate hg19.genes_Ensembl.bed12 mentioned above.
hg19.gencode/gencode.v14.annotation.gtf (not available for organism other than hg19) How to get: downloadable from UCSC table browser, use this to generate hg19.gencode/gencode.v14.annotation.gene.bed12 mentioned above.
hg19.gencode/gencode.v14.genemap (for runlevel 2. not available for organism other than hg19) How to get: use the script "gencode_annotation.map.pl" under TRUP/src and the gencode gtf file to generate this gene map file.
hg19.genes_RefSeq.bed12 (for runlevel 2. may not be used in the new version of TRUP)
hg19.SelfChain_UCSC.txt (for fusion detection: runlevel 4)
hg19.repeats_UCSC.gff (for fusion detection: runlevel 4) How to get: Downloadable from UCSC table browser
hg19.transcripts_Ensembl.gff How to get: turn hg19.transcripts_Ensembl.gtf to generate the corresponding gff file by using "gtf2gff3" (http://search.cpan.org/~lds/GBrowse-2.54/bin/gtf2gff3.pl)
hg19.biomart.txt How to get: use R package biomaRt or web converter at (http://www.ensembl.org/biomart/martview/), choose appropriate databse, feed in filter names (all ensembl gene id), and select attributes for "Associated Gene Name", "Gene Biotype", "Description", "EntrezGene ID", "WikiGene Name", "WikiGene Description".
hg19.gmap_index/hg19/hg19.* How to get: download from (http://research-pub.gene.com/gmap/) or follow instructions provided by gmap/gsnap
hg19.bowtie2_index/hg19/hg19.*
hg19.bowtie2_index/hg19_trans/hg19_known_ensemble_trans.* How to get: download from (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) or follow instructions provided by bowtie2. For annotation files under hg19_trans/, run tophat2 once with --transcriptome-index hg19.bowtie2_index/hg19_trans/hg19_known_ensemble_trans and --GTF hg19.transcripts_Ensembl.gtf to get tophat index for transcriptome (tophat2 can be stopped as long as the indexes are generated, thus user can use any fastq file as input for running tophat2, just to get the indexes). These transcriptome indexes can be used for later tophat2 calls.
No installation is required if pre-compiled binaries in bin/ are compatible with your system.
If pre-compiled binaries are incompatible with your system, you will need to rebuild them from source.
Make sure you have installed Bamtools (https://github.com/pezmaster31/bamtools). Then write down the bamtools_directory where ./lib/ and ./include/ sub-directories are located. Currently bamtools 2.3.0 has been tested. Also make sure you have zlib and boost installed. Then write down the zlib or boost_directory where ./lib/ and ./include/ sub-directories are located
run make in following way to install
$ make BAMTOOLS_ROOT=/bamtools_directory/ ZLIB_ROOT=/zlib_directory/ BOOST_ROOT=/boost_directory/
The binaries will be built at bin/. Any existing files will be overwritten.
TRUP can be run with UNIX command-line interface.
Assuming that in the directory "RP" there are two gzipped fastq-files are called as SAMPLE_R?1(_XXX)?.fq.gz and SAMPLE_R?2(_XXX)?.fq.gz (where "_R?1" and "_R?2" means mate 1 and mate 2 reads, XXX means a run ID, R and _XXX are optional that can be missing), the path to the directory of the pipeline is called "PD" and the path to tha annotation files are called "AD", one can run the pipeline in the following way:
Run-level 1 (Quality check, mapping of spiked-in read pairs and give a rough estimate of the insert size etc.):
$ perl RTrace.pl --runlevel 1 --sampleName SAMPLE --seqType p --readpool RP --root PD --threads TH --anno AD 2>>run.log
where "TH" is the number of computing threads. If one wants to stop after the quality check, add --QC
in the command call. --seqType
indicates sequencing type, either s
(single-end) or p
(paired-end). If your fastq files have names with a run ID (e.g., _001, _002, etc), usually produced by CASAVA, set runID
to 001, 002, etc. --Rbinary`` can be used to set binary name for R (if you have an alternative binary name for R)
Run-level 2 (mapping with gsnap [default] or STAR, generating reports):
$ perl RTrace.pl --runlevel 2 --sampleName SAMPLE --seqType p --readpool RP --root PD --anno AD --threads TH --WIG --patient ID --tissue type --threads TH --gf pdf 2>>run.log
where --WIG
is set to generate bigWiggle file for visualization purpose. --patient
and --tissue
can be set if user need to run edgeR and cuffdiff after processing all the samples. If a X11 is not always available, set --gf
to be pdf
to generate the report figures in pdf format instead of png. However, if X11 is available, do not set the --gf option.
Run-level 3 (collect regions containning potential breakpoints from the mapping of gsnap, perform regional assembly in the candidate regions):
$ perl RTrace.pl --runlevel 3 --sampleName SAMPLE --seqType p --readpool RP --root PD --anno AD --threads TH --RA 1 2>>run.log
where --RA
is set to 1 to indicate an independent regional assembly around each breakpoint.
Run-level 4 (Dectecting fusion events using the assembled transcripts):
$ perl RTrace.pl --runlevel 4 --sampleName SAMPLE --seqType p --readpool RP --root PD --anno AD --threads TH 2>>run.log
User can use GMAP instead of BLAT for mapping the assembled contigs back to the reference, in which case, --GMAP
should be set to use GMAP in this step. --misPen
can be adjusted to integer numbers higher than 2 (default) to suppress low quality blat results. --uniqueBase
can be set to integers more than 100 (default, which is exactly just one match, suggest for 300 maximum, which allows three times of matches to the reference) to allow multiple mapping segments to be granted as "unique" to increase sensitivity.
Run-level 5 (run cufflinks for gene/isoform quantification):
$ perl RTrace.pl --runlevel 5 --sampleName SAMPLE --seqType p --readpool RP --root PD --anno AD --threads TH --gtf-guide --known-trans refseq 2>>run.log
where --known-trans
can be set to 'ensembl' or 'refseq' to indicate the annotation to be used for cufflinks.
Run-level 6 (run cuffdiff for diffrential gene/isoform expression analysis):
$ perl RTrace.pl --runlevel 6 --root PD --anno AD --threads TH 2>>run.log
Run-level 7 (run edgeR for diffrential gene expression analysis):
$ perl RTrace.pl --runlevel 7 --root PD --anno AD --threads TH --priordf 5 --spaired 1 2>>run.log
where --priordf
sets the prior.df parameter in edgeR. --spaired
should be set to 1 if the samples are paired (paired normal-diese samples from the sample patient).
Note: do not set --lanename
for runlevel 6 and 7, as these two runlevels are dealing with multiple samples.
The Run-levels metioned above can also be combined in the same command line, as following,
$ perl RTrace.pl --runlevel 1-4 --sampleName SAMPLE --seqType p --readpool RP --root PD --anno AD --threads TH --WIG --patient ID --tissue type --gf pdf --RA 1 2>>run.log
One can also call combinations of different Run-levels, such as --$runlevel 1,2
or --runlevel 2,3,4
. But in these combined ways, one should specify all necessary options in the command line.
Runlevel dependencies (->): 4->3->2->1, 6->5->2->1, 7->2->1
check the options by running the program without options or with --help
.
Sun Ruping
If you are willing to receive updates and notice, send an email to regularhand@gmail.com.
Current Affiliation: Califano Lab, Department of Systems Biology, Columbia University, New York, NY, USA
Previous Affiliation: Dept. Vingron (Computational Molecular Biology) Max Planck Institute for Molecular Genetics. Ihnestr. 63-73, D-14195 Berlin, Germany
Email: regularhand@gmail.com
Project Website: https://github.com/ruping/TRUP