Version 2.0.6
A pipeline for profiling TE-derived small RNAs.
Created by Wen-Wei Liao, Kat O'Neill & Molly Gale Hammell, March 2017
Contact: mghcompbio@gmail.com
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
$ conda config --add channels conda-forge
$ conda config --add channels bioconda
$ git clone https://github.com/mhammell-laboratory/TEsmall.git
$ cd TEsmall
$ conda env create -f environment.yaml -n TEsmall
$ conda activate TEsmall
$ python setup.py install
Before running TEsmall, make sure the environment is activated:
$ conda activate TEsmall
For example, to run TEsmall on two FASTQ files, Parental_1.fastq.gz
and DroKO_1.fastq.gz:
$ TEsmall -f Parental_1.fastq.gz DroKO_1.fastq.gz -l Parental DroKO
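With more than a couple of libraries, typing the -f and -l lists by hand becomes error-prone. The following is a minimal Python sketch for assembling the command. It assumes that a usable label can be derived from each file name by stripping the FASTQ extension. The helper names label_from_fastq and build_tesmall_command are hypothetical and not part of TEsmall; only the -f and -l flags come from the usage shown above.

```python
import os


def label_from_fastq(path):
    """Derive a sample label from a FASTQ file name (assumed naming convention)."""
    name = os.path.basename(path)
    for ext in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


def build_tesmall_command(fastq_files):
    """Build the TEsmall argument list with one unique label per FASTQ file."""
    labels = [label_from_fastq(f) for f in fastq_files]
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique: %s" % labels)
    return ["TEsmall", "-f"] + list(fastq_files) + ["-l"] + labels


cmd = build_tesmall_command(["Parental_1.fastq.gz", "DroKO_1.fastq.gz"])
print(" ".join(cmd))
# To launch the run from inside the activated environment, one option is:
#   import subprocess; subprocess.run(cmd, check=True)
```

Note that this sketch yields the labels Parental_1 and DroKO_1 rather than the shorter Parental and DroKO used in the example above; adjust label_from_fastq to match your own naming scheme.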
When the run is finished, deactivate the environment:
$ conda deactivate
To choose the directory where TEsmall downloads and reads the genomes
it uses for annotation, specify it at runtime with the --dbfolder parameter:
$ TEsmall -f Parental_1.fastq.gz DroKO_1.fastq.gz -g hg19 -l Parental DroKO --dbfolder /path/to/another/folder/
The files used by TEsmall will be downloaded to, and read from, the
genomes folder inside /path/to/another/folder/.
The default location is $HOME/TEsmall_db/
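The folder layout described here can be sketched in Python. This only mirrors the description in this README (a genomes folder inside the database folder, defaulting to $HOME/TEsmall_db/). The function name genomes_dir is hypothetical, and TEsmall's internal code may resolve the path differently.

```python
import os


def genomes_dir(dbfolder=None):
    """Return the 'genomes' folder inside the TEsmall database folder."""
    if dbfolder is None:
        # Default location stated in this README: $HOME/TEsmall_db/
        dbfolder = os.path.join(os.path.expanduser("~"), "TEsmall_db")
    return os.path.join(dbfolder, "genomes")


print(genomes_dir("/path/to/another/folder/"))
print(genomes_dir())
```

Checking that this folder exists and is writable before a long run can save a failed download later.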
$ TEsmall -h
usage: TEsmall [-h] [-a STR] [-m INT] [-M INT] [-g STR] [--maxaln INT]
[--mismatch INT] [-o STR [STR ...]] [-p INT] [-f STR [STR ...]]
[-l STR [STR ...]] [--dbfolder STR] [--verbose INT] [-v]
optional arguments:
-h, --help show this help message and exit
-a STR, --adapter STR
Sequence of an adapter that was ligated to the 3' end.
The adapter itself and anything that follows is
trimmed. (default: TGGAATTCTCGGGTGCCAAGG)
-m INT, --minlen INT Discard trimmed reads that are shorter than INT. Reads
that are too short even before adapter removal are
also discarded. (default: 16)
-M INT, --maxlen INT Discard trimmed reads that are longer than INT. Reads
that are too long even before adapter removal are also
discarded. (default: 36)
-g STR, --genome STR Version of reference genome (default: hg38)
--maxaln INT Suppress all alignments for a particular read if more
than INT reportable alignments exist for it. (default:
100)
--mismatch INT Report alignments with at most INT mismatches.
(default: 0)
-o STR [STR ...], --order STR [STR ...]
Annotation priority. (default: structural_RNA miRNA
hairpin exon TE intron piRNA_cluster)
-p INT, --parallel INT
Parallel execute by INT CPUs. (default: 1)
-f STR [STR ...], --fastq STR [STR ...]
Input in FASTQ format. Compressed input is supported
and auto-detected from the filename extension (.gz).
-l STR [STR ...], --label STR [STR ...]
Unique label for each sample.
--dbfolder STR Custom location of TEsmall database folder (containing the "genomes" folder).
DEFAULT: $HOME/TEsmall_db/
--verbose INT Set verbose level.
0: only show critical message
1: show additional warning message
2: show process information
3: show debug messages.
DEFAULT: 2
-v, --version show program's version number and exit
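To illustrate what the -a, -m and -M options describe, here is a simplified Python sketch of 3' adapter trimming followed by length filtering. It is an assumption-laden illustration, not TEsmall's implementation: it only handles exact, full-length adapter matches, whereas the Cutadapt step listed among the outputs performs more tolerant matching. The function names and the example read are hypothetical; the default values are the ones shown in the help text.

```python
DEFAULT_ADAPTER = "TGGAATTCTCGGGTGCCAAGG"  # default of -a/--adapter


def trim_adapter(read, adapter=DEFAULT_ADAPTER):
    """Remove the adapter and everything after it (exact matches only)."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


def passes_length_filter(read, minlen=16, maxlen=36):
    """Keep reads whose length lies within [minlen, maxlen] (-m and -M defaults)."""
    return minlen <= len(read) <= maxlen


insert = "TGAGGTAGTAGGTTGTATAGTT"  # made-up 22-nt insert for illustration
raw = insert + DEFAULT_ADAPTER + "AAAA"  # insert, adapter, trailing bases
trimmed = trim_adapter(raw)
print(trimmed, len(trimmed), passes_length_filter(trimmed))
```

With the defaults, a trimmed read of 15 nt or 37 nt would be discarded, while reads of 16 to 36 nt are kept.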
Brief explanations of the output files generated by TEsmall:
count_summary.txt - Combined count table of all libraries processed
by TEsmall. This is typically the file to use for
differential analysis.
report.html - HTML report of QC and annotation statistics
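For downstream analysis, the combined count table can be loaded with a few lines of Python. This README does not document the layout of count_summary.txt, so the sketch below assumes a tab-delimited table with a header row, a feature identifier in the first column, and one numeric column per library label. Check the actual file before relying on it. The function name read_count_table and the example rows are hypothetical and are not real TEsmall output.

```python
import csv
import io


def read_count_table(handle):
    """Parse a tab-delimited count table into {feature: {sample: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumed: first column holds the feature name
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = {s: float(v) for s, v in zip(samples, row[1:])}
    return table


# Made-up two-library table standing in for count_summary.txt.
example = io.StringIO(
    "feature\tParental\tDroKO\n"
    "featureA\t10\t4\n"
    "featureB\t0\t7\n"
)
counts = read_count_table(example)
print(counts["featureA"]["DroKO"])
```

For a real run, pass an open file instead, for example `with open("count_summary.txt", newline="") as fh: counts = read_count_table(fh)`.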
The following files are generated for each library and are named using the label
supplied with the -l, --label parameter.
[label].trimmed1.fastq - FASTQ file after 3' adapter trimming
[label].cutadapt1.log - Cutadapt log from 3' adapter trimming
[label].trimmed2.fastq - FASTQ file after 3' & 5' adapter trimming
[label].cutadapt2.log - Cutadapt log from 5' adapter trimming
[label].bam - BAM output for reads that aligned to rRNA (in older versions)
[label].rRNA.bam - BAM output for reads that aligned to rRNA
[label].rRNA.log - Bowtie log for rRNA mapping
[label].rm_rRNA.fastq - FASTQ file depleted of rRNA reads (used for subsequent analysis)
[label].log - Bowtie log for genome alignment (in older versions)
[label].genome.log - Bowtie log for genome alignment
[label].unaligned.fastq - FASTQ containing reads that failed to align to genome
[label].exceeded.fastq - FASTQ containing reads that aligned too many times to genome
[label].rinfo - Length & alignment counts for each aligned read (in older versions)
[label].aligned.rinfo - Length & alignment counts for each aligned read
[label].multi.bam - BAM output for reads aligned to genome (in older versions)
[label].genome.bam - BAM output for reads aligned to genome
[label].cca.fa - FASTA file containing aligned reads terminating with CCA, with CCA tail cleaved
[label].tRNA.bam - BAM output for CCA-trimmed reads that aligned to tRNA
[label].3trf.log - Bowtie log for CCA-trimmed reads aligning to tRNA (in older versions)
[label].tRNA.log - Bowtie log for CCA-trimmed reads aligning to tRNA
[label].unaligned.cca.fa - FASTA file containing CCA-trimmed reads that failed to align
[label].trna_for_intersect.bam - BAM file of CCA-trimmed reads that aligned to tRNA, converted to genomic coordinates
[label].3trf_free.bam - BAM file of reads aligned to genome that are not tRF
[label].3trf.bam - BAM file of reads aligned to genome that are tRF
[label].anno - Annotation of aligned reads that are not tRF
[label].3trf.struc.mapper.anno - tRF that annotated to structural RNA (e.g. tRNA)
[label].3trf.TE.mapper.anno - tRF that annotated to TE
[label].comp - Length distribution of reads based on annotation (in older versions)
[label].anno.rlen.info - Length distribution of reads based on annotation
[label].bedgraph - BEDgraph of annotated reads weighted by EM
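Because every per-library file name is built from the label, a quick completeness check after a run is easy to script. The Python sketch below uses a subset of the current (not older-version) suffixes from the list above. It assumes the files are written to, and kept in, a single output directory, which this README does not state. The names KEY_SUFFIXES, expected_outputs and missing_outputs are hypothetical.

```python
import os

# Subset of per-library suffixes taken from the list above (current naming).
KEY_SUFFIXES = [
    "rRNA.bam",
    "genome.bam",
    "aligned.rinfo",
    "anno",
    "anno.rlen.info",
    "bedgraph",
]


def expected_outputs(label, suffixes=KEY_SUFFIXES):
    """Return the file names expected for one library label."""
    return ["%s.%s" % (label, suffix) for suffix in suffixes]


def missing_outputs(label, outdir="."):
    """Return the expected files that are not present in outdir."""
    return [
        name
        for name in expected_outputs(label)
        if not os.path.exists(os.path.join(outdir, name))
    ]


for label in ("Parental", "DroKO"):
    print(label, missing_outputs(label))
```

An empty list for a label means all of the checked files were found; extend KEY_SUFFIXES if you rely on other outputs.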
TEsmall is part of the TEToolkit suite.
TEsmall is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with TEsmall. If not, see <https://www.gnu.org/licenses/>.
If using the software in a publication, please cite the following:
O'Neill K, Liao WW, Patel A, Hammell MG. (2018) TEsmall Identifies Small RNAs Associated With Targeted Inhibitor Resistance in Melanoma. Front Genet. Oct 5;9:461.