
TEsmall

Version 2.0.6

A pipeline for profiling TE-derived small RNAs.

Created by Wen-Wei Liao, Kat O'Neill & Molly Gale Hammell, March 2017

Contact: mghcompbio@gmail.com

Install Miniconda 3 (Linux)

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Setup channels

$ conda config --add channels conda-forge
$ conda config --add channels bioconda

Install TEsmall

$ git clone https://github.com/mhammell-laboratory/TEsmall.git
$ cd TEsmall
$ conda env create -f environment.yaml -n TEsmall
$ conda activate TEsmall
$ python setup.py install

How to run TEsmall

Before executing TEsmall, make sure you have activated the environment
```
$ conda activate TEsmall
```
For example, to run TEsmall on two FASTQ files, Parental_1.fastq.gz and DroKO_1.fastq.gz:
```
$ TEsmall -f Parental_1.fastq.gz DroKO_1.fastq.gz -l Parental DroKO
```
When it's done, deactivate the environment
```
$ conda deactivate
```
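When scripting runs over many samples, it can help to assemble the command programmatically. The following is a minimal Python sketch and not part of TEsmall: `build_tesmall_command` is a hypothetical helper that assumes one unique label per FASTQ file (as in the example above) and uses only the `-f` and `-l` flags documented in this README.

```python
import shlex


def build_tesmall_command(fastqs, labels, extra_args=()):
    """Return the TEsmall argument list for the given FASTQ files and labels.

    Assumption: each FASTQ file gets exactly one label, and labels are unique.
    """
    if len(fastqs) != len(labels):
        raise ValueError("need exactly one label per FASTQ file")
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    return ["TEsmall", "-f", *fastqs, "-l", *labels, *extra_args]


cmd = build_tesmall_command(
    ["Parental_1.fastq.gz", "DroKO_1.fastq.gz"],
    ["Parental", "DroKO"],
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

To actually launch the run, the list could be passed to `subprocess.run(cmd, check=True)` from within the activated TEsmall environment.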
To change the directory where TEsmall downloads and reads the genomes it uses for annotation, pass the --dbfolder parameter at runtime
```
$ TEsmall -f Parental_1.fastq.gz DroKO_1.fastq.gz -g hg19 -l Parental DroKO --dbfolder /path/to/another/folder/
```
TEsmall will download its files to, and read them from, the genomes folder inside /path/to/another/folder/.

The default location is $HOME/TEsmall_db/
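As a rough illustration of the layout described above, the sketch below computes where the genomes folder would sit for a given --dbfolder value, falling back to the default. The function name `genomes_dir` is hypothetical, and how TEsmall resolves the path internally is not shown in this README; this only mirrors the documented behavior.

```python
from pathlib import Path


def genomes_dir(dbfolder=None):
    """Return the expected location of the "genomes" folder.

    With no dbfolder given, fall back to the documented default,
    $HOME/TEsmall_db/.
    """
    base = Path(dbfolder) if dbfolder else Path.home() / "TEsmall_db"
    return base / "genomes"


print(genomes_dir("/path/to/another/folder/"))
print(genomes_dir())
```

Checking that this folder exists and has free space before a first run may be useful, since the reference files are downloaded into it.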

For more information

$ TEsmall -h
usage: TEsmall [-h] [-a STR] [-m INT] [-M INT] [-g STR] [--maxaln INT]
               [--mismatch INT] [-o STR [STR ...]] [-p INT] [-f STR [STR ...]]
               [-l STR [STR ...]] [--dbfolder STR] [--verbose INT] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -a STR, --adapter STR
                        Sequence of an adapter that was ligated to the 3' end.
                        The adapter itself and anything that follows is
                        trimmed. (default: TGGAATTCTCGGGTGCCAAGG)
  -m INT, --minlen INT  Discard trimmed reads that are shorter than INT. Reads
                        that are too short even before adapter removal are
                        also discarded. (default: 16)
  -M INT, --maxlen INT  Discard trimmed reads that are longer than INT. Reads
                        that are too long even before adapter removal are also
                        discarded. (default: 36)
  -g STR, --genome STR  Version of reference genome (default: hg38)
  --maxaln INT          Suppress all alignments for a particular read if more
                        than INT reportable alignments exist for it. (default:
                        100)
  --mismatch INT        Report alignments with at most INT mismatches.
                        (default: 0)
  -o STR [STR ...], --order STR [STR ...]
                        Annotation priority. (default: structural_RNA miRNA
                        hairpin exon TE intron piRNA_cluster)
  -p INT, --parallel INT
                        Parallel execute by INT CPUs. (default: 1)
  -f STR [STR ...], --fastq STR [STR ...]
                        Input in FASTQ format. Compressed input is supported
                        and auto-detected from the filename extension (.gz).
  -l STR [STR ...], --label STR [STR ...]
                        Unique label for each sample.
  --dbfolder STR        Custom location of TEsmall database folder (containing the "genomes" folder).
                        DEFAULT: $HOME/TEsmall_db/

  --verbose INT         Set verbose level.
                        0: only show critical message
                        1: show additional warning message
                        2: show process information
                        3: show debug messages.
                        DEFAULT: 2
  -v, --version         show program's version number and exit
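
The length and alignment thresholds above can be read as simple inclusive bounds. The sketch below only illustrates the documented semantics of -m/--minlen, -M/--maxlen and --maxaln with their default values; the helper names are hypothetical, and in TEsmall itself the trimming and alignment steps are carried out by Cutadapt and Bowtie (see the log files listed under the output files).

```python
def passes_length_filter(read_length, minlen=16, maxlen=36):
    """True if a trimmed read of this length is kept.

    Reads shorter than minlen or longer than maxlen are discarded,
    so both bounds are inclusive.
    """
    return minlen <= read_length <= maxlen


def exceeds_maxaln(n_alignments, maxaln=100):
    """True if all alignments for the read would be suppressed.

    Suppression applies only when MORE than maxaln reportable
    alignments exist, so exactly maxaln alignments are still reported.
    """
    return n_alignments > maxaln


for length in (15, 16, 36, 37):
    print(length, passes_length_filter(length))

for n in (1, 100, 101):
    print(n, exceeds_maxaln(n))
```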

Output files

Here are some brief explanations of the output files generated by TEsmall

Final output

count_summary.txt    -    This is the file containing the combined count table
                          of all libraries processed by TEsmall. This is typically
                          the file you want to use for differential analysis.
report.html          -    HTML report of QC and annotation statistics
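
Before handing the count table to a differential analysis tool, it can be loaded for a quick sanity check. The sketch below is illustrative only: the layout of count_summary.txt is not described in this README, so the code ASSUMES a tab-delimited table whose first column holds feature identifiers and whose remaining columns hold one count per sample label. The rows in `toy` are made-up placeholders, not real TEsmall output, and `read_count_table` is a hypothetical helper.

```python
import csv
import io


def read_count_table(handle):
    """Parse a tab-delimited count table into {feature: {label: count}}.

    Assumption: the header row is "<id column>\t<label 1>\t<label 2>...".
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    labels = header[1:]
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = (float(v) for v in row[1:])
        counts[row[0]] = dict(zip(labels, values))
    return labels, counts


# Made-up toy table standing in for count_summary.txt.
toy = io.StringIO(
    "feature\tParental\tDroKO\n"
    "feature_A\t10\t4\n"
    "feature_B\t0\t7\n"
)
labels, counts = read_count_table(toy)
print(labels)
print(counts["feature_A"]["DroKO"])
```

For a real run, replace `toy` with `open("count_summary.txt", newline="")` and adjust the parsing if the actual header differs from the assumed layout.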

The following files are generated for each library and are named using the label the user provided with the -l, --label parameter.

Preprocessing output

[label].trimmed1.fastq    -   FASTQ file after 3' adapter trimming
[label].cutadapt1.log     -   Cutadapt log from 3' adapter trimming
[label].trimmed2.fastq    -   FASTQ file after 3' & 5' adapter trimming
[label].cutadapt2.log     -   Cutadapt log from 5' adapter trimming
[label].bam               -   BAM output for reads that aligned to rRNA (in older versions)
[label].rRNA.bam          -   BAM output for reads that aligned to rRNA
[label].rRNA.log          -   Bowtie log for rRNA mapping
[label].rm_rRNA.fastq     -   FASTQ file depleted for rRNA reads
                              Used for subsequent analysis

Genome alignment output

[label].log               -   Bowtie log for genome alignment (in older versions)
[label].genome.log        -   Bowtie log for genome alignment
[label].unaligned.fastq   -   FASTQ containing reads that failed to align to genome
[label].exceeded.fastq    -   FASTQ containing reads that aligned too many times to genome
[label].rinfo             -   Length & alignment counts for each aligned read (in older versions)
[label].aligned.rinfo     -   Length & alignment counts for each aligned read
[label].multi.bam         -   BAM output for reads aligned to genome (in older versions)
[label].genome.bam        -   BAM output for reads aligned to genome

Identifying tRNA fragments (tRFs)

Schorn et al. 2017

[label].cca.fa                    -   FASTA file containing aligned reads terminating with CCA, with CCA tail cleaved
[label].tRNA.bam                  -   BAM output for CCA-trimmed reads that aligned to tRNA
[label].3trf.log                  -   Bowtie log for CCA-trimmed reads aligning to tRNA (in older versions)
[label].tRNA.log                  -   Bowtie log for CCA-trimmed reads aligning to tRNA
[label].unaligned.cca.fa          -   FASTA file containing CCA-trimmed reads that failed to align
[label].trna_for_intersect.bam    -   BAM file of CCA-trimmed reads that aligned to tRNA, converted to genomic coordinates
[label].3trf_free.bam             -   BAM file of reads aligned to genome that are not tRF
[label].3trf.bam                  -   BAM file of reads aligned to genome that are tRF

Annotation output

[label].anno                      -   Annotation of aligned reads that are not tRF
[label].3trf.struc.mapper.anno    -   tRF that annotated to structural RNA (e.g. tRNA)
[label].3trf.TE.mapper.anno       -   tRF that annotated to TE
[label].comp                      -   Length distribution of reads based on annotation (in older versions)
[label].anno.rlen.info            -   Length distribution of reads based on annotation
[label].bedgraph                  -   BEDgraph of annotated reads weighted by EM
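
To check a finished run, the per-library file names listed above can be compared against what is on disk. The following sketch is a hypothetical helper, not a TEsmall feature: it covers only the current (not "older versions") names from the preprocessing, genome alignment and annotation lists, and it assumes every listed file is kept in the output directory, which may not hold for all intermediate files.

```python
import tempfile
from pathlib import Path

# Current per-library suffixes taken from the lists in this README.
EXPECTED_SUFFIXES = {
    "preprocessing": [
        "trimmed1.fastq", "cutadapt1.log", "trimmed2.fastq",
        "cutadapt2.log", "rRNA.bam", "rRNA.log", "rm_rRNA.fastq",
    ],
    "genome alignment": [
        "genome.log", "unaligned.fastq", "exceeded.fastq",
        "aligned.rinfo", "genome.bam",
    ],
    "annotation": [
        "anno", "3trf.struc.mapper.anno", "3trf.TE.mapper.anno",
        "anno.rlen.info", "bedgraph",
    ],
}


def missing_outputs(label, outdir="."):
    """Return {stage: [missing file names]} for one sample label."""
    outdir = Path(outdir)
    missing = {}
    for stage, suffixes in EXPECTED_SUFFIXES.items():
        names = [f"{label}.{suffix}" for suffix in suffixes]
        absent = [name for name in names if not (outdir / name).exists()]
        if absent:
            missing[stage] = absent
    return missing


# Demonstration in a scratch directory containing a single output file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Parental.anno").touch()
    report = missing_outputs("Parental", tmp)

for stage, names in report.items():
    print(stage, len(names), "missing")
```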

Copying & distribution

TEsmall is part of the TEToolkit suite.

TEsmall is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with TEsmall. If not, see <https://www.gnu.org/licenses/>.

Citation

If using the software in a publication, please cite the following:

O'Neill K, Liao WW, Patel A, Hammell MG. (2018) TEsmall Identifies Small RNAs Associated With Targeted Inhibitor Resistance in Melanoma. Front Genet. Oct 5;9:461.