plagnollab / RNASeq_pipeline

Set of scripts for RNA-Seq data processing, in particular differential expression analysis

Description of pipeline (RNA-Seq pipeline version 8)

PubMed IDs for each tool are given in parentheses after the version number:

Reads were aligned to the hg38 human genome build using STAR (2.4.2a) (23104886). BAM files were sorted and duplicate reads flagged using NovoSort (1.03.09) (Novocraft). The aligned reads overlapping human exons (Ensembl 82) were counted using HTSeq (0.1) (25260700). Differential gene expression was assessed with DESeq2 (1.8.2) (25516281), and differential splicing was assessed with SGSeq (27218464) and DEXSeq (22722343), running on R (3.3.2) (R project for statistical computing).

Requirements

Join the chat at https://gitter.im/plagnollab/RNASeq_pipeline

R packages and software:

Notes for installation of DEXSeq/DESeq2:

Input 1: Sample Table

The key input is a tab-delimited plain text table with a header line and one row per sample. Multiple FASTQ files for the same sample are separated by commas. Here is a simple example:

| sample | f1 | f2 | condition_A | condition_B | type_1 |
|---|---|---|---|---|---|
| control_1 | sample1_R1.fastq.gz | sample1_R2.fastq.gz | control | control | female |
| mutation_A_1 | sample2_lane1_R1.fastq.gz,sample2_lane2_R1.fastq.gz | sample2_lane1_R2.fastq.gz,sample2_lane2_R2.fastq.gz | mutation_A | NA | male |
| mutation_B_1 | sample3_R1.fastq.gz | sample3_R2.fastq.gz | NA | mutation_B | female |
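
Before submitting, it can help to sanity-check the sample table. The sketch below is not part of the pipeline. It assumes paired-end data with the column names sample, f1 and f2 from the example, comma-separated FASTQ lists, and that NA marks a sample left out of a given comparison. The function name `read_sample_table` is hypothetical.

```python
import csv
import io


def read_sample_table(handle):
    """Parse a tab-delimited sample table and check that f1/f2 lists pair up."""
    samples = []
    for row in csv.DictReader(handle, delimiter="\t"):
        f1 = row["f1"].split(",")
        f2 = row["f2"].split(",")
        if len(f1) != len(f2):
            raise ValueError(
                f"{row['sample']}: {len(f1)} f1 files but {len(f2)} f2 files"
            )
        # Every column after sample/f1/f2 is treated as a condition or covariate.
        # Assumption: "NA" means the sample is not part of that comparison.
        design = {
            key: value
            for key, value in row.items()
            if key not in ("sample", "f1", "f2") and value != "NA"
        }
        samples.append(
            {"sample": row["sample"], "f1": f1, "f2": f2, "design": design}
        )
    return samples


# Two rows taken from the example table, written out with real tab characters.
example = (
    "sample\tf1\tf2\tcondition_A\tcondition_B\ttype_1\n"
    "control_1\tsample1_R1.fastq.gz\tsample1_R2.fastq.gz\tcontrol\tcontrol\tfemale\n"
    "mutation_B_1\tsample3_R1.fastq.gz\tsample3_R2.fastq.gz\tNA\tmutation_B\tfemale\n"
)

for entry in read_sample_table(io.StringIO(example)):
    print(entry["sample"], len(entry["f1"]), entry["design"])
```

A mismatch between the number of f1 and f2 files raises a `ValueError` naming the offending sample, which is usually easier to debug than a failed alignment job.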

Input 2: Submission Form

The second input file is a list of variables that will be used by the pipeline script. An example is included in the repository and should be filled in by the user. Each variable is named and explained below:

Supported species

Below is a list of currently supported species genomes and the species code needed for the submission file:

| Code | Species | Genome build | GTF source and version |
|---|---|---|---|
| human_hg38 | Homo sapiens | hg38 | Ensembl 82 |
| mouse | Mus musculus | mm10 | Ensembl 82 |
| rat | Rattus norvegicus | rnor6 | Ensembl 90 |
| worm | Caenorhabditis elegans | WBcel235 | WormBase 235 / Ensembl 89 |
| fly | Drosophila melanogaster | dm6 | Ensembl 82 |
| macaque | Macaca mulatta | Mmul_8.0.1 | Ensembl 90 |
| mosquito | Anopheles gambiae | AgamP4 | VectorBase 36 |

If you would like to use the RNA-Seq pipeline with any other species, please raise an issue on GitHub.

A note on stranding

Strand-specific library preparations, which improve the accuracy of feature quantification, are now commonplace. However, there are two possible ways of stranding a paired RNA-Seq library:

Advanced usage

Two flags in the submission script, summary and force, are deprecated in their original roles and have been repurposed for non-standard use cases.

Only outputting lists of splice junctions from STAR

Realign files that have already been trimmed

Useful if the pipeline has crashed downstream of trimming.

Two-pass mapping with STAR

If you're interested in novel/unannotated splicing events, you should consider using STAR's two-pass mapping mode. The first pass aligns your reads against a known set of transcripts (provided by Ensembl) and records any novel splice junctions it finds. The second pass incorporates those novel junctions into the reference and re-aligns the reads. The two-pass approach will not necessarily detect more novel junctions, but it will increase the number of spliced reads mapping to them. It is, however, much slower than the usual mode.
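
For orientation, the sketch below shows how two-pass mode looks at the level of STAR itself, using STAR's `--twopassMode Basic` option. It is not the pipeline's actual invocation: the function name `star_command`, the paths, the thread count and the choice of output options are all placeholders chosen for illustration.

```python
import shlex


def star_command(genome_dir, read1, read2, out_prefix, two_pass=False, threads=4):
    """Build a STAR argument list; all paths and defaults are placeholders."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", read1, read2,
        "--readFilesCommand", "zcat",  # inputs are gzipped FASTQ files
        "--outSAMtype", "BAM", "Unsorted",
        "--outFileNamePrefix", out_prefix,
    ]
    if two_pass:
        # Per-sample two-pass mode: junctions discovered in the first pass are
        # added to the index on the fly before the reads are aligned again.
        cmd += ["--twopassMode", "Basic"]
    return cmd


# Print the command rather than running it, so the sketch works without STAR.
print(shlex.join(
    star_command("star_index/", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1_", two_pass=True)
))
```

Building the command as a list and only joining it for display keeps file names with spaces safe if the list is later passed to `subprocess.run`.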

Chimeric alignments

If you're interested in circular RNAs or fusion transcripts then you can turn on STAR's reporting of chimeric alignments with

Samples are aligned as usual, but STAR additionally writes separate Chimeric.out.sam and Chimeric.out.junction files for each sample. See the STAR manual for more details.
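
As a starting point for working with the Chimeric.out.junction output, the sketch below counts supporting reads per breakpoint pair. It assumes the column layout described in the STAR manual, where the first six tab-separated columns are donor chromosome, donor breakpoint, donor strand, acceptor chromosome, acceptor breakpoint and acceptor strand. The function name `count_chimeric_junctions` is hypothetical and the example lines are invented for illustration, not real results.

```python
from collections import Counter


def count_chimeric_junctions(lines):
    """Count supporting reads per (donor, acceptor) breakpoint pair."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, header lines (written by newer STAR versions)
        # and anything too short to hold both breakpoints.
        if line.startswith("#") or len(fields) < 6 or not fields[1].isdigit():
            continue
        donor = (fields[0], int(fields[1]), fields[2])
        acceptor = (fields[3], int(fields[4]), fields[5])
        counts[(donor, acceptor)] += 1
    return counts


# Invented example lines (only the first ten columns are filled in here).
example_lines = [
    "chr9\t1000\t+\tchr22\t2000\t+\t1\t0\t0\tread_a\n",
    "chr9\t1000\t+\tchr22\t2000\t+\t1\t0\t0\tread_b\n",
    "chr1\t500\t-\tchr1\t300\t-\t2\t0\t0\tread_c\n",
]

for (donor, acceptor), n_reads in count_chimeric_junctions(example_lines).most_common():
    print(donor, acceptor, n_reads)
```

In practice the lines would come from an open file handle, for example `open("Chimeric.out.junction")`, and a minimum read count would usually be applied before interpreting a junction as a candidate fusion or circular RNA.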