Set of scripts for RNA-Seq data processing, in particular differential expression analysis
Reads were aligned to the hg38 human genome build using STAR (2.4.2a) (23104886). BAM files were sorted and duplicate reads flagged using NovoSort (1.03.09) (Novocraft). The aligned reads overlapping human exons (Ensembl 82) were counted using HTSeq (0.1) (25260700) . Differential gene expression was assessed with DESeq2 (1.8.2) (25516281) , and differential splicing was assessed with SGSeq (27218464) and DEXSeq (22722343), running on R (3.3.2) (R project for statistical computing).
R packages and software:
/cluster/project8/vyp/vincent/Software/R-3.2.2/bin/R
Notes for installation of DEXSeq/DESeq2:
scl enable devtoolset-1.1 'bash'
and then run R
The key input is a table in tab delimited plain text format that contains one row per sample, with a header line. Here is a simple example:
sample | f1 | f2 | condition_A | condition_B | type_1 |
---|---|---|---|---|---|
control_1 | sample1_R1.fastq.gz | sample1_R2.fastq.gz | control | control | female |
mutation_A_1 | sample2_lane1_R1.fastq.gz,sample2_lane._R1.fastq.gz | sample2_lane1.fastq.gz,sample2_lane2.fastq.gz | mutation_A | NA | male |
mutation_B_1 | sample3_R1.fastq.gz | sample3_R2.fastq.gz | NA | mutation_B | female |
sample
must be a unique sample name.f1
and f2
specify the input fastqs, which must be gzipped - they will have the extension <file>.fastq.gz
. If the data are single ended then the fastq file names will go under f1 and all cells in f2 must be set to NA. Only put file name in here - the path to the fastq files should be set in the submission scriptcondition
is what you want to run the differential analyses on. You can include multiple condition columns and any sample to be excluded from a condition should have a value of NA. DESeq2 and DEXSeq will assume that the topmost or alphabetically first condition label is the reference condition in the comparison so order your rows accordingly.type
is an optional covariate column which will be included in the differential expression model. You can have as many type columns as you wish as long as the column name contains the word "type"./subfolder/sample1.fastq
.The second input file is a list of variables that will be used by the pipeline script. An example is included in the repository and should be filled in by the user. Each variable is named and explained below:
pipeline
: path to where the repository is downloadedoFolder
: where the processed data will go. Conventionally samples/processed
.iFolder
: where the input fastq files are. If in separate subfolders then include the subfolder in the sample table.dataframe
: the path to the sample tablecode
: the name of the project, used as the job name when submitting each step.species
: the species genome used for alignment. See supported species for detailsstranded=(unstranded/fr-firststrand/fr-secondstrand)
: whether the RNA-seq library is stranded. submit=(yes|no)
: whether the pipeline should automatically submit jobs to the cluster.step0_QC=(yes|no)
: each fastqc is checked with FastQC. This step should ideally be completed before the rest of the pipeline is run.trim_galore=(yes|no)
: should the fastq files be trimmed before alignment?starStep1=(yes|no)
: the initial alignment step with STAR, followed by sorting and duplicate marking.starStep1b=(yes|no)
: indexing of the resulting bam file and computing of summary statistics.starStep2=(yes|no)
: counting the reads in each sample with HTSeq.prepareCounts=(yes|no)
: preparation of read counts for DESeq and DEXSeq; creation of per-sample FPKMs.Rdeseq=(yes|no)
: differential gene expression with the DESeq2 package.Rdexseq=(yes|no)
: differential exon usage with the DEXSeq package.goseq=(yes|no)
; Gene Ontology enrichment analysis with the GOSeq package.Below is a list of currently supported species genomes and the species code needed for the submission file:
Code | Animal | Genome build | GTF source and version |
---|---|---|---|
human_hg38 |
homo sapiens | hg38 | Ensembl 82 |
mouse |
mus musculus | mm10 | Ensembl 82 |
rat |
rattus norvegicus | rnor6 | Ensembl 90 |
worm |
caenorhabditis elegans | WBcel235 | WormBase 235/ Ensembl 89 |
fly |
drosophila melanogaster | dm6 | Ensembl 82 |
macaque |
macaca mullata | Mmul_8.0.1 | Ensembl 90 |
mosquito |
anopheles gambiae | AgamP4 | VectorBase 36 |
If you would like to use the RNA-seq pipeline with any other species then please raise it as an issue on GitHub.
Strand-specific library preparations are now commonplace, which improves the accuracy of feature quantification. However, there are two possible ways of stranding a paired RNAseq library:
Two flags in the submission script, summary
and force
are now deprecated. They have now been repurposed for non-standard use cases.
starStep1a
to yes
and force
to SJsOnly
.Useful if pipeline has crashed downstream of trimming.
trim_galore
to yes
and summary
to trimmed_exist
.If you're interested in novel/unannotated splicing events you should consider using STAR's two-pass mapping mode. This first aligns your reads using a known set of transcripts (provided by Ensembl), with any novel splice junctions being mapped. It then incorporates those novel junctions into the referene and re-aligns. The two-pass approach will not necessarily increase the detection of novel junctions but will improve the number of splice reads mapping to them. It is however much slower than the usual mode.
summary
to twopass
If you're interested in circular RNAs or fusion transcripts then you can turn on STAR's reporting of chimeric alignments with
force
to chimeric
This will align samples as usual but for each sample will output separate Chimeric.out.sam
and Chimeric.out.junction
files. See the STAR manual for more details.