This project is now deprecated. For a RNA-Seq pipeline that works on SLURM clusters like Stanford's Sherlock with minimal setup, see here
I've recently discovered Cromwell, a workflow engine that easily scales tasks like RNASeq processing on HPC clusters. variant-discovery-pipeline
uses Cromwell to abstract large amounts of pipeline code. rna-pipeline
needs a major update to take advantage of Cromwell features.
RNA-Seq analysis pipeline for the Lee Lab at Stanford University. The pipeline is designed to mimic dnanexus function on Stanford's Sherlock HPC cluster. The pipeline receives as input paired-end FASTQ files and performs alignment as well as several analysis functions.
The pipeline consists of 3 different steps:
Alignment is first completed using the STAR Aligner. Its settings mimic those found in ENCODE's long-rna-seq-pipeline. RNA Expression Quantification and Gene Expression Analysis are performed in parallel. RNA Expression Quantification is computed using RSEM software package. Gene Expression Analysis is performed first by counting RNA reads with HTSeq-count and then analyzing counts with DESeq2.
pipeline.py
is the master program to fully automate the pipeline. The user enters a directory containing paired end FASTQ files they wish to process, and the pipeline runs creates an output directory containing the aligned bam files, rsem quantification files, and htseq-count read counts file. Once the rsem and htseq jobs have all finished, use the DESeq-launch.py
command to analyze the differential expression of all the samples.
-I
|--inputDirectory
Input directory containing paired end gzipped FASTQ files. FASTQ files must end in .gz extension
-gDir
|--genomeDirectory
Location of the genome directory required by STAR. Must be generated by the user prior to program use. Consult STAR documentation to generate the directory.
-rDir
|rsemReferenceDirectory
Location of the reference directory required by RSEM. Must be generated by the user prior to program use. Consult RSEM documentation to generate the directory.
-output
|--outputDirectory
(Optional) Directory location for where to store all files generated by the pipeline. If not specified, the pipeline will create a new directory, named *input_directory_path*-pipeline-output
Example command:
python pipeline.py -I /scratch/PI/carilee/tomas/CACYBP_KD_fastqs/ -rDir /scratch/users/tbencomo/RNA_seq/refs/out/rsem -gDir /scratch/users/tbencomo/RNA_seq/refs/out
star-aligner.py
performs sequence alignment with the STAR aligner
star-aligner.py
acts as a wrapper for star.sh
the bash script that executes the star aligner. star-aligner.py
submits a slurm sbatch job to run the STAR aligner.
-wd
|--workingDirectory
Working Directory: Where all output files will be located
-f1
|--fastq1
FASTQ file 1: 1 of the two paired end fastq files to align
-f2
|--fastq2
FASTQ file 2: 1 of the two paired end fastq files to align
-gDir
|--genomeDirectory
Genome Directory: Location of reference files for star aligner
-prefix
|--outFilePrefix
Output File Prefix: file prefix that will be appended to start of output files. Optional. By default, the prefix is set to the filename of FASTQ1
(Files are prefixed with specified or default prefix above)
_Aligned.sortedByCoord.out.bam
Aligned BAM file sorted by coordinate. Used for programs such as HTSeq-count_Aligned.toTranscriptome.out.bam
Aligned BAM of translated coordinates. Used for programs such as RSEM_Log.out
Main file with information about run_Log.progress.out
Job progress statisticsExample command:
python star-aligner.py -wd /scratch/users/tbencomo/RNA_seq/pipeline-tests -gDir /scratch/users/tbencomo/RNA_seq/refs/out -f1 /scratch/users/tbencomo/RNA_seq/input_files/SG13_004_004_CGCTCATT-ATAGAGGC_R1.fastq -f2 /scratch/users/tbencomo/RNA_seq/input_files/SG13_004_004_CGCTCATT-ATAGAGGC_R2.fastq
rsem-calculate.py
performs rna expression quantification with the RSEM software package.
rsem-calculate.py
is a wrapper for rsem.sh, which actually calls the rsem-calculate-expression program. rsem-calculate.py
submits a SLURM sbatch job of rsem.sh
-I
|--input-bam-file
Input Bam: RNA-Seq aligned transcriptome bam file
-rDir
|--rsem-ref-directory
RSEM Reference Directory: Prebuilt files RSEM needs to run
wd
|--working-directory
Working Directory: Where all output files are located
(Files are prefixed with prefix computated from input bam file - contains filename until 'Aligned')
.genes.results
Gene level expression estimates.isoforms.results
Isoform level expression estimates.stat
Folder containing model statistics.transcript.bam
Read alignments in transcript coordinatesExample command:
python rsem-calculate.py -wd /scratch/users/tbencomo/RNA_seq/pipeline-tests/ -I /scratch/users/tbencomo/RNA_seq/pipeline-tests/SG13_004_004_CGCTCATT-ATAGAGGC_Aligned.toTranscriptome.out.bam -rDir /scratch/users/tbencomo/RNA_seq/refs/out/rsem
htseq-launch.py
performs a count of all the read sequences from the aligned sorted by coordinate bam input file via the HTSeq software package.
htseq-launch.py
is a wrapper for htseq.sh,
which actually calls the htseq-count program. htseq-launch.py
submits a SLURM sbatch job of htseq.sh
-I
|--input-bam-file
Input Bam: RNA-Seq aligned sorted by coordinates bam file
wd
|--working-directory
Working Directory: Where all output files are located
(Files are prefixed with prefix computated from input bam file - contains filename until 'Aligned')
_counts.txt
Read counts for each geneExample command:
python htseq-launch.py -I SG13_004_004_CGCTCATT-ATAGAGGC_Aligned.sortedByCoord.out.bam -wd /scratch/users/tbencomo/RNA_seq/pipeline-tests/
DESeq-launch.py
performs differential expression analysis on several samples, reporting which genes are upregulated and downregualted.
DESeq-launch.py
is a wrapper for desq.sh. deseq.sh is a shell script to submit a SLURM sbatch job. The actual analysis is performed by an R script DESeq-analysis.R
NOTE: As of now, the R script determines if a sample is tumor or normal based on its filename. If the file contains 'NS' or 'ctrl', it is classified as normal. Otherwise, it is classified as a tumor. In the future, the user will be able to enter strings for the program to classify each condition.
-counts
|--counts-Directory
Directory containing HTSeq-count counts files
-results
|--results-Directory
Directory to store results files
unfiltered_normal_tumor_results.csv
CSV file containing all genes. Not filtered for pvalues or log2FoldChange levels
upregulated_genes.csv
CSV file containing all genes with log2FoldChange scores > 0 and with adjusted p-values <= .05
downregulated_genes.csv
CSV file containing all genes with log2FoldChange scores < 0 and with adjusted p-values <= .05
Example command:
python DESeq-launch.py -counts /scratch/PI/carilee/NatComm-Analysis/counts/ -results /scratch/PI/carilee/NatComm-Analysis/results/