tjbencomo / rna-pipeline

RNA sequencing pipeline for Stanford's Sherlock HPC cluster
1 stars 2 forks source link
rna-seq-pipeline

rna-pipeline

Project Status

This project is now deprecated. For a RNA-Seq pipeline that works on SLURM clusters like Stanford's Sherlock with minimal setup, see here

I've recently discovered Cromwell, a workflow engine that easily scales tasks like RNASeq processing on HPC clusters. variant-discovery-pipeline uses Cromwell to abstract large amounts of pipeline code. rna-pipeline needs a major update to take advantage of Cromwell features.

RNA-Seq analysis pipeline for the Lee Lab at Stanford University. The pipeline is designed to mimic dnanexus function on Stanford's Sherlock HPC cluster. The pipeline receives as input paired-end FASTQ files and performs alignment as well as several analysis functions.

Pipeline Design

The pipeline consists of 3 different steps:

  1. Alignment
  2. RNA Expression Quantification
  3. Differential Expression Analysis

Alignment is first completed using the STAR Aligner. Its settings mimic those found in ENCODE's long-rna-seq-pipeline. RNA Expression Quantification and Gene Expression Analysis are performed in parallel. RNA Expression Quantification is computed using RSEM software package. Gene Expression Analysis is performed first by counting RNA reads with HTSeq-count and then analyzing counts with DESeq2.

Pipeline Usage

pipeline.py is the master program to fully automate the pipeline. The user enters a directory containing paired end FASTQ files they wish to process, and the pipeline runs creates an output directory containing the aligned bam files, rsem quantification files, and htseq-count read counts file. Once the rsem and htseq jobs have all finished, use the DESeq-launch.py command to analyze the differential expression of all the samples.

Input

-I|--inputDirectory Input directory containing paired end gzipped FASTQ files. FASTQ files must end in .gz extension

-gDir|--genomeDirectory Location of the genome directory required by STAR. Must be generated by the user prior to program use. Consult STAR documentation to generate the directory.

-rDir|rsemReferenceDirectory Location of the reference directory required by RSEM. Must be generated by the user prior to program use. Consult RSEM documentation to generate the directory.

-output|--outputDirectory (Optional) Directory location for where to store all files generated by the pipeline. If not specified, the pipeline will create a new directory, named *input_directory_path*-pipeline-output

Example command: python pipeline.py -I /scratch/PI/carilee/tomas/CACYBP_KD_fastqs/ -rDir /scratch/users/tbencomo/RNA_seq/refs/out/rsem -gDir /scratch/users/tbencomo/RNA_seq/refs/out

Individual Pipeline Components

STAR Aligner

star-aligner.py performs sequence alignment with the STAR aligner

star-aligner.py acts as a wrapper for star.sh the bash script that executes the star aligner. star-aligner.py submits a slurm sbatch job to run the STAR aligner.

Inputs

-wd|--workingDirectory Working Directory: Where all output files will be located

-f1|--fastq1 FASTQ file 1: 1 of the two paired end fastq files to align

-f2|--fastq2 FASTQ file 2: 1 of the two paired end fastq files to align

-gDir|--genomeDirectory Genome Directory: Location of reference files for star aligner

-prefix|--outFilePrefix Output File Prefix: file prefix that will be appended to start of output files. Optional. By default, the prefix is set to the filename of FASTQ1

Outputs

(Files are prefixed with specified or default prefix above)

Example command: python star-aligner.py -wd /scratch/users/tbencomo/RNA_seq/pipeline-tests -gDir /scratch/users/tbencomo/RNA_seq/refs/out -f1 /scratch/users/tbencomo/RNA_seq/input_files/SG13_004_004_CGCTCATT-ATAGAGGC_R1.fastq -f2 /scratch/users/tbencomo/RNA_seq/input_files/SG13_004_004_CGCTCATT-ATAGAGGC_R2.fastq

RSEM Expression Quantification

rsem-calculate.py performs rna expression quantification with the RSEM software package.

rsem-calculate.py is a wrapper for rsem.sh, which actually calls the rsem-calculate-expression program. rsem-calculate.py submits a SLURM sbatch job of rsem.sh

Inputs

-I|--input-bam-file Input Bam: RNA-Seq aligned transcriptome bam file

-rDir|--rsem-ref-directory RSEM Reference Directory: Prebuilt files RSEM needs to run

wd|--working-directory Working Directory: Where all output files are located

Outputs

(Files are prefixed with prefix computated from input bam file - contains filename until 'Aligned')

Example command: python rsem-calculate.py -wd /scratch/users/tbencomo/RNA_seq/pipeline-tests/ -I /scratch/users/tbencomo/RNA_seq/pipeline-tests/SG13_004_004_CGCTCATT-ATAGAGGC_Aligned.toTranscriptome.out.bam -rDir /scratch/users/tbencomo/RNA_seq/refs/out/rsem

HTSeq Read Counts

htseq-launch.py performs a count of all the read sequences from the aligned sorted by coordinate bam input file via the HTSeq software package.

htseq-launch.py is a wrapper for htseq.sh, which actually calls the htseq-count program. htseq-launch.py submits a SLURM sbatch job of htseq.sh

Inputs

-I|--input-bam-file Input Bam: RNA-Seq aligned sorted by coordinates bam file

wd|--working-directory Working Directory: Where all output files are located

Outputs

(Files are prefixed with prefix computated from input bam file - contains filename until 'Aligned')

Example command: python htseq-launch.py -I SG13_004_004_CGCTCATT-ATAGAGGC_Aligned.sortedByCoord.out.bam -wd /scratch/users/tbencomo/RNA_seq/pipeline-tests/

DESeq2 Differential Expression Analysis

DESeq-launch.py performs differential expression analysis on several samples, reporting which genes are upregulated and downregualted.

DESeq-launch.py is a wrapper for desq.sh. deseq.sh is a shell script to submit a SLURM sbatch job. The actual analysis is performed by an R script DESeq-analysis.R

NOTE: As of now, the R script determines if a sample is tumor or normal based on its filename. If the file contains 'NS' or 'ctrl', it is classified as normal. Otherwise, it is classified as a tumor. In the future, the user will be able to enter strings for the program to classify each condition.

Inputs

-counts|--counts-Directory Directory containing HTSeq-count counts files

-results|--results-Directory Directory to store results files

Outputs

unfiltered_normal_tumor_results.csv CSV file containing all genes. Not filtered for pvalues or log2FoldChange levels upregulated_genes.csv CSV file containing all genes with log2FoldChange scores > 0 and with adjusted p-values <= .05 downregulated_genes.csv CSV file containing all genes with log2FoldChange scores < 0 and with adjusted p-values <= .05

Example command: python DESeq-launch.py -counts /scratch/PI/carilee/NatComm-Analysis/counts/ -results /scratch/PI/carilee/NatComm-Analysis/results/