jfear / ncbi_remap

This is the drosSRA project, where we are remapping all Drosophila melanogaster RNA-seq data to FlyBase release 6 and updating annotations.

DrosSRA Workflow

Objectives

Outline

  1. Use SraMongo to download associated metadata for all Drosophila melanogaster samples
  2. The Pre-Alignment workflow pre-processes all samples to automatically discover/validate technical metadata associated with each sample.
  3. The Alignment workflow processes all RNA-Seq samples that pass filtering to generate coverage counts and genome browser tracks.
  4. The Metadata workflow normalizes biological metadata to FlyBase controlled vocabulary.
  5. All data is accessible directly from GEO. We are also developing a command-line tool, drosSRA, that lets users access the data and run different types of queries. Finally, we provide genome browser visualization through the FlyBase JBrowse instance.

Setup

This project uses a set of Snakemake workflows. The workflows run in a pre-built Singularity container that has all of the required software.

To run the Singularity container you need to have Singularity, MongoDB, and Miniconda installed.

To create the runtime environment, run:

$ conda env create --file environment.yaml

Environment Variables

This project makes use of the following environment variables.

export SLACK_SNAKEMAKE_BOT_TOKEN=<secret>
export ENTREZ_API_KEY=<secret>

export PROJECT_PATH=<absolute path to where folder is cloned>
export SINGULARITY_IMG=$PROJECT_PATH/singularity/drosSRA_workflow.sif
export SINGULARITY_BINDPATH=<list of paths that need to be mounted in the singularity container>
export SLURM_JOBID=<optional, used to make temp directories on /lscratch>
export TMPDIR=<optional, used to make temp directories>
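
Before launching a workflow it can help to confirm that these variables are set. The following is a minimal sketch, not part of the project: the helper `missing_env_vars` is hypothetical, and it assumes that every variable not marked optional above is required.

```python
import os

# Assumption: every variable in the list above that is not marked
# "optional" is treated as required.
REQUIRED = [
    "SLACK_SNAKEMAKE_BOT_TOKEN",
    "ENTREZ_API_KEY",
    "PROJECT_PATH",
    "SINGULARITY_IMG",
    "SINGULARITY_BINDPATH",
]


def missing_env_vars(environ=None, required=REQUIRED):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching the real environment.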

Development

For development, I use some extra packages. Add them to the environment with:

$ conda env update -n drosSRA_workflow --file environment_dev.yaml

It is also helpful to pip install, in editable mode, the two Python packages that are distributed as part of this project.

$ conda activate drosSRA_workflow
[drosSRA_workflow] $ pip install -e src/
[drosSRA_workflow] $ pip install -e biometa-app/

Project Overview

Initialization workflow ./Snakefile

This workflow runs SraMongo and builds a list of all SRXs and their SRRs (./output/srx2srr.csv).

NOTE 1: SraMongo requires the environment variable $ENTREZ_API_KEY to be set. The API key can be generated from your NCBI account settings.

NOTE 2: The majority of workflows only need ./output/srx2srr.csv; they do not require the local MongoDB. I am not able to keep an always-on instance of MongoDB on our cluster, so the workflows that are typically run on the cluster (./prealn-wf/Snakefile and ./rnaseq-wf/Snakefile) do not need access to it.
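
As an illustration of how a downstream step might consume this file without MongoDB, here is a minimal sketch that groups SRRs by SRX. The column names `srx` and `srr`, the helper `load_srx2srr`, and the accessions in the example are all assumptions; check the header of ./output/srx2srr.csv before relying on them.

```python
import csv
import io
from collections import defaultdict


def load_srx2srr(handle, srx_col="srx", srr_col="srr"):
    """Group SRR accessions by SRX from an open CSV file handle.

    The column names are assumptions, not taken from the project.
    """
    mapping = defaultdict(list)
    for row in csv.DictReader(handle):
        mapping[row[srx_col]].append(row[srr_col])
    return dict(mapping)


# Made-up accessions standing in for the real file contents.
example = io.StringIO(
    "srx,srr\n"
    "SRX000001,SRR000001\n"
    "SRX000001,SRR000002\n"
    "SRX000002,SRR000003\n"
)
srx2srr = load_srx2srr(example)
```

With the real file, the same function could be called on `open("output/srx2srr.csv", newline="")`.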

FASTQ Download sub-workflow ./fastq-wf/Snakefile

This workflow downloads FASTQ data from the SRA and checks that the download was successful. It was initially designed as a subworkflow, but Snakemake did not run groups correctly with subworkflows. For now, I pull the rules from this workflow into ./prealn-wf/Snakefile and ./rnaseq-wf/Snakefile.
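
This README does not say how the download check is done. As one possible sanity check, and not necessarily what this workflow does, the sketch below verifies that a file consists of complete four-line FASTQ records. The function name `looks_like_fastq` is hypothetical.

```python
import io


def looks_like_fastq(handle):
    """Return True if `handle` holds one or more complete 4-line FASTQ records.

    This is an illustrative check only; it is an assumption, not the
    project's actual validation logic.
    """
    n_records = 0
    while True:
        header = handle.readline()
        if not header:
            break  # clean end of file
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        if not qual:
            return False  # file ends in the middle of a record
        if not header.startswith("@") or not plus.startswith("+"):
            return False  # record markers are missing
        if len(seq.rstrip("\n")) != len(qual.rstrip("\n")):
            return False  # sequence and quality lengths differ
        n_records += 1
    return n_records > 0


complete = io.StringIO("@read1\nACGT\n+\nIIII\n")
truncated = io.StringIO("@read1\nACGT\n+\n")
print(looks_like_fastq(complete), looks_like_fastq(truncated))
```

For a gzip-compressed download, the handle could come from `gzip.open(path, "rt")` in the standard library.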

Pre-Alignment workflow ./prealn-wf/Snakefile

This workflow downloads FASTQs for all samples and generates various QC features describing each sample.

Library Strategy workflow ./library_strategy-wf/Snakefile

This workflow runs outlier detection on the RNA-Seq samples. It creates a golden set of RNA-Seq samples to proceed with.

Alignment workflow ./rnaseq-wf/Snakefile

This is the meat of the project. It provides all deliverables.

Metadata workflow (WIP)

I am currently working on this section. I am hand-normalizing samples with the help of a web app (./biometa-app). Once this is done, I will build up the outlier detection for each discrete group.

Deprecated Workflows

I am in the middle of a major refactoring to simplify the project. The following workflows are deprecated or broken.