openvax / neoantigen-vaccine-pipeline

Bioinformatics pipeline for selecting patient-specific cancer neoantigen vaccines
Apache License 2.0
75 stars 25 forks source link

neoantigen-vaccine-pipeline

This repository is the public version of the bioinformatics pipeline for selecting patient-specific cancer neoantigen vaccines developed by the openvax group at Mount Sinai. This pipeline is currently the basis for two phase I clinical trials using synthetic long peptides, NCT02721043 and NCT03223103.

The pipeline used for these trials differs slightly from the version given here due to licensing restrictions on the NetMHC suite of tools, which prevent their inclusion in the provided Docker image. To circumvent this issue, the open source pipeline performs MHC binding prediction using the IEDB web interface to these tools. This may be slower but should give the same results. If you have a license to the NetMHC tools (free for non-commercial use) and wish to run these tools locally in the pipeline, please contact us or file an issue.

The pipeline is implemented using the Snakemake workflow management system. We recommend running it using our provided Docker image, which includes all needed dependencies.

Overview

This pipeline assumes you have the following datasets.

The steps performed by the workflow are as follows.

Running using Docker

For best results, you will need a machine with Docker installed and the following requirements:

The pipeline is run by invoking a Docker entrypoint in the image while providing three directories as mounted Docker volumes: /inputs (FASTQ files and a configuration YAML), /outputs (directory to write results to), and /reference-genome (data shared across patients, such as the genome reference).

Note that all three directories and their contents must be world writable. This is necessary because the Docker pipeline runs as an unprivileged user and not as you. Data is modified in the outputs directory as well as in the reference-genome directory, since indexing the reference genome for use by aligners and other tools requires writing results to this directory.

First we will make a world-writable reference-genome directory.

mkdir -p reference-genome
chmod -R a+w reference-genome

While the pipeline supports preparing your choice of reference FASTA for use by aligners and other tools, for a quick start we have made available a processed version of the b37decoy and mm10 genomes in Google Cloud. Visit https://storage.cloud.google.com/reference-genomes/b37decoy.tar.gz to download a zipped version of all b37decoy files (~30GB), or https://storage.cloud.google.com/reference-genomes/mm10.tar.gz for the zipped mm10 archive (~28GB), and save it to the reference-genome directory. If you have installed gsutil, you can download the file faster. See here for more details on accessing public data in Google Cloud.

For the remainder of this README, we will assume that you are working with the b37decoy reference data.

gsutil -m cp gs://reference-genomes/b37decoy.tar.gz reference-genome/

Now we will uncompress the reference genome data, inside your reference-genome directory.

cd reference-genome
tar -zxvf b37decoy.tar.gz

Now we will download a test sequencing dataset, consisting of a YAML config file and two small FASTQ files of reads overlapping a single somatic mutation. For this simple test, we will re-use the tumor DNA sequencing as our RNA reads.

mkdir inputs
cd inputs
wget https://github.com/openvax/neoantigen-vaccine-pipeline/raw/master/test/idh1_config.yaml
wget https://github.com/openvax/neoantigen-vaccine-pipeline/raw/master/datagen/idh1_r132h_normal.fastq.gz
wget https://github.com/openvax/neoantigen-vaccine-pipeline/raw/master/datagen/idh1_r132h_tumor.fastq.gz
cd ..
chmod -R a+w inputs

And we’ll make an empty output directory:

mkdir outputs
chmod -R a+w outputs

Now we may run the pipeline. Note that the Docker volume option (-v) requires absolute paths. We use $(realpath <dirname>) to get the absolute path to the directories created above on the host machine. You can also just replace those with absolute paths.

docker run -it \
-v $(realpath inputs):/inputs \
-v $(realpath outputs):/outputs \
-v $(realpath reference-genome):/reference-genome \
openvax/neoantigen-vaccine-pipeline:latest \
--configfile=/inputs/idh1_config.yaml

This should create the final vaccine peptide results as well as many intermediate files in your outputs directory, under a subdirectory specified in the id field in the YAML config (in this case, idh1-test-sample). See the final results at $(realpath outputs)/idh1-test-sample/vaccine-peptide-report_netmhcpan-iedb_mutect-strelka.txt.

If the pipeline errors, the offending job will be highlighted in red and will mention a log file. The first step in understanding what went wrong is to look at that log file, which will live in the directory $(realpath outputs)/idh1-test-sample.

If you want to see all available pipeline options (e.g. ability to execute a dry run, specify memory/CPU resources for the pipeline), run:

docker run openvax/neoantigen-vaccine-pipeline:latest -h

If you want to poke around in the image to execute tools manually or inspect versions:

docker run -it \
-v $(realpath inputs):/inputs \
-v $(realpath outputs):/outputs \
-v $(realpath reference-genome):/reference-genome \
--entrypoint /bin/bash \
openvax/neoantigen-vaccine-pipeline:latest

Running the dockerized pipeline with your own data

Use the test IDH config file as a template for your own config file you can pass to the pipeline. Assuming you've successfully run the pipeline on the test data, you should only need to customize the input section.

Please note:

Using the mm10 or your own reference genome

The pipeline supports processing for any reference genome you want to use. You must include:

A link to COSMIC is only relevant for human genomes, and is optional in any config used to run the pipeline. For the mm10 genome, we have provided these reference files, contained in the archive on Google Cloud. See the test IDH config file as an example of how to specify the aforementioned 3 reference paths.

Note that if the reference genome you want to use is not part of the Ensembl standard (GRCh37/hg19, GRCh38/hg20, GRCm38/mm10, etc.), you can use this pipeline to do Strelka/Mutect/Mutect2 variant calling. However, you cannot use this pipeline to compute ranked vaccine peptides. This will be available in a future version.

Intermediate files

As a result of the full pipeline run, many intermediate files are generated in the output directory. In case you want to reuse these for a different pipeline run (e.g. if you have one normal sample and several tumor samples, each of which you want to run against the normal), any intermediate file you copy to the new location will tell Snakemake to not repeat that step (or its substeps, unless they're needed for some other workflow node). For that reason, it's helpful to know the intermediate file paths. You can also run parts of the pipeline used to generate any of the intermediate files, specifying one or more as a target to the Docker run invocation. Example, if you use the test IDH config:

docker run -it \
-v <your inputs dir>:/inputs \
-v <your outputs dir>:/outputs \
-v <your reference genome dir>:/reference-genome \
openvax/neoantigen-vaccine-pipeline:latest \
--configfile=/inputs/idh1_config.yaml \
--target=/outputs/idh1-test-sample/tumor_aligned_coordinate_sorted_dups_indelreal_bqsr.bam \
--target=/outputs/idh1-test-sample/normal_merged_aligned_coordinate_sorted.bam

This (somewhat arbitrary) example will run alignment on normal DNA, and alignment + full GATK process on the tumor DNA.

Here are some of the intermediate file names you might use as targets, in a sample's output directory:

Running without Docker

To get started with pipeline development and rule definition, install the Python dependencies:

pip install -r requirements.txt

Testing

You can run a small local unit test which simulates a pipeline dependency graph and does not require Docker. Once you clone this repo and install the Python requirements, run:

nosetests