h3abionet / h3agatk

MIT License
21 stars 14 forks source link

H3ABioNet GATK Germline Workflow

Overview

A GATK best-practices germline workflow designed to work with GATK 3.5 (Van der Auwera et al., 2013).

More details can be found in our Google doc.

See our Trello board for more information about our development process.

Authors

Workflow Summary

pipeline

Original diagram document here.

Gray items can be optionally executed outside of the workflow as stand-alone tools. Indels can be recalibrated with VQSR or simply filtered depending on the number of indels being analyzed.

Dependencies

The workflow relies on the following requirements.

System Resources

We have tested on a 16 core, 128GB+ Ubuntu VM running on Azure. Similar systems should work. We recommend 1TB+ disk space if processing a 30x coverage genome, about 500G if processing an exome. We use the directories /data/working and /data/output in the examples below but whatever you choose make sure these are on volumes with plenty of space free.

System Services

Make sure you have installed the following on your Linux system:

To install cwltool:

pip install setuptools==28.8.0
pip install cwl-runner cwltool==1.0.20170217172322 schema-salad==2.2.20170222151604 avro==1.8.1

Alternatively, you can use the Dockstore command line, see the Dockstore install guide for more information.

Sample Data

We use two test datasets for this workflow. The first is a small NA12878 exome that takes about 8 hours to run on a 16 core box.

Exome

GIAB provides an exome from Garvin for NA12878 which is available using the test/prepare_test.sh script. See ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/ to download fastq pairs.

Whole Genome

GIAB also provides a 300x coverage genome for NA12878 from the Garvin, see ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/ to download fastq pairs.

Reference Files

The reference files are pulled from the GATK bundle v2.8 (hg19). The used files are:

These can be downloaded with the test/prepare_test.sh script.

Running

We use the CWL reference implementation with the cwltool. The following shows how to run the whole exome workflow with a working directory of /data/working and output going to /data/output. This assumes you have downloaded the exome test data and reference files using the test/prepare_test.sh script.

cwltool --debug --tmpdir-prefix /data/working --outdir /data/output --tmp-outdir-prefix /data/working GATK-complete-WES-Workflow-h3abionet.cwl GATK-complete-WES-Workflow-h3abionet.json  > >(tee stdout.exome.log) 2> >(tee stderr.exome.log >&2)

Alternatively you can use the Dockstore command line to execute the workflow.

Launch a cloud VM: e.g. AWS EC2

The user can run the workflow on any Unix-like platform, given that all the depenedencies are satisfied. For example the user can launch an EC2 instance. There are several ways to do so:

1- Through AWS web console:AWS own documentation, or Connor Leech's excellent step by step guide.

2- AWS CLI.

3- StarCluster

You can follow the instructions above to get the pipeline up and running.

Output

Here's the sample output, your output paths will differ depending on your system.

{
    "output_printReads": {
        "checksum": "sha1$ae24bdbb2aa253a217f181af13e91993f3df2d0c",
        "basename": "PrintReads-2017-04-23.bam",
        "location": "file:///mnt2/PrintReads-2017-04-23.bam",
        "path": "/mnt2/PrintReads-2017-04-23.bam",
        "class": "File",
        "size": 18211753946
    },
    "output_SnpVQSR_annotated_snps": {
        "checksum": "sha1$b79dd0de9da7b2ea13c8129c2ad2e6fd0f1c79ff",
        "basename": "HaplotypeCaller-2017-04-23.SNP.vqsr.ann.vcf",
        "location": "file:///mnt2/HaplotypeCaller-2017-04-23.SNP.vqsr.ann.vcf",
        "path": "/mnt2/HaplotypeCaller-2017-04-23.SNP.vqsr.ann.vcf",
        "class": "File",
        "size": 214860597
    },
    "output_IndelFilter_annotated_indels": {
        "checksum": "sha1$3cdd41a1b6c683813e29822854981431b74793e4",
        "basename": "filtered.indels.ann.vcf",
        "location": "file:///mnt2/filtered.indels.ann.vcf",
        "path": "/mnt2/filtered.indels.ann.vcf",
        "class": "File",
        "size": 31696522
    },
    "output_bamstat": {
        "format": "http://edamontology.org/format_3615",
        "checksum": "sha1$0cbb6ba1fc6182bd9c6d279fed07993ca2fee86e",
        "basename": "bamstats_report.zip",
        "location": "file:///mnt2/bamstats_report.zip",
        "path": "/mnt2/bamstats_report.zip",
        "class": "File",
        "size": 2507012
    },
    "output_SnpVQSR_recal_File": {
        "checksum": "sha1$861151b0c967466ce914bf1e2d09dd6e77030c02",
        "basename": "vqsr_recal.snps.recal",
        "location": "file:///mnt2/vqsr_recal.snps.recal",
        "path": "/mnt2/vqsr_recal.snps.recal",
        "class": "File",
        "size": 31796999
    },
    "output_HaplotypeCaller": {
        "checksum": "sha1$9a8264fc177950bfdec1b8e6aaf3484458166581",
        "basename": "HaplotypeCaller-2017-04-23.vcf",
        "location": "file:///mnt2/HaplotypeCaller-2017-04-23.vcf",
        "path": "/mnt2/HaplotypeCaller-2017-04-23.vcf",
        "class": "File",
        "size": 93031041
    }
}

With the runtime being approximately 7 hours on a 16 core VM.

real    407m37.153s
user    3m37.612s
sys     2m16.264s

Release

Releases can be found on the GitHub releases page.

The workflow is also published to the Dockstore here.

Workflow Tool Details

FastQC

FastQC is used as an initial QC step where the input files are checked for usual metrics such as:

Trimmomatic

Trimmomatic is the entry point of the pipeline, it is used to cleanup the reads in the input fastq files from any sequencing adaptors.

BWA

BWA is used to align the reads from the the input fastq files -paired-ends- (Li, 2013). We use specifically bwa mem as recommended by the GATK best-practices. BWA produces a SAM file containing the aligned reads against the human reference genome (hg19, GATK bundle build 2.8).

As GATK tools downstream requires properly formatted Read Group information. We add by default 'toy' Read Group information while processing the alignment to the output SAM file. we specifically use the flag -R '@RG\tID:foo\tSM:bar\tLB:library1'.

SAMtools

SAMtools (Li et al., 2009) are used few times in the pipeline:

  1. Convert BWA's output from a SAM format to a BAM format
  2. Sort the reads in the generated BAM file in step 1 (above)
  3. Indexing the BAM file for the following tools to use

Picard

Picard tools are used to mark duplicated reads in the aligned and sorted BAM file, making thus the files lighter and less prone to errors in the downstream steps of the pipeline.

GATK

Genome Analysis Tool Kit refered to as GATK (DePristo et al., 2011) is used to process the data throught multiple steps as described by the GATK best-practices (i.e. figure bellow). GATK best-practices pipeline The GATK steps are the following:

  1. Indel Realignment:
    1. Realign Target Creator
    2. Indel Realigner
  2. Mark Duplicates (a picard step)
  3. Base Quality Score Recalibration (BQSR):
    1. Base Recalibrator
    2. Print Reads
  4. Haplotype Caller
  5. Variant Quality Score Recalibration (VQSR):
    1. Variant Recalibrator
    2. Apply Recalibration

SnpEff

SnpEff is used in this pipeline to annotate the variant calls (Cingolani et al., 2012). The annotation is extensive and uses multi-database approach to provide the user with as much information about the called variants as possible.

BAMStat

BAMStats, is a simple software tool built on the Picard Java API (2), which can calculate and graphically display various metrics derived from SAM/BAM files of value in QC assessments.

TODO

See our Trello board for more information.

References

To cite this pipeline, please use:

Baichoo, S., Souilmi, Y., Panji, S., Botha, G., Meintjes, A., Hazelhurst, S., Bendou, H., Beste, E. de, et al. 2018. Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics. BMC Bioinformatics. 19(1):457. DOI: 10.1186/s12859-018-2446-1.