A GATK best-practices germline workflow designed to work with GATK 3.5 (Van der Auwera et al., 2013).
More details can be found in our Google doc.
See our Trello board for more information about our development process.
Original diagram document here.
Gray items can be optionally executed outside of the workflow as stand-alone tools. Indels can be recalibrated with VQSR or simply filtered depending on the number of indels being analyzed.
The workflow relies on the following requirements.
We have tested on a 16 core, 128GB+ Ubuntu VM running on Azure. Similar systems should work. We recommend 1TB+ disk space if processing a 30x coverage genome, about 500G if processing an exome. We use the directories /data/working
and /data/output
in the examples below but whatever you choose make sure these are on volumes with plenty of space free.
Make sure you have installed the following on your Linux system:
To install cwltool
:
pip install setuptools==28.8.0
pip install cwl-runner cwltool==1.0.20170217172322 schema-salad==2.2.20170222151604 avro==1.8.1
Alternatively, you can use the Dockstore command line, see the Dockstore install guide for more information.
We use two test datasets for this workflow. The first is a small NA12878 exome that takes about 8 hours to run on a 16 core box.
GIAB provides an exome from Garvin for NA12878 which is available using the test/prepare_test.sh
script. See ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/ to download fastq pairs.
GIAB also provides a 300x coverage genome for NA12878 from the Garvin, see ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/ to download fastq pairs.
The reference files are pulled from the GATK bundle v2.8 (hg19). The used files are:
These can be downloaded with the test/prepare_test.sh
script.
We use the CWL reference implementation with the cwltool
. The following shows how to run the whole exome workflow with a working directory of /data/working
and output going to /data/output
. This assumes you have downloaded the exome test data and reference files using the test/prepare_test.sh
script.
cwltool --debug --tmpdir-prefix /data/working --outdir /data/output --tmp-outdir-prefix /data/working GATK-complete-WES-Workflow-h3abionet.cwl GATK-complete-WES-Workflow-h3abionet.json > >(tee stdout.exome.log) 2> >(tee stderr.exome.log >&2)
Alternatively you can use the Dockstore command line to execute the workflow.
The user can run the workflow on any Unix-like platform, given that all the depenedencies are satisfied. For example the user can launch an EC2 instance. There are several ways to do so:
1- Through AWS web console:AWS own documentation, or Connor Leech's excellent step by step guide.
2- AWS CLI.
3- StarCluster
You can follow the instructions above to get the pipeline up and running.
Here's the sample output, your output paths will differ depending on your system.
{
"output_printReads": {
"checksum": "sha1$ae24bdbb2aa253a217f181af13e91993f3df2d0c",
"basename": "PrintReads-2017-04-23.bam",
"location": "file:///mnt2/PrintReads-2017-04-23.bam",
"path": "/mnt2/PrintReads-2017-04-23.bam",
"class": "File",
"size": 18211753946
},
"output_SnpVQSR_annotated_snps": {
"checksum": "sha1$b79dd0de9da7b2ea13c8129c2ad2e6fd0f1c79ff",
"basename": "HaplotypeCaller-2017-04-23.SNP.vqsr.ann.vcf",
"location": "file:///mnt2/HaplotypeCaller-2017-04-23.SNP.vqsr.ann.vcf",
"path": "/mnt2/HaplotypeCaller-2017-04-23.SNP.vqsr.ann.vcf",
"class": "File",
"size": 214860597
},
"output_IndelFilter_annotated_indels": {
"checksum": "sha1$3cdd41a1b6c683813e29822854981431b74793e4",
"basename": "filtered.indels.ann.vcf",
"location": "file:///mnt2/filtered.indels.ann.vcf",
"path": "/mnt2/filtered.indels.ann.vcf",
"class": "File",
"size": 31696522
},
"output_bamstat": {
"format": "http://edamontology.org/format_3615",
"checksum": "sha1$0cbb6ba1fc6182bd9c6d279fed07993ca2fee86e",
"basename": "bamstats_report.zip",
"location": "file:///mnt2/bamstats_report.zip",
"path": "/mnt2/bamstats_report.zip",
"class": "File",
"size": 2507012
},
"output_SnpVQSR_recal_File": {
"checksum": "sha1$861151b0c967466ce914bf1e2d09dd6e77030c02",
"basename": "vqsr_recal.snps.recal",
"location": "file:///mnt2/vqsr_recal.snps.recal",
"path": "/mnt2/vqsr_recal.snps.recal",
"class": "File",
"size": 31796999
},
"output_HaplotypeCaller": {
"checksum": "sha1$9a8264fc177950bfdec1b8e6aaf3484458166581",
"basename": "HaplotypeCaller-2017-04-23.vcf",
"location": "file:///mnt2/HaplotypeCaller-2017-04-23.vcf",
"path": "/mnt2/HaplotypeCaller-2017-04-23.vcf",
"class": "File",
"size": 93031041
}
}
With the runtime being approximately 7 hours on a 16 core VM.
real 407m37.153s
user 3m37.612s
sys 2m16.264s
Releases can be found on the GitHub releases page.
The workflow is also published to the Dockstore here.
FastQC is used as an initial QC step where the input files are checked for usual metrics such as:
Trimmomatic is the entry point of the pipeline, it is used to cleanup the reads in the input fastq files from any sequencing adaptors.
BWA is used to align the reads from the the input fastq files -paired-ends- (Li, 2013). We use specifically bwa mem
as recommended by the GATK best-practices. BWA produces a SAM file containing the aligned reads against the human reference genome (hg19, GATK bundle build 2.8).
As GATK tools downstream requires properly formatted Read Group information. We add by default 'toy' Read Group information while processing the alignment to the output SAM file. we specifically use the flag -R '@RG\tID:foo\tSM:bar\tLB:library1'
.
SAMtools (Li et al., 2009) are used few times in the pipeline:
Picard tools are used to mark duplicated reads in the aligned and sorted BAM file, making thus the files lighter and less prone to errors in the downstream steps of the pipeline.
Genome Analysis Tool Kit refered to as GATK (DePristo et al., 2011) is used to process the data throught multiple steps as described by the GATK best-practices (i.e. figure bellow). The GATK steps are the following:
SnpEff is used in this pipeline to annotate the variant calls (Cingolani et al., 2012). The annotation is extensive and uses multi-database approach to provide the user with as much information about the called variants as possible.
BAMStats, is a simple software tool built on the Picard Java API (2), which can calculate and graphically display various metrics derived from SAM/BAM files of value in QC assessments.
See our Trello board for more information.
WARN 09:30:55,672 RMDTrackBuilder - Unable to write to /var/lib/cwl/stg24cab595-694a-4cac-b90a-abe40abdeaa2/Mills_and_1000G_gold_st.vcf.idx for the index file, creating index in memory only
To cite this pipeline, please use:
Baichoo, S., Souilmi, Y., Panji, S., Botha, G., Meintjes, A., Hazelhurst, S., Bendou, H., Beste, E. de, et al. 2018. Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics. BMC Bioinformatics. 19(1):457. DOI: 10.1186/s12859-018-2446-1.