GATK
A Nextflow wrapper for the Genome Analysis Toolkit (GATK), modified from the pipeline described in the Bioinformatic Workbook: GATK Best Practices Workflow for DNA-Seq.
This Nextflow wrapper extends GATK's functionality from DNAseq data, to enabling variant calling from RNA-Seq and long-read sequencing data. It provides a comprehensive solution for diverse sequencing platforms, enhancing variant analysis capabilities.
Installation
You will need a working version of nextflow, see here on how to install nextflow. Nextflow modules are avialable on some of the HPC computing resources.
See modules on HPC clusters
```
# === Nova
module load gcc/7.3.0-xegsmw4 nextflow
module load singularity
NEXTFLOW=nextflow
# === Condo
module load gcc/7.3.0-xegsmw4 nextflow
module load singularity
NEXTFLOW=nextflow
# === Ceres
module load nextflow
# singularity already available, no need for module
NEXTFLOW=nextflow
# === Atlas (will need a local install of nextflow and will need the --account "projectname" flag)
module load singularity
NEXTFLOW=/project/isu_gif_vrsc/programs/nextflow
```
nextflow run isugifNF/GATK --help
See help statement
```
N E X T F L O W ~ version 20.10.0
Launching `isugifNF/GATK` [big_kare] - revision: ca139b5b5f
Usage:
The typical command for running the pipeline are as follows:
DNAseq:
nextflow run main.nf --genome GENOME.fasta --reads "*_{R1,R2}.fastq.gz" --seq "dna" -profile singularity
nextflow run main.nf --genome GENOME.fasta --reads_file READ_PATHS.txt --seq "dna" -profile singularity
RNAseq:
nextflow run main.nf --genome GENOME.fasta --gtf "genes.gtf" --reads "*_{R1,R2}.fastq.gz" --seq "rna" -profile singularity
PacBio Long Reads:
nextflow run main.nf --genome GENOME.fasta --long_reads "*.fastq.gz" --seq "longread" -profile singularity
Mandatory arguments:
--seq Specify input sequence type as 'dna', 'rna', or 'longread' [default:'dna'].
--genome Reference genome fasta file, against which reads will be mapped to find Variant sites
Read input arguments:
--reads Paired-end reads in fastq.gz format, will need to specify glob (e.g. "*_{R1,R2}.fastq.gz")
--reads_file Text file (tab delimited) with three columns [readname left_fastq.gz right_fastq.gz]. Will need full path for files.
--long_reads Long read file in fastq.gz format, will need to specify glob (e.g. "*.fastq.gz")
Optional analysis arguments:
--invariant Output invariant sites [default:false]
--gtf Gene Transfer Format file, only required for RNAseq input [default:false]
--window Window size passed to bedtools for parallel GATK Haplotype calls [default:100000]
Optional configuration arguments:
-profile Configuration profile to use. Can use multiple (comma separated)
Available: local, slurm, singularity, docker [default:local]
--container_img Container image used for singularity and docker [default:'docker://ghcr.io/aseetharam/gatk:master']
GATK:
--gatk_app Link to gatk executable [default: 'gatk']
--java_options Java options for gatk [default:'-Xmx80g -XX:+UseParallelGC -Djava.io.tmpdir=$TMPDIR']
--gatk_cluster_options GATK cluster options [default:'false']
Aligners:
--bwamem2_app Link to bwamem2 executable [default: 'bwa-mem2']
--star_app Link to star executable [default: 'STAR']
--star_index_params Parameters for star index [default: ' ']
--star_index_params Parameters to pass to STAR index [default:'']
--star_index_file Optional: speedup by providing a prebuilt STAR indexed genome [default: 'false']
--pbmm2_app Link to pbmm2 executable [default: 'pbmm2']
Other:
--samtools_app Link to samtools executable [default: 'samtools']
--bedtools_app Link to bedtools executable [default: 'bedtools']
--datamash_app Link to datamash executable [default: 'datamash']
--vcftools_app Link to vcftools executable [default: 'vcftools']
Optional other arguments:
--outdir Output directory [default:'./GATK_Results']
--threads Threads per process [default:4 for local, 16 for slurm]
--queueSize Maximum jobs to submit to slurm [default:40]
--account HPC account name for slurm sbatch, atlas and ceres requires this
--help Print this help message
NOTE: Genome name should not contain any periods before the ".fasta"
```
Singularity Container
Tools required for the workflow are included in the container aseetharam/gatk:latest and should be automatically pulled by nextflow. (Will only need to run singularity pull
if website connection is unstable.)
If website connection is unstable, pull singularity and use the `-with-singularity` flag
#### To pull the image
```
singularity pull --name gatk.sif shub://aseetharam/gatk:latest
```
#### Link image to Nextflow using the `-with-singularity` flag.
```
nextflow run isugifNF/GATK \
--genome "test-data/ref/b73_chr1_150000001-151000000.fasta" \
--reads "test-data/fastq/*_{R1,R2}.fastq.gz" \
-profile slurm \
-with-singularity gatk.sif
```
Test Dataset
A simple DNAseq test dataset (test-data
) is available on ISU Box. This dataset contains a small genome (portion of chr1, B73v5 ), and Illumina short reads for 26 NAM lines (including B73) and B73Ab10 line (27 lines total).
Only the reads that map to the region of the v5 genome is included, so that this can be tested quickly.
There are examples of multiple files belonging to same NAM line as well as single file per NAM line to make sure both conditions works correctly.
The end VCF file should have exactly 27 individuals (lines) in them.
wget https://iastate.box.com/shared/static/wt85l6s4nw4kycm2bo0gpgjq752osatu.gz
tar -xf wt85l6s4nw4kycm2bo0gpgjq752osatu.gz
ls -1 test-data/
#> fastq # <= folder of paired reads
#> read-group.txt
#> ref # <= folder containing one genome reference
Running the Pipeline
This pipeline can process DNAseq, RNAseq and PacBio long read data, which require different arguments and different input files. Please jump to the section based off of your input below.
GATK variant calling for DNAseq data
Because this pipeline was initially developed to process DNAseq data, we have provided a test DNAseq dataset. You can fetch the pipeline and the test-data folder.
# Fetch pipeline
nextflow run isugifNF/GATK --help
# Fetch the test-data folder from ISU box
wget https://iastate.box.com/shared/static/wt85l6s4nw4kycm2bo0gpgjq752osatu.gz
tar -xf wt85l6s4nw4kycm2bo0gpgjq752osatu.gz
The general format of a run with the pipeline is to provide a genome file (--genome
) and Illumina Paired-End Reads files (--reads
or --reads_file
).
nextflow run isugifNF/GATK \
--genome test-data/ref/b73_chr1_150000001-151000000.fasta \
--reads "test-data/fastq/*_{R1,R2}.fastq.gz" \
-profile slurm,singularity \
-resume
or
nextflow run isugifNF/GATK \
--genome test-data/ref/b73_chr1_150000001-151000000.fasta \
--reads_file read-path.txt \
-profile slurm,singularity \
-resume
If you are on a HPC (Nova/Condo/Ceres/Atlas), it is highly recommend to use the submit_nf.slurm
script. The # === Load Modules
section will need to be modified to get nextflow and singulariy running.
See Module Changes
```
# === Nova
module load gcc/7.3.0-xegsmw4 nextflow
module load singularity
NEXTFLOW=nextflow
# === Condo
module load gcc/7.3.0-xegsmw4 nextflow
module load singularity
NEXTFLOW=nextflow
# === Ceres
module load nextflow
# singularity already available, no need for module
NEXTFLOW=nextflow
# === Atlas
module load singularity
NEXTFLOW=/project/isu_gif_vrsc/programs/nextflow
```
Using the --reads_file
Instead of using a pattern to specify paired reads ("test-data/fastq/*_{R1,R2}.fastq.gz"
), we can use a tab-delimited textfile to specify the path to left and right files. The textfile should contain three columns: readname, left read path, right read path.
In our case, for the test-data
run the following one-liner to generate a read-path.txt
.
for f in test-data/fastq/*_R1.fastq.gz; do echo -e "$(basename $f |cut -f 1 -d "_")\t$(realpath $f)\t$(realpath $f | sed 's/_R1.fastq.gz/_R2.fastq.gz/g')"; done > read-path.txt
See example read-path.txt
```
BioSample01 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample01_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample01_R2.fastq.gz
BioSample02 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample02_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample02_R2.fastq.gz
BioSample03 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample03_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample03_R2.fastq.gz
BioSample04 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample04_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample04_R2.fastq.gz
BioSample05 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample05_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample05_R2.fastq.gz
```
DNAseq final output
The final output will be in a results
folder. SNPs will be in the VCF file, probably the file with the longest name (e.g. first-round_merged_snps-only_snp-only.pass-only.vcf
).
results/
|_ 01_MarkAdapters/ #<= folders contain intermediate files
|_ 02_MapReads/
|_ 03_PrepGATK/
|_ 04_GATK/
|_ 05_FilterSNPs/
| |_ first-round_merged_snps-only_sorted_snp-only.pass-only.vcf #<= final SNP file
|
|_ report.html
|_ timeline.html # <= runtime information for all processes
GATK variant calling for RNAseq
In order to process RNAseq, we use the STAR
aligner. To use the RNAseq variant calling pipeline, make the following parameter changes.
nextflow run isugifNF/GATK \
--seq "rna" \
--genome "genome/genome.fa" \
--reads "data/*.{r1,r2}.fq.gz" \
--gtf "genome/genome.gtf" \
-profile slurm,singularity \
-resume
RNAseq final output
The final output will be in a results
folder. SNPs will be in the VCF file, probably the file with the longest name (e.g. first-round_merged_snps-only_snp-only.pass-only.vcf
).
results/
|_ 01_MarkAdapters/ #<= folders contain intermediate files
|_ 02_MapReads/
|_ 03_PrepGATK/
|_ 04_GATK/
|_ 05_FilterSNPs/
| |_ first-round_merged_snps-only_sorted_snp-only.pass-only.vcf #<= final SNP file
|
|_ report.html
|_ timeline.html # <= runtime information for all processes
GATK variant calling for long read data
In order to process long read data obtained from the PacBio platform, we use the pbmm2
aligner. To use the long read variant calling pipeline, make the following parameter changes.
nextflow run isugifNF/GATK \
--seq "longread" \
--genome "genome/genome.fa" \
--reads "data/long-reads.fastq" \
-profile slurm,singularity \
-resume
Long read final output
The final output will be in a results
folder. SNPs will be in the VCF file, probably the file with the longest name (e.g. output_snp-only.pass-only.vcf
).
results/
|_ 02_MapReads/
|_ 03_PrepGATK/
|_ 04_GATK/
|_ 05_FilterSNPs/
| |_ output_snp-only.pass-only.vcf #<= final SNP file
|
|_ report.html
|_ timeline.html # <= runtime information for all processes
Example Runs
Some example runs provided to show nextflow output.
Example runs for DNAseq data
See example run on Ceres HPC - last update: 14 April 2021
Runtime: 1 hour 9 minutes and 27 seconds.
```
$ module load nextflow
$ nextflow run main.nf \
--genome "test-data/ref/b73_chr1_150000001-151000000.fasta" \
--reads "test-data/fastq/*_{R1,R2}.fastq.gz" \
--queueSize 25 \
-profile slurm,singularity \
-resume
N E X T F L O W ~ version 20.07.1
Launching `main.nf` [exotic_poincare] - revision: ca139b5b5f
executor > slurm (155)
[8c/e6342a] process > FastqToSam (BioSample26) [100%] 27 of 27 ✔
[6d/13aa51] process > MarkIlluminaAdapters (27_Bi... [100%] 27 of 27 ✔
[dc/032273] process > SamToFastq (20_BioSample24_... [100%] 27 of 27 ✔
[0e/a8d488] process > bwamem2_index (b73_chr1_150... [100%] 1 of 1 ✔
[2f/b3746a] process > bwamem2_mem (20_BioSample24) [100%] 27 of 27 ✔
[bc/8e43cc] process > CreateSequenceDictionary (b... [100%] 1 of 1 ✔
[fb/c2b500] process > samtools_faidx (b73_chr1_15... [100%] 1 of 1 ✔
[64/b1241a] process > MergeBamAlignment (20_BioSa... [100%] 27 of 27 ✔
[16/ea1c05] process > bedtools_makewindows (b73_c... [100%] 1 of 1 ✔
[34/14a2e1] process > gatk_HaplotypeCaller (chr1:... [100%] 10 of 10 ✔
[a2/6d9b6d] process > merge_vcf [100%] 1 of 1 ✔
[dd/187a85] process > vcftools_snp_only (first-ro... [100%] 1 of 1 ✔
[cc/367a9a] process > SortVcf (first-round_merged... [100%] 1 of 1 ✔
[ef/102130] process > calc_DPvalue (first-round_m... [100%] 1 of 1 ✔
[8f/281023] process > VariantFiltration (first-ro... [100%] 1 of 1 ✔
[fc/5d8e7c] process > keep_only_pass (first-round... [100%] 1 of 1 ✔
Completed at: 14-Apr-2021 01:51:21
Duration : 1h 9m 27s
CPU hours : 7.3
Succeeded : 155
```
See example run on Atlas HPC - last update: 14 April 2021
Runtime: 50 minutes and 50 seconds.
```
$ module load singularity
$ NEXTFLOW=/project/isu_gif_vrsc/programs/nextflow
$ $NEXTFLOW run main.nf \
--genome "test-data/ref/b73_chr1_150000001-151000000.fasta" \
--reads "test-data/fastq/*_{R1,R2}.fastq.gz" \
--queueSize 50 \
--account isu_gif_vrsc \
-profile slurm,singularity \
-resume
N E X T F L O W ~ version 20.07.1
Launching `main.nf` [tiny_cori] - revision: ca139b5b5f
executor > slurm (155)
[92/cfaf35] process > FastqToSam (BioSample24) [100%] 27 of 27 ✔
[59/37e2e8] process > MarkIlluminaAdapters (20_Bi... [100%] 27 of 27 ✔
[b3/cc67d6] process > SamToFastq (20_BioSample24_... [100%] 27 of 27 ✔
[46/9d4a5f] process > bwamem2_index (b73_chr1_150... [100%] 1 of 1 ✔
[e4/ad114c] process > bwamem2_mem (20_BioSample24) [100%] 27 of 27 ✔
[a7/044560] process > CreateSequenceDictionary (b... [100%] 1 of 1 ✔
[50/e81af8] process > samtools_faidx (b73_chr1_15... [100%] 1 of 1 ✔
[67/1c03a5] process > MergeBamAlignment (20_BioSa... [100%] 27 of 27 ✔
[f6/d94e63] process > bedtools_makewindows (b73_c... [100%] 1 of 1 ✔
[49/3a6d6c] process > gatk_HaplotypeCaller (chr1:... [100%] 10 of 10 ✔
[68/e88beb] process > merge_vcf [100%] 1 of 1 ✔
[55/66c4cf] process > vcftools_snp_only (first-ro... [100%] 1 of 1 ✔
[92/7bad2f] process > SortVcf (first-round_merged... [100%] 1 of 1 ✔
[0b/e824cf] process > calc_DPvalue (first-round_m... [100%] 1 of 1 ✔
[a9/33a4f2] process > VariantFiltration (first-ro... [100%] 1 of 1 ✔
[ef/769ed6] process > keep_only_pass (first-round... [100%] 1 of 1 ✔
Completed at: 14-Apr-2021 01:03:12
Duration : 50m 50s
CPU hours : 6.4
Succeeded : 155
```
See example run on Nova HPC - last update: 14 April 2021
Runtime: 1 hour 46 minutes and 17 seconds.
```
$ module load gcc/7.3.0-xegsmw4 nextflow
$ module load singularity
$ nextflow run main.nf \
--genome "test-data/ref/b73_chr1_150000001-151000000.fasta" \
--reads "test-data/fastq/*_{R1,R2}.fastq.gz" \
--queueSize 25 \
-profile slurm,singularity \
-resume
N E X T F L O W ~ version 20.07.1
Launching `main.nf` [condescending_monod] - revision: ca139b5b5f
executor > slurm (155)
[5c/b08536] process > FastqToSam (BioSample04) [100%] 27 of 27 ✔
[88/46d321] process > MarkIlluminaAdapters (27_Bi... [100%] 27 of 27 ✔
[96/200ac5] process > SamToFastq (21_BioSample24_... [100%] 27 of 27 ✔
[4c/8735b5] process > bwamem2_index (b73_chr1_150... [100%] 1 of 1 ✔
[86/06740e] process > bwamem2_mem (21_BioSample24) [100%] 27 of 27 ✔
[c0/3e6521] process > CreateSequenceDictionary (b... [100%] 1 of 1 ✔
[e2/856737] process > samtools_faidx (b73_chr1_15... [100%] 1 of 1 ✔
[25/529408] process > MergeBamAlignment (21_BioSa... [100%] 27 of 27 ✔
[40/ca5ca7] process > bedtools_makewindows (b73_c... [100%] 1 of 1 ✔
[8a/0d6f00] process > gatk_HaplotypeCaller (chr1:... [100%] 10 of 10 ✔
[96/0b957b] process > merge_vcf [100%] 1 of 1 ✔
[96/1f9848] process > vcftools_snp_only (first-ro... [100%] 1 of 1 ✔
[60/edb33d] process > SortVcf (first-round_merged... [100%] 1 of 1 ✔
[b0/372a4e] process > calc_DPvalue (first-round_m... [100%] 1 of 1 ✔
[f3/cfb966] process > VariantFiltration (first-ro... [100%] 1 of 1 ✔
[86/024c8b] process > keep_only_pass (first-round... [100%] 1 of 1 ✔
Completed at: 14-Apr-2021 02:17:48
Duration : 1h 46m 17s
CPU hours : 6.6
Succeeded : 155
```
Example run for RNAseq data
see example run on NOVA hpc - last update: 8 august 2023
runtime: 4 hours 38 minutes and 54 seconds.
```
$ module load gcc/7.3.0-xegsmw4 nextflow
$ module load singularity
$ nextflow run main.nf \
--seq "rna" \
--genome "genome/s_lycopersicum_chromosomes.3.00.fa" \
--reads "01_data/s*.{r1,r2}.fq.gz" \
--queuesize 25 \
--java_options "-xmx80g -xx:+useparallelgc -djava.io.tmpdir=/work/gif3/satheesh/2023_gatk_ms/tmp" \
--gtf "genome/itag3.0_gene_models.gtf" \
-profile slurm \
-resume
N E X T F L O W ~ version 22.10.4
launching `/work/gif3/satheesh/programs/gatk_testing/gatk/main.nf` [chaotic_brahmagupta] dsl2 - revision: d07ccde0d9
executor > slurm (8304)
[c4/48a5ae] process > fastqtosam (subsampled_p7r1) [100%] 1 of 1 ✔
[ee/c542a3] process > markilluminaadapters (1_sub... [100%] 1 of 1 ✔
[f8/a5cd97] process > samtofastq_rna (1_subsample... [100%] 1 of 1 ✔
[8b/c7dafa] process > star_index (s_lycopersicum_... [100%] 1 of 1 ✔
[2c/f96408] process > star_align (1_subsampled_p7r1) [100%] 1 of 1 ✔
[ee/72b0af] process > createsequencedictionary (s... [100%] 1 of 1 ✔
[29/6454c2] process > samtools_faidx (s_lycopersi... [100%] 1 of 1 ✔
[af/2ff081] process > mergebamalignment_rna (1_su... [100%] 1 of 1 ✔
[51/589571] process > markduplicates (1_subsample... [100%] 1 of 1 ✔
[4c/53bb96] process > splitncigarreads (1_subsamp... [100%] 1 of 1 ✔
[ee/79c9ea] process > bedtools_makewindows (s_lyc... [100%] 1 of 1 ✔
[f6/a83bb1] process > gatk_haplotypecaller_rna (s... [100%] 8287 of 8287, fai...
[5c/efa2eb] process > merge_vcf [100%] 1 of 1 ✔
[ec/6b40be] process > vcftools_snp_only (first_ro... [100%] 1 of 1 ✔
[ab/604154] process > sortvcf (first_round_merged... [100%] 1 of 1 ✔
[28/765be4] process > calc_dpvalue (first_round_m... [100%] 1 of 1 ✔
[5b/0b530f] process > variantfiltration (first_ro... [100%] 1 of 1 ✔
[99/878369] process > keep_only_pass (first_round... [100%] 1 of 1 ✔
warn: failed to render execution report -- see the log file for details
warn: failed to render execution timeline -- see the log file for details
completed at: 08-aug-2023 21:37:02
duration : 4h 38m 54s
cpu hours : 6.3
succeeded : 8'303
```
Example run for long read data
see example run on NOVA hpc - last update: 31 October 2023
runtime: 41 minutes and 23 seconds.
```
$ module load gcc/7.3.0-xegsmw4 nextflow
$ module load singularity
$ nextflow run main.nf \
--seq "longread" \
--genome "genome/ecoli_2000001_3000000.fasta" \
--reads "01_Data/CP007799_2000001-3000000.fastq" \
--queueSize 25 \
--java_options "-Xmx80g -XX:+UseParallelGC -Djava.io.tmpdir=/work/gif3/satheesh/2023_GATK_ms/tmp" \
-profile slurm,singularity \
-resume
N E X T F L O W ~ version 22.10.4
launching `/work/gif3/satheesh/programs/gatk_testing/gatk/main.nf` [chaotic_brahmagupta] dsl2 - revision: d07ccde0d9
executor > slurm (21)
[c8/aa3601] process > LONGREAD_VARIANT_CALLING:Cr... [100%] 1 of 1 ✔
[51/b76b1a] process > LONGREAD_VARIANT_CALLING:sa... [100%] 1 of 1 ✔
[be/0cb73f] process > LONGREAD_VARIANT_CALLING:be... [100%] 1 of 1 ✔
[26/4f4559] process > LONGREAD_VARIANT_CALLING:pb... [100%] 1 of 1 ✔
[d1/78ce63] process > LONGREAD_VARIANT_CALLING:pb... [100%] 1 of 1 ✔
[bb/1e7433] process > LONGREAD_VARIANT_CALLING:Ma... [100%] 1 of 1 ✔
[aa/4c9f99] process > LONGREAD_VARIANT_CALLING:ga... [100%] 10 of 10 ✔
[a6/142af0] process > LONGREAD_VARIANT_CALLING:Co... [100%] 1 of 1 ✔
[fa/069ebb] process > LONGREAD_VARIANT_CALLING:Ge... [100%] 1 of 1 ✔
[09/ea1b74] process > LONGREAD_VARIANT_CALLING:ca... [100%] 1 of 1 ✔
[65/cc76dd] process > LONGREAD_VARIANT_CALLING:Va... [100%] 1 of 1 ✔
[77/452275] process > LONGREAD_VARIANT_CALLING:ke... [100%] 1 of 1 ✔
Completed at: 31-Oct-2023 14:39:44
Duration : 41m 23s
CPU hours : 5.7
Succeeded : 21
```