isugifNF / GATK-flow

Nextflow implementation of the GATK HaplotypeCaller pipeline
12 stars 2 forks source link
gatk nextflow

GATK

A Nextflow wrapper for the Genome Analysis Toolkit (GATK), modified from the pipeline described in the Bioinformatic Workbook: GATK Best Practices Workflow for DNA-Seq.

This Nextflow wrapper extends GATK's functionality from DNAseq data, to enabling variant calling from RNA-Seq and long-read sequencing data. It provides a comprehensive solution for diverse sequencing platforms, enhancing variant analysis capabilities.

Installation

You will need a working version of nextflow, see here on how to install nextflow. Nextflow modules are avialable on some of the HPC computing resources.

See modules on HPC clusters ``` # === Nova module load gcc/7.3.0-xegsmw4 nextflow module load singularity NEXTFLOW=nextflow # === Condo module load gcc/7.3.0-xegsmw4 nextflow module load singularity NEXTFLOW=nextflow # === Ceres module load nextflow # singularity already available, no need for module NEXTFLOW=nextflow # === Atlas (will need a local install of nextflow and will need the --account "projectname" flag) module load singularity NEXTFLOW=/project/isu_gif_vrsc/programs/nextflow ```
nextflow run isugifNF/GATK --help
See help statement ``` N E X T F L O W ~ version 20.10.0 Launching `isugifNF/GATK` [big_kare] - revision: ca139b5b5f Usage: The typical command for running the pipeline are as follows: DNAseq: nextflow run main.nf --genome GENOME.fasta --reads "*_{R1,R2}.fastq.gz" --seq "dna" -profile singularity nextflow run main.nf --genome GENOME.fasta --reads_file READ_PATHS.txt --seq "dna" -profile singularity RNAseq: nextflow run main.nf --genome GENOME.fasta --gtf "genes.gtf" --reads "*_{R1,R2}.fastq.gz" --seq "rna" -profile singularity PacBio Long Reads: nextflow run main.nf --genome GENOME.fasta --long_reads "*.fastq.gz" --seq "longread" -profile singularity Mandatory arguments: --seq Specify input sequence type as 'dna', 'rna', or 'longread' [default:'dna']. --genome Reference genome fasta file, against which reads will be mapped to find Variant sites Read input arguments: --reads Paired-end reads in fastq.gz format, will need to specify glob (e.g. "*_{R1,R2}.fastq.gz") --reads_file Text file (tab delimited) with three columns [readname left_fastq.gz right_fastq.gz]. Will need full path for files. --long_reads Long read file in fastq.gz format, will need to specify glob (e.g. "*.fastq.gz") Optional analysis arguments: --invariant Output invariant sites [default:false] --gtf Gene Transfer Format file, only required for RNAseq input [default:false] --window Window size passed to bedtools for parallel GATK Haplotype calls [default:100000] Optional configuration arguments: -profile Configuration profile to use. Can use multiple (comma separated) Available: local, slurm, singularity, docker [default:local] --container_img Container image used for singularity and docker [default:'docker://ghcr.io/aseetharam/gatk:master'] GATK: --gatk_app Link to gatk executable [default: 'gatk'] --java_options Java options for gatk [default:'-Xmx80g -XX:+UseParallelGC -Djava.io.tmpdir=$TMPDIR'] --gatk_cluster_options GATK cluster options [default:'false'] Aligners: --bwamem2_app Link to bwamem2 executable [default: 'bwa-mem2'] --star_app Link to star executable [default: 'STAR'] --star_index_params Parameters for star index [default: ' '] --star_index_params Parameters to pass to STAR index [default:''] --star_index_file Optional: speedup by providing a prebuilt STAR indexed genome [default: 'false'] --pbmm2_app Link to pbmm2 executable [default: 'pbmm2'] Other: --samtools_app Link to samtools executable [default: 'samtools'] --bedtools_app Link to bedtools executable [default: 'bedtools'] --datamash_app Link to datamash executable [default: 'datamash'] --vcftools_app Link to vcftools executable [default: 'vcftools'] Optional other arguments: --outdir Output directory [default:'./GATK_Results'] --threads Threads per process [default:4 for local, 16 for slurm] --queueSize Maximum jobs to submit to slurm [default:40] --account HPC account name for slurm sbatch, atlas and ceres requires this --help Print this help message NOTE: Genome name should not contain any periods before the ".fasta" ```

Singularity Container

Tools required for the workflow are included in the container aseetharam/gatk:latest and should be automatically pulled by nextflow. (Will only need to run singularity pull if website connection is unstable.)

If website connection is unstable, pull singularity and use the `-with-singularity` flag #### To pull the image ``` singularity pull --name gatk.sif shub://aseetharam/gatk:latest ``` #### Link image to Nextflow using the `-with-singularity` flag. ``` nextflow run isugifNF/GATK \ --genome "test-data/ref/b73_chr1_150000001-151000000.fasta" \ --reads "test-data/fastq/*_{R1,R2}.fastq.gz" \ -profile slurm \ -with-singularity gatk.sif ```

Test Dataset

A simple DNAseq test dataset (test-data) is available on ISU Box. This dataset contains a small genome (portion of chr1, B73v5 ), and Illumina short reads for 26 NAM lines (including B73) and B73Ab10 line (27 lines total). Only the reads that map to the region of the v5 genome is included, so that this can be tested quickly. There are examples of multiple files belonging to same NAM line as well as single file per NAM line to make sure both conditions works correctly. The end VCF file should have exactly 27 individuals (lines) in them.

wget https://iastate.box.com/shared/static/wt85l6s4nw4kycm2bo0gpgjq752osatu.gz
tar -xf wt85l6s4nw4kycm2bo0gpgjq752osatu.gz

ls -1 test-data/
#> fastq          # <= folder of paired reads
#> read-group.txt
#> ref            # <= folder containing one genome reference

Running the Pipeline

This pipeline can process DNAseq, RNAseq and PacBio long read data, which require different arguments and different input files. Please jump to the section based off of your input below.

GATK variant calling for DNAseq data

Because this pipeline was initially developed to process DNAseq data, we have provided a test DNAseq dataset. You can fetch the pipeline and the test-data folder.

# Fetch pipeline
nextflow run isugifNF/GATK --help

# Fetch the test-data folder from ISU box
wget https://iastate.box.com/shared/static/wt85l6s4nw4kycm2bo0gpgjq752osatu.gz
tar -xf wt85l6s4nw4kycm2bo0gpgjq752osatu.gz

The general format of a run with the pipeline is to provide a genome file (--genome) and Illumina Paired-End Reads files (--reads or --reads_file).

nextflow run isugifNF/GATK \
  --genome test-data/ref/b73_chr1_150000001-151000000.fasta \
  --reads "test-data/fastq/*_{R1,R2}.fastq.gz" \
  -profile slurm,singularity \
  -resume

or

nextflow run isugifNF/GATK \
  --genome test-data/ref/b73_chr1_150000001-151000000.fasta \
  --reads_file read-path.txt \
  -profile slurm,singularity \
  -resume

If you are on a HPC (Nova/Condo/Ceres/Atlas), it is highly recommend to use the submit_nf.slurm script. The # === Load Modules section will need to be modified to get nextflow and singulariy running.

See Module Changes ``` # === Nova module load gcc/7.3.0-xegsmw4 nextflow module load singularity NEXTFLOW=nextflow # === Condo module load gcc/7.3.0-xegsmw4 nextflow module load singularity NEXTFLOW=nextflow # === Ceres module load nextflow # singularity already available, no need for module NEXTFLOW=nextflow # === Atlas module load singularity NEXTFLOW=/project/isu_gif_vrsc/programs/nextflow ```

Using the --reads_file

Instead of using a pattern to specify paired reads ("test-data/fastq/*_{R1,R2}.fastq.gz"), we can use a tab-delimited textfile to specify the path to left and right files. The textfile should contain three columns: readname, left read path, right read path.

In our case, for the test-data run the following one-liner to generate a read-path.txt.

for f in test-data/fastq/*_R1.fastq.gz; do echo -e "$(basename $f |cut -f 1 -d "_")\t$(realpath $f)\t$(realpath $f | sed 's/_R1.fastq.gz/_R2.fastq.gz/g')"; done > read-path.txt
See example read-path.txt ``` BioSample01 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample01_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample01_R2.fastq.gz BioSample02 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample02_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample02_R2.fastq.gz BioSample03 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample03_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample03_R2.fastq.gz BioSample04 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample04_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample04_R2.fastq.gz BioSample05 /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample05_R1.fastq.gz /Users/jenchang/Maize_WGS_Build/test-data/fastq/BioSample05_R2.fastq.gz ```

DNAseq final output

The final output will be in a results folder. SNPs will be in the VCF file, probably the file with the longest name (e.g. first-round_merged_snps-only_snp-only.pass-only.vcf).

results/
  |_ 01_MarkAdapters/            #<= folders contain intermediate files
  |_ 02_MapReads/
  |_ 03_PrepGATK/
  |_ 04_GATK/
  |_ 05_FilterSNPs/
  |  |_ first-round_merged_snps-only_sorted_snp-only.pass-only.vcf     #<= final SNP file
  |
  |_ report.html
  |_ timeline.html               # <= runtime information for all processes

GATK variant calling for RNAseq

In order to process RNAseq, we use the STAR aligner. To use the RNAseq variant calling pipeline, make the following parameter changes.

nextflow run isugifNF/GATK \
  --seq "rna" \
  --genome "genome/genome.fa" \
  --reads "data/*.{r1,r2}.fq.gz" \
  --gtf "genome/genome.gtf" \
  -profile slurm,singularity \
  -resume

RNAseq final output

The final output will be in a results folder. SNPs will be in the VCF file, probably the file with the longest name (e.g. first-round_merged_snps-only_snp-only.pass-only.vcf).

results/
  |_ 01_MarkAdapters/            #<= folders contain intermediate files
  |_ 02_MapReads/
  |_ 03_PrepGATK/
  |_ 04_GATK/
  |_ 05_FilterSNPs/
  |  |_ first-round_merged_snps-only_sorted_snp-only.pass-only.vcf     #<= final SNP file
  |
  |_ report.html
  |_ timeline.html               # <= runtime information for all processes

GATK variant calling for long read data

In order to process long read data obtained from the PacBio platform, we use the pbmm2 aligner. To use the long read variant calling pipeline, make the following parameter changes.

nextflow run isugifNF/GATK \
  --seq "longread" \
  --genome "genome/genome.fa" \
  --reads "data/long-reads.fastq" \
  -profile slurm,singularity \
  -resume

Long read final output

The final output will be in a results folder. SNPs will be in the VCF file, probably the file with the longest name (e.g. output_snp-only.pass-only.vcf).

results/
  |_ 02_MapReads/
  |_ 03_PrepGATK/
  |_ 04_GATK/
  |_ 05_FilterSNPs/
  |  |_ output_snp-only.pass-only.vcf     #<= final SNP file
  |
  |_ report.html
  |_ timeline.html               # <= runtime information for all processes

Example Runs

Some example runs provided to show nextflow output.

Example runs for DNAseq data

See example run on Ceres HPC - last update: 14 April 2021 Runtime: 1 hour 9 minutes and 27 seconds. ``` $ module load nextflow $ nextflow run main.nf \ --genome "test-data/ref/b73_chr1_150000001-151000000.fasta" \ --reads "test-data/fastq/*_{R1,R2}.fastq.gz" \ --queueSize 25 \ -profile slurm,singularity \ -resume N E X T F L O W ~ version 20.07.1 Launching `main.nf` [exotic_poincare] - revision: ca139b5b5f executor > slurm (155) [8c/e6342a] process > FastqToSam (BioSample26) [100%] 27 of 27 ✔ [6d/13aa51] process > MarkIlluminaAdapters (27_Bi... [100%] 27 of 27 ✔ [dc/032273] process > SamToFastq (20_BioSample24_... [100%] 27 of 27 ✔ [0e/a8d488] process > bwamem2_index (b73_chr1_150... [100%] 1 of 1 ✔ [2f/b3746a] process > bwamem2_mem (20_BioSample24) [100%] 27 of 27 ✔ [bc/8e43cc] process > CreateSequenceDictionary (b... [100%] 1 of 1 ✔ [fb/c2b500] process > samtools_faidx (b73_chr1_15... [100%] 1 of 1 ✔ [64/b1241a] process > MergeBamAlignment (20_BioSa... [100%] 27 of 27 ✔ [16/ea1c05] process > bedtools_makewindows (b73_c... [100%] 1 of 1 ✔ [34/14a2e1] process > gatk_HaplotypeCaller (chr1:... [100%] 10 of 10 ✔ [a2/6d9b6d] process > merge_vcf [100%] 1 of 1 ✔ [dd/187a85] process > vcftools_snp_only (first-ro... [100%] 1 of 1 ✔ [cc/367a9a] process > SortVcf (first-round_merged... [100%] 1 of 1 ✔ [ef/102130] process > calc_DPvalue (first-round_m... [100%] 1 of 1 ✔ [8f/281023] process > VariantFiltration (first-ro... [100%] 1 of 1 ✔ [fc/5d8e7c] process > keep_only_pass (first-round... [100%] 1 of 1 ✔ Completed at: 14-Apr-2021 01:51:21 Duration : 1h 9m 27s CPU hours : 7.3 Succeeded : 155 ```
See example run on Atlas HPC - last update: 14 April 2021 Runtime: 50 minutes and 50 seconds. ``` $ module load singularity $ NEXTFLOW=/project/isu_gif_vrsc/programs/nextflow $ $NEXTFLOW run main.nf \ --genome "test-data/ref/b73_chr1_150000001-151000000.fasta" \ --reads "test-data/fastq/*_{R1,R2}.fastq.gz" \ --queueSize 50 \ --account isu_gif_vrsc \ -profile slurm,singularity \ -resume N E X T F L O W ~ version 20.07.1 Launching `main.nf` [tiny_cori] - revision: ca139b5b5f executor > slurm (155) [92/cfaf35] process > FastqToSam (BioSample24) [100%] 27 of 27 ✔ [59/37e2e8] process > MarkIlluminaAdapters (20_Bi... [100%] 27 of 27 ✔ [b3/cc67d6] process > SamToFastq (20_BioSample24_... [100%] 27 of 27 ✔ [46/9d4a5f] process > bwamem2_index (b73_chr1_150... [100%] 1 of 1 ✔ [e4/ad114c] process > bwamem2_mem (20_BioSample24) [100%] 27 of 27 ✔ [a7/044560] process > CreateSequenceDictionary (b... [100%] 1 of 1 ✔ [50/e81af8] process > samtools_faidx (b73_chr1_15... [100%] 1 of 1 ✔ [67/1c03a5] process > MergeBamAlignment (20_BioSa... [100%] 27 of 27 ✔ [f6/d94e63] process > bedtools_makewindows (b73_c... [100%] 1 of 1 ✔ [49/3a6d6c] process > gatk_HaplotypeCaller (chr1:... [100%] 10 of 10 ✔ [68/e88beb] process > merge_vcf [100%] 1 of 1 ✔ [55/66c4cf] process > vcftools_snp_only (first-ro... [100%] 1 of 1 ✔ [92/7bad2f] process > SortVcf (first-round_merged... [100%] 1 of 1 ✔ [0b/e824cf] process > calc_DPvalue (first-round_m... [100%] 1 of 1 ✔ [a9/33a4f2] process > VariantFiltration (first-ro... [100%] 1 of 1 ✔ [ef/769ed6] process > keep_only_pass (first-round... [100%] 1 of 1 ✔ Completed at: 14-Apr-2021 01:03:12 Duration : 50m 50s CPU hours : 6.4 Succeeded : 155 ```
See example run on Nova HPC - last update: 14 April 2021 Runtime: 1 hour 46 minutes and 17 seconds. ``` $ module load gcc/7.3.0-xegsmw4 nextflow $ module load singularity $ nextflow run main.nf \ --genome "test-data/ref/b73_chr1_150000001-151000000.fasta" \ --reads "test-data/fastq/*_{R1,R2}.fastq.gz" \ --queueSize 25 \ -profile slurm,singularity \ -resume N E X T F L O W ~ version 20.07.1 Launching `main.nf` [condescending_monod] - revision: ca139b5b5f executor > slurm (155) [5c/b08536] process > FastqToSam (BioSample04) [100%] 27 of 27 ✔ [88/46d321] process > MarkIlluminaAdapters (27_Bi... [100%] 27 of 27 ✔ [96/200ac5] process > SamToFastq (21_BioSample24_... [100%] 27 of 27 ✔ [4c/8735b5] process > bwamem2_index (b73_chr1_150... [100%] 1 of 1 ✔ [86/06740e] process > bwamem2_mem (21_BioSample24) [100%] 27 of 27 ✔ [c0/3e6521] process > CreateSequenceDictionary (b... [100%] 1 of 1 ✔ [e2/856737] process > samtools_faidx (b73_chr1_15... [100%] 1 of 1 ✔ [25/529408] process > MergeBamAlignment (21_BioSa... [100%] 27 of 27 ✔ [40/ca5ca7] process > bedtools_makewindows (b73_c... [100%] 1 of 1 ✔ [8a/0d6f00] process > gatk_HaplotypeCaller (chr1:... [100%] 10 of 10 ✔ [96/0b957b] process > merge_vcf [100%] 1 of 1 ✔ [96/1f9848] process > vcftools_snp_only (first-ro... [100%] 1 of 1 ✔ [60/edb33d] process > SortVcf (first-round_merged... [100%] 1 of 1 ✔ [b0/372a4e] process > calc_DPvalue (first-round_m... [100%] 1 of 1 ✔ [f3/cfb966] process > VariantFiltration (first-ro... [100%] 1 of 1 ✔ [86/024c8b] process > keep_only_pass (first-round... [100%] 1 of 1 ✔ Completed at: 14-Apr-2021 02:17:48 Duration : 1h 46m 17s CPU hours : 6.6 Succeeded : 155 ```

Example run for RNAseq data

see example run on NOVA hpc - last update: 8 august 2023 runtime: 4 hours 38 minutes and 54 seconds. ``` $ module load gcc/7.3.0-xegsmw4 nextflow $ module load singularity $ nextflow run main.nf \ --seq "rna" \ --genome "genome/s_lycopersicum_chromosomes.3.00.fa" \ --reads "01_data/s*.{r1,r2}.fq.gz" \ --queuesize 25 \ --java_options "-xmx80g -xx:+useparallelgc -djava.io.tmpdir=/work/gif3/satheesh/2023_gatk_ms/tmp" \ --gtf "genome/itag3.0_gene_models.gtf" \ -profile slurm \ -resume N E X T F L O W ~ version 22.10.4 launching `/work/gif3/satheesh/programs/gatk_testing/gatk/main.nf` [chaotic_brahmagupta] dsl2 - revision: d07ccde0d9 executor > slurm (8304) [c4/48a5ae] process > fastqtosam (subsampled_p7r1) [100%] 1 of 1 ✔ [ee/c542a3] process > markilluminaadapters (1_sub... [100%] 1 of 1 ✔ [f8/a5cd97] process > samtofastq_rna (1_subsample... [100%] 1 of 1 ✔ [8b/c7dafa] process > star_index (s_lycopersicum_... [100%] 1 of 1 ✔ [2c/f96408] process > star_align (1_subsampled_p7r1) [100%] 1 of 1 ✔ [ee/72b0af] process > createsequencedictionary (s... [100%] 1 of 1 ✔ [29/6454c2] process > samtools_faidx (s_lycopersi... [100%] 1 of 1 ✔ [af/2ff081] process > mergebamalignment_rna (1_su... [100%] 1 of 1 ✔ [51/589571] process > markduplicates (1_subsample... [100%] 1 of 1 ✔ [4c/53bb96] process > splitncigarreads (1_subsamp... [100%] 1 of 1 ✔ [ee/79c9ea] process > bedtools_makewindows (s_lyc... [100%] 1 of 1 ✔ [f6/a83bb1] process > gatk_haplotypecaller_rna (s... [100%] 8287 of 8287, fai... [5c/efa2eb] process > merge_vcf [100%] 1 of 1 ✔ [ec/6b40be] process > vcftools_snp_only (first_ro... [100%] 1 of 1 ✔ [ab/604154] process > sortvcf (first_round_merged... [100%] 1 of 1 ✔ [28/765be4] process > calc_dpvalue (first_round_m... [100%] 1 of 1 ✔ [5b/0b530f] process > variantfiltration (first_ro... [100%] 1 of 1 ✔ [99/878369] process > keep_only_pass (first_round... [100%] 1 of 1 ✔ warn: failed to render execution report -- see the log file for details warn: failed to render execution timeline -- see the log file for details completed at: 08-aug-2023 21:37:02 duration : 4h 38m 54s cpu hours : 6.3 succeeded : 8'303 ```

Example run for long read data

see example run on NOVA hpc - last update: 31 October 2023 runtime: 41 minutes and 23 seconds. ``` $ module load gcc/7.3.0-xegsmw4 nextflow $ module load singularity $ nextflow run main.nf \ --seq "longread" \ --genome "genome/ecoli_2000001_3000000.fasta" \ --reads "01_Data/CP007799_2000001-3000000.fastq" \ --queueSize 25 \ --java_options "-Xmx80g -XX:+UseParallelGC -Djava.io.tmpdir=/work/gif3/satheesh/2023_GATK_ms/tmp" \ -profile slurm,singularity \ -resume N E X T F L O W ~ version 22.10.4 launching `/work/gif3/satheesh/programs/gatk_testing/gatk/main.nf` [chaotic_brahmagupta] dsl2 - revision: d07ccde0d9 executor > slurm (21) [c8/aa3601] process > LONGREAD_VARIANT_CALLING:Cr... [100%] 1 of 1 ✔ [51/b76b1a] process > LONGREAD_VARIANT_CALLING:sa... [100%] 1 of 1 ✔ [be/0cb73f] process > LONGREAD_VARIANT_CALLING:be... [100%] 1 of 1 ✔ [26/4f4559] process > LONGREAD_VARIANT_CALLING:pb... [100%] 1 of 1 ✔ [d1/78ce63] process > LONGREAD_VARIANT_CALLING:pb... [100%] 1 of 1 ✔ [bb/1e7433] process > LONGREAD_VARIANT_CALLING:Ma... [100%] 1 of 1 ✔ [aa/4c9f99] process > LONGREAD_VARIANT_CALLING:ga... [100%] 10 of 10 ✔ [a6/142af0] process > LONGREAD_VARIANT_CALLING:Co... [100%] 1 of 1 ✔ [fa/069ebb] process > LONGREAD_VARIANT_CALLING:Ge... [100%] 1 of 1 ✔ [09/ea1b74] process > LONGREAD_VARIANT_CALLING:ca... [100%] 1 of 1 ✔ [65/cc76dd] process > LONGREAD_VARIANT_CALLING:Va... [100%] 1 of 1 ✔ [77/452275] process > LONGREAD_VARIANT_CALLING:ke... [100%] 1 of 1 ✔ Completed at: 31-Oct-2023 14:39:44 Duration : 41m 23s CPU hours : 5.7 Succeeded : 21 ```