In silico evaluation of variant calling methods for bacterial whole genome sequencing assays

Abstract

Identification and analysis of clinically relevant strains of bacteria increasingly relies on whole genome sequencing. The downstream bioinformatics steps necessary for calling variants from short read sequences are well-established but seldom validated against haploid genomes. We devised an in silico workflow to introduce single nucleotide polymorphisms (SNPs) and indels into bacterial reference genomes, and computationally generate sequencing reads based on the mutated genomes. We then applied the method to Mycobacterium tuberculosis H37Rv, Staphylococcus aureus NCTC 8325, and Klebsiella pneumoniae HS11286, and used the synthetic reads as truth sets for evaluating several popular variant callers. Insertions proved especially challenging for most variant callers to correctly identify, relative to deletions and SNPs. With adequate read depth, however, variant callers that use high quality soft-clipped reads and base mismatches to perform local realignment consistently had the highest precision and recall in identifying insertions and deletions ranging from 1-50 bp. The remaining variant callers had lower recall values associated with identification of insertions greater than 20 bp.
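
The idea of mutating a reference and scoring callers against the resulting truth set can be illustrated with a small sketch. This is not the pipeline's code: the function names introduce_snps and precision_recall are hypothetical, only SNPs are handled (indels are omitted for brevity), and variants are matched by exact position, reference base, and alternate base.

```python
import random

BASES = "ACGT"


def introduce_snps(seq, n_snps, seed=0):
    """Substitute n_snps random positions in seq (hypothetical helper).

    Returns the mutated sequence and a truth set of
    (0-based position, ref base, alt base) tuples.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(seq)), n_snps)
    mutated = list(seq)
    truth = set()
    for pos in positions:
        ref = seq[pos]
        # Pick any base other than the reference base.
        alt = rng.choice([b for b in BASES if b != ref])
        mutated[pos] = alt
        truth.add((pos, ref, alt))
    return "".join(mutated), truth


def precision_recall(truth, called):
    """Score a set of called variants against the truth set."""
    tp = len(truth & called)   # variants found and correct
    fp = len(called - truth)   # variants called but not introduced
    fn = len(truth - called)   # introduced variants that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    reference = "ACGT" * 25
    mutated, truth = introduce_snps(reference, n_snps=5, seed=1)
    # Pretend a caller recovered every introduced SNP.
    print(precision_recall(truth, set(truth)))
```

In the actual workflow, reads are simulated from the mutated genome and the callers' VCF output is compared with the introduced variants; the sketch only shows the bookkeeping behind precision and recall.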

Pipeline

Create virtual environment

Set up the virtual environment: ./create_venv.sh
Activate the virtual environment: source REPO_NAME-env/bin/activate

Dependencies and Containers

The pipeline relies on SingularityCE to use and manage containerized dependencies.
The containerized tools used by the pipeline are automatically pulled and cached from the links listed in the [images] section of configs/settings.conf.
Docker containers are pulled from Docker Hub while Singularity containers are from the author's Sylabs Cloud Library. Definition files for the Singularity containers are provided in this repo.
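
As a rough illustration of how an [images] section could be inspected, the sketch below parses an INI-style string with Python's configparser and labels each image by its URI scheme. It assumes configs/settings.conf is INI-formatted; the keys and image URIs shown are made up for the example and are not the entries shipped in this repo. The docker:// and library:// prefixes are the standard Singularity schemes for Docker Hub and the Sylabs Cloud Library.

```python
import configparser

# Hypothetical example content; the real keys and URIs live in
# configs/settings.conf and will differ.
EXAMPLE_CONF = """
[images]
bwa = docker://example/bwa:0.7.17
gatk = library://example/collection/gatk:4.2
"""


def load_images(text):
    """Return the [images] section as a dict of name -> URI."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["images"])


def image_source(uri):
    """Classify an image URI by its Singularity scheme."""
    if uri.startswith("docker://"):
        return "Docker Hub"
    if uri.startswith("library://"):
        return "Sylabs Cloud Library"
    return "unknown"


if __name__ == "__main__":
    for name, uri in load_images(EXAMPLE_CONF).items():
        print(f"{name}: {uri} ({image_source(uri)})")
```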

Configure the SCons pipeline

Configure synthetic variant parameters by editing the values in configs/variants_settings.conf. These settings will be logged in logs/variants/ once the pipeline is initiated.
Configure other parameters, including variant name, by editing the values in configs/settings.conf.

Run the SCons pipeline

Dry run: scons -n
Run: scons
Debug run: scons --debug=explain
Variant callers included in the current pipeline: GATK HaplotypeCaller, bcftools, FreeBayes, DiscoSnp, DeepVariant, VarDict, Lancet, and Octopus.
A minimal reference genome is provided for testing in data/H37Rv-small.fa.
Note that DiscoSnp and Lancet require reference reads and BAM files to call variants. The following example reference files are generated by the pipeline for the minimal reference genome:
- output/H37Rv-small/deduped_mq.bam
- output/H37Rv-small/deduped_mq.bam.bai
- output/H37Rv-small/R1.trimmed.fq.gz
- output/H37Rv-small/R2.trimmed.fq.gz
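
Before enabling DiscoSnp and Lancet, it may help to confirm that these reference files exist. The following is a minimal sketch, not part of the pipeline; the helper name missing_reference_files is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path

# File names from the example list for the minimal reference genome.
EXPECTED_FILES = [
    "deduped_mq.bam",
    "deduped_mq.bam.bai",
    "R1.trimmed.fq.gz",
    "R2.trimmed.fq.gz",
]


def missing_reference_files(output_dir):
    """Return the expected reference files not present in output_dir."""
    base = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_reference_files("output/H37Rv-small")
    if missing:
        print("Missing reference files:", ", ".join(missing))
    else:
        print("All reference files are present.")
```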

This repo is provided as part of data management and record-keeping supporting this publication, DOI: https://doi.org/10.1128/jcm.01842-22. Support for issues with setting up and running the pipeline is therefore expected to be minimal.

molmicdx / mtb-pipeline
