JEKESA

Jekesa is an automated bacterial whole genome assembly and typing pipeline that primarily uses Illumina paired-end whole genome sequencing (WGS) data. In addition, Jekesa performs extensive analyses for Escherichia coli, Salmonella, Streptococcus pneumoniae and Streptococcus pyogenes (Group A Streptococcus), as well as in-depth virulence predictions for several other pathogens (refer to the sections below). Jekesa also performs whole-genome reference-free alignments, pairwise SNP-site analysis and clustering, and generates a neighbor-joining tree which can be easily visualized using e.g. Microreact.

Pipeline overview

Jekesa (Illuminate) currently runs on a server (single compute node). The pipeline is written in Bash, R, and R Markdown, and generates the results report as an Excel workbook (.xlsx format) and in HTML format.

De novo genome assembly and classification

QC and read filtering using FastQC and trim_galore.
Species identification and closest reference detection using Bactinspector.
Contamination check using ConFindr and kraken2 (with the MiniKraken2_v2_8GB database).
De novo assembly using either SKESA, SPAdes, MEGAHIT, or velvet as implemented in Shovill.
Generation of assembly metrics using QUAST.

MLST typing

Multi-locus sequence typing based on assembled contigs using mlst and PubMLST database.

Resistance profiling

Detection of acquired AMR genes, chromosomal mutations, and their associated resistance phenotypes using resfinder, AMRFinderPlus and pointfinder.
Optionally, known and novel variants in antimicrobial resistance genes are predicted from clean reads using ariba and either the CARD (The Comprehensive Antibiotic Resistance Database) or the resfinder database.

Virulence gene prediction

Virulence genes detected using AMRFinderPlus.
In addition, in-depth virulence gene detection for specific pathogens such as E. coli, E. faecalis, E. faecium, S. aureus and L. monocytogenes is performed using VirulenceFinder.
Optionally, detection of variants (known/novel) in virulence factor genes, from cleaned reads, using ariba and the VFDB. ARIBA can be activated by uncommenting the ARIBA-specific scripts in the main JEKESA script.

Plasmid detection

Coming soon

Escherichia coli specific analysis

Serotyping using SerotypeFinder.

Salmonella enterica specific analysis

Serotyping using both SISTR and SeqSero2.

Streptococcus pneumoniae specific analysis

Serotyping using seroba.
Pili detection based on reference sequences used in Nakano et al., 2018.
PBP gene typing and MIC profiling using CDC Streptococcus Lab SPN scripts and sequence databases.
Calculate core and accessory distances and cluster genomes (assigning global pneumococcal sequence clusters; GPSCs) using PopPUNK, as well as assign new genomes to clusters.

Streptococcus pyogenes specific analysis

EMM typing and MIC profiling using CDC Streptococcus Lab GAS scripts and sequence databases.
Calculate core and accessory distances and cluster/define genomes/strains using PopPUNK, as well as assign new genomes to clusters.

Reference-free alignments, pairwise SNP differences, and neighbor-joining tree construction

Reference-free alignments are performed using SKA. In addition, SKA distance is used to calculate pairwise SNP differences between samples and assign SNP-based clusters.
The generated variant alignments are used to generate a neighbor-joining tree using rapidNJ with 1000 bootstrap replicates.
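
To make the idea of a pairwise SNP difference concrete, here is a minimal sketch that counts differing sites between every pair of sequences in an aligned FASTA file. It is an illustration only and is not how SKA computes distances (SKA compares split k-mers rather than scanning an alignment). The function name pairwise_snps is hypothetical, and skipping gaps and ambiguous bases is an assumption made for this example.

```shell
#!/usr/bin/env bash
# Illustrative only: count pairwise differences in an aligned FASTA file.
# This is NOT SKA's algorithm; the function name is hypothetical.

pairwise_snps() {
  awk '
    # Header line: remember the sample name and the order it appeared in.
    /^>/ { name = substr($0, 2); order[++n] = name; next }
    # Sequence line: append (sequences may span several lines).
    { seq[name] = seq[name] toupper($0) }
    END {
      for (i = 1; i < n; i++) {
        for (j = i + 1; j <= n; j++) {
          a = seq[order[i]]; b = seq[order[j]]; d = 0
          la = length(a); lb = length(b)
          len = (la < lb) ? la : lb
          for (k = 1; k <= len; k++) {
            x = substr(a, k, 1); y = substr(b, k, 1)
            # Assumption: gaps and ambiguous bases are skipped, not counted.
            if (x ~ /[ACGT]/ && y ~ /[ACGT]/ && x != y) d++
          }
          printf "%s\t%s\t%d\n", order[i], order[j], d
        }
      }
    }
  ' "$1"
}
```

Calling pairwise_snps on an alignment file prints one tab-separated line per sample pair (sample1, sample2, number of differing sites). Samples could then be grouped by applying a SNP threshold to the third column.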

Output and reporting

All results will be stored in Results-ProjectName, including:

The final report named ProjectName-WGS-typing-report.xlsx
Results from each step of the analysis in .xlsx format
Neighbor-joining tree file (and associated files) generated using PopPUNK.
Subfolders containing:
- assembled-contigs
- additional results from SKA.
- additional reports from ARIBA, including files for generating trees showing clustering of samples based on detected variants
- MultiQC reports for visualization of quality control reports, pre- and post- filtering of sequence reads.
Detailed HTML report generated using rmarkdown

Usage

usage: jekesa <options>

OPTIONS:
        -p      Path to output directory or project name
        -a      Select the assembler to use. Options available: 'spades', 'skesa', 'velvet', 'megahit'
        -s      Species scheme name to use for mlst typing.
                Use: 'spneumoniae' or 'spyogenes' or 'senterica', for streptococcus pneumoniae or streptococcus pyogenes or salmonella
                detailed analysis. Otherwise for any other schema use: 'other'. To check other available schema names use: mlst --longList.
        -t      Number of threads to use <integer>, (minimum value should be: 6)
        -g      Only perform de novo assembly
        -c      Path to assembled contigs to include in the typing analysis (only mlst and resistance profiling).
        -h      Show this help
        -v      Show version
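
The constraints listed in the options above (four supported assemblers, at least 6 threads) can be checked before launching a long run. The following is a minimal sketch of such a pre-flight check. The function check_jekesa_args is hypothetical and is not part of jekesa; it only encodes the values stated in the usage text.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of jekesa itself.
# Allowed values are taken from the usage text above.

check_jekesa_args() {
  local assembler="$1" threads="$2"

  # The assembler must be one of the four options listed for -a.
  case "$assembler" in
    spades|skesa|velvet|megahit) ;;
    *) echo "unknown assembler: $assembler" >&2; return 1 ;;
  esac

  # The thread count must consist of digits only.
  case "$threads" in
    ''|*[!0-9]*) echo "threads must be an integer: $threads" >&2; return 1 ;;
  esac

  # The usage text states a minimum of 6 threads.
  if [ "$threads" -lt 6 ]; then
    echo "threads must be at least 6: $threads" >&2
    return 1
  fi
  return 0
}
```

For example, `check_jekesa_args skesa 16 && jekesa -p path/to/analysis/directory -a skesa -s spyogenes -t 16` would only start the pipeline when both values pass.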

Example

cd jekesa
# This script creates the analysis directory and soft-links the fastq files
bin/find-link-fastq.sh path/to/analysis/directory path/to/sampleID/list path/to/raw/fastqfiles

# Now run the jekesa pipeline
conda activate jekesa
jekesa -p path/to/analysis/directory -a skesa -s spyogenes -t 16 &

Installation

Clone the git repository:
git clone https://github.com/stanikae/jekesa.git
cd jekesa

After cloning the jekesa git repo, do the following to install the required dependencies and to set up the conda environment:

# JEKESA
wget -P lib https://anaconda.org/stanikae/jekesa/2021.01.15.141403/download/jekesa_v1.0.yml
conda env create -n jekesa --file ./lib/jekesa_v1.0.yml

Installation of dependencies

1. R packages

wget -P lib https://anaconda.org/stanikae/r_env/2021.01.15.141706/download/jekesa-v1.0_r_env.yml
conda env create -n r_env --file ./lib/jekesa-v1.0_r_env.yml

2. CGE tools

## ResFinder4 
wget -P lib https://anaconda.org/stanikae/resfinder/2021.06.18.105709/download/jekesa-v1.0_cge.yml
conda env create -n resfinder --file ./lib/jekesa-v1.0_cge.yml

## Other CGE tools
wget -P lib https://anaconda.org/stanikae/cge/2021.06.18.111232/download/jekesa-v1.0_resfinder4.yml
conda env create -n cge --file ./lib/jekesa-v1.0_resfinder4.yml

3. srst2 env (For CDC StrepLab scripts)

wget -P lib https://anaconda.org/stanikae/srst2/2021.06.18.115358/download/jekesa-v1.0_srst2.yml
conda env create -n srst2 --file ./lib/jekesa-v1.0_srst2.yml
conda activate srst2
pip install spn_scripts/srst2_env/
conda deactivate

## Activate jekesa
conda activate jekesa
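
After the steps above, five conda environments should exist: jekesa, r_env, resfinder, cge and srst2. A minimal sketch for confirming this is shown below. The function missing_envs is hypothetical; it reads the output of `conda env list` on standard input and assumes the environment name is the first whitespace-separated field of each non-comment line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of expected environments that are
# absent from `conda env list`-style text supplied on stdin.

missing_envs() {
  local listing env
  listing="$(cat)"
  for env in jekesa r_env resfinder cge srst2; do
    # Assumption: the env name is the first field of each non-comment line.
    if ! printf '%s\n' "$listing" |
         awk -v e="$env" '!/^#/ && $1 == e { found = 1 } END { exit !found }'; then
      echo "$env"
    fi
  done
}
```

Running `conda env list | missing_envs` prints nothing when all five environments are present, and otherwise prints one missing environment name per line.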

If you already have jekesa installed, you can upgrade as follows:

cd jekesa
git pull
wget -P lib https://anaconda.org/stanikae/jekesa/2021.01.15.141403/download/jekesa_v1.0.yml
conda env update -n jekesa --file ./lib/jekesa_v1.0.yml --prune

Setting up required databases

To download and set up the required databases, execute the 00.download_databases.sh script:

cd jekesa
conda activate jekesa
bash bin/00.download_databases.sh /path/to/installation/directory

ConFindr databases

To set up ConFindr databases kindly follow instructions here: https://olc-bioinformatics.github.io/ConFindr/install/ as this requires registration on PubMLST.

To deactivate jekesa (at the end of the analysis)

conda deactivate

Author

Stanford Kwenda

License

GPL 3.0

Citation

Kwenda S., Allam M., Khumalo Z.T.H., Mtshali S., Mnyameni F., Ismail A. Jekesa: an automated easy-to-use pipeline for bacterial whole genome typing. GitHub. https://github.com/stanikae/jekesa
