hoelzer-lab / ribap

A comprehensive bacterial core gene-set annotation pipeline based on Roary and pairwise ILPs
GNU General Public License v3.0
25 stars 4 forks source link

RIBAP

Roary ILP Bacterial core Annotation Pipeline.

Twitter Follow Twitter Follow

Annotate genes in your bacterial genomes with Prokka and determine a pangenome with the great Roary. The initial gene clusters found by Roary are further refined with the usage of ILPs that solve the best matching for each pairwise strain MMseqs2 comparison.

  1. Overview
  2. Quick start
  3. Execution examples
  4. Example output
  5. Install
  6. Runtime and disk space
  7. Limitations
  8. Publication
  9. References

Overview

A common task when you have a bunch of bacterial genomes at hand is calculating a core gene set or pangenome. We want to know, which genes are homologs and shared between a set of bacteria. However, defining homology only based an sequence similarity often underestimates the true core gene set, particularly when diverse species are compared. RIBAP combines sequence homology information from Roary with smart pairwise ILP calculations to produce a more complete core gene set - even on genus level. First, RIBAP performs annotations with Prokka, calculates the core gene set using Roary and pairwise ILPs, and finally visualizes the results in an interactive HTML table garnished with protein multiple sequence alignments and trees. RIBAP comes with Nextflow and Docker/Singularity/Conda options to resolve all necessary software dependencies for easy execution.

chart

Quick start

You just need a working nextflow and docker or singularity or conda installation, see below! We recommand the usage of containers! You have nextflow and docker? Give it a try:

nextflow pull hoelzer-lab/ribap
nextflow info hoelzer-lab/ribap
# select a release version, e.g. 1.0.0
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta "$HOME/.nextflow/assets/hoelzer-lab/ribap/data/*.fasta" -profile local,docker

You have nextflow and conda? Okay, just change the profile:

nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta "$HOME/.nextflow/assets/hoelzer-lab/ribap/data/*.fasta" -profile local,conda

You need some of these dependencies? Check the end of this README!

Execution examples

# Get or update the workflow:
nextflow pull hoelzer-lab/ribap

# Run a specific release, check whats available:
nextflow info hoelzer-lab/ribap

# Or get newest release version automatically and show the help:
REVISION=$(nextflow info hoelzer-lab/ribap | sed 's/ [*]//' | sed 's/ //g' | sed 's/\[t\]//g' | awk 'BEGIN{FS=" "};{if($0 ~ /^ *[0-9]/){print $0}}' | sort -Vr | head -1)
nextflow run hoelzer-lab/ribap -r $REVISION --help

# Run with IQ-TREE tree calculation using all identified core genes and specified output dir. 
# Attention: less than 1000 bootstrap replicates (default) can not be used.
# For intermediate files, a work dir via -w is defined.
# Note, that Nextflow build-in parameters use a single dash "-" symbol.   
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --tree --bootstrap 1000 --outdir ~/ribap -w ribap-work

# The ILPs can take a lot (!) of space. But you can use this flag to keep them anyway. Use with caution!
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --outdir ~/ribap -w ribap-work --keepILPs

# Run with optional reference GenBank file to guide Prokka annotation.
# ATTENTION: this will use the additional reference file for every input genome!
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --reference GCF_000007205.1_ASM720v1_genomic.gbff --outdir ~/ribap -w work

# Use list parameter to provide genome FASTAs and corresponding reference GenBank files in CSV format.
nextflow run hoelzer-lab/ribap -r $REVISION --list --fasta genomes.csv --reference refs.csv --outdir ~/ribap -w work

# genomes.csv:
#
# genome1,genome1.fasta
# genome2,genome2.fasta
# genome3,genome3.fasta

# refs.csv
#
# genome1,ref.gbff
# genome2,ref.gbff
# 
# Here, genome1 and genome2 will additionally use information from ref.gbff in Prokka annotation while genome3 will be annotated w/o additional reference information.

Example output

RIBAP stores all important output in --output results. Besides CDS annotations, alignments, trees, and the core gene groups one of the main output files is an interactive HTML table that summarizes the identified RIBAP groups and can be explored by the user:

Further output examples for a small test set composed of five Chlamydia genomes can be found here.

Installation

Attention Use the below installation examples with causion and check you system requirements. Its best to refer to the specific manual of each software.

Nextflow

Needed in all cases (conda, docker, singularity)

sudo apt-get update
sudo apt install -y default-jre
curl -s https://get.nextflow.io | bash 
sudo mv nextflow /bin/

Using Conda

Just copy the commands and follow the installation instructions. Let the installer configure conda for you. You need to specify -profile conda to run the pipeline with conda support. Note, internally the pipeline will switch to mamba to faser resolve all environments.

cd
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

See here if you need a different installer besides Linux used above.

Using Docker

If you dont have experience with bioinformatic tools just copy the commands into your terminal to set everything up:

sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -a -G docker $USER

Docker installation details here

Runtime and disk space

Attention: RIBAP is currently not intended to be used with hundreds or thousands of input genomes (see also Limitations). Also for smaller input sets, you will need quite some disk space to store the ILPs and their results. Due to that, RIBAP is per default not storing the intermediate ILP files and solutions. You can use --keepILPs to store them, if needed. Below, we report runtime and disk space usage on a regular Linux laptop (8 CPUs used) and a HPC (SLURM, pre-configured CPU and RAM usage, runtime may be biased due to HPC work load) using varying numbers of Chlamydia psittaci genomes (~1 Mbp, originally sampled from here and also available here) as input and with and without the --keepILPs option. Please also not that we used the default --chunks 8 value for local and HPC execution. Especially on a HPC, the runtime can be increased drastically by increasing the --chunks value (more parallel ILP solving).

CPUs #genomes (1 Mbp) --keepILPs time work space output space
8 8 NO 24 min 1.2 GB 195 MB
8 8 YES 23 min 5.5 GB 195 MB
8 16 NO 1 h 4 min 1.5 GB 398 MB
8 16 YES 1 h 18 GB 398 MB
8 32 NO 3 h 10 min 2.2 GB 788 MB
8 32 YES 3 h 16 min 84 GB 789 MB
HPC 8 NO 6 min 1.2 GB 196 MB
HPC 8 YES 6 min 5.5 GB 196 MB
HPC 16 NO 17 min 1.5 GB 400 MB
HPC 16 YES 16 min 21 GB 400 MB
HPC 32 NO 1 h 1 min 2.2 GB 794 MB
HPC 32 YES 1 h 11 min 83 GB 794 MB

Commands used (in slight variations):

# laptop
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta '*.fasta' --cores 8 --max_cores 8 -profile local,docker -w work --output ribap-results --keepILPs
# HPC
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta '*.fasta' -profile slurm,singularity -w work --output ribap-results --keepILPs

Limitations

Publication

If you find RIBAP useful, please cite it in your work:

Kevin Lamkiewicz, Lisa Marie Barf, Konrad Sachse & Martin Hölzer. "RIBAP: a comprehensive bacterial core genome annotation pipeline for pangenome calculation beyond the species level." Genome Biology, 25, 170 (2024)

However, RIBAP would not have been possible without the help of many great open source bioinformatics tools. Kudos to all the developers and maintainers! Please cite them in your work as well!

References

In particular, RIBAP takes advantage of and uses the following tools:

Roary

Page, Andrew J., et al. "Roary: rapid large-scale prokaryote pan genome analysis." Bioinformatics 31.22 (2015): 3691-3693.

Code | Publication

Prokka

Seemann, Torsten. "Prokka: rapid prokaryotic genome annotation." Bioinformatics 30.14 (2014): 2068-2069.

Code | Publication

MMSeqs2

Steinegger, Martin, and Johannes Söding. "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets." Nature biotechnology 35.11 (2017): 1026-1028.

Code | Publication

ILPs

Martinez FV, Feijão P, Braga MD, Stoye J. "On the family-free DCJ distance and similarity" Algorithms Mol Biol. 2015 Apr 1;10(1):13

Publication

GLPK ILP solver

Makhorin, Andrew. "GLPK (GNU linear programming kit)." http://www. gnu. org/s/glpk/glpk. html (2008).

MAFFT

Katoh, Kazutaka, and Daron M. Standley. "MAFFT multiple sequence alignment software version 7: improvements in performance and usability." Molecular biology and evolution 30.4 (2013): 772-780.

Code | Publication

FastTree

Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. "FastTree 2 - approximately maximum-likelihood trees for large alignments." PloS one 5.3 (2010): e9490.

Code | Publication

cd-hit

Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li. "CD-HIT: accelerated for clustering the next generation sequencing data." Bioinformatics, (2012), 28 (23): 3150-3152

Code | Publication

IQ-TREE

Minh, Bui Quang, et al. "IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era." Molecular biology and evolution 37.5 (2020): 1530-1534.

Code | Publication

UpSetR

Conway, Jake R., Alexander Lex, and Nils Gehlenborg. "UpSetR: an R package for the visualization of intersecting sets and their properties." Bioinformatics (2017).

Code | Publication