Roary ILP Bacterial core Annotation Pipeline.
Annotate genes in your bacterial genomes with Prokka and determine a pangenome with the great Roary. The initial gene clusters found by Roary are further refined with the usage of ILPs that solve the best matching for each pairwise strain MMseqs2 comparison.
A common task when you have a bunch of bacterial genomes at hand is calculating a core gene set or pangenome. We want to know, which genes are homologs and shared between a set of bacteria. However, defining homology only based an sequence similarity often underestimates the true core gene set, particularly when diverse species are compared. RIBAP combines sequence homology information from Roary with smart pairwise ILP calculations to produce a more complete core gene set - even on genus level. First, RIBAP performs annotations with Prokka, calculates the core gene set using Roary and pairwise ILPs, and finally visualizes the results in an interactive HTML table garnished with protein multiple sequence alignments and trees. RIBAP comes with Nextflow and Docker/Singularity/Conda options to resolve all necessary software dependencies for easy execution.
You just need a working nextflow
and docker
or singularity
or conda
installation, see below! We recommand the usage of containers! You have nextflow
and docker
? Give it a try:
nextflow pull hoelzer-lab/ribap
nextflow info hoelzer-lab/ribap
# select a release version, e.g. 1.0.0
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta "$HOME/.nextflow/assets/hoelzer-lab/ribap/data/*.fasta" -profile local,docker
You have nextflow
and conda
? Okay, just change the profile:
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta "$HOME/.nextflow/assets/hoelzer-lab/ribap/data/*.fasta" -profile local,conda
You need some of these dependencies? Check the end of this README!
# Get or update the workflow:
nextflow pull hoelzer-lab/ribap
# Run a specific release, check whats available:
nextflow info hoelzer-lab/ribap
# Or get newest release version automatically and show the help:
REVISION=$(nextflow info hoelzer-lab/ribap | sed 's/ [*]//' | sed 's/ //g' | sed 's/\[t\]//g' | awk 'BEGIN{FS=" "};{if($0 ~ /^ *[0-9]/){print $0}}' | sort -Vr | head -1)
nextflow run hoelzer-lab/ribap -r $REVISION --help
# Run with IQ-TREE tree calculation using all identified core genes and specified output dir.
# Attention: less than 1000 bootstrap replicates (default) can not be used.
# For intermediate files, a work dir via -w is defined.
# Note, that Nextflow build-in parameters use a single dash "-" symbol.
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --tree --bootstrap 1000 --outdir ~/ribap -w ribap-work
# The ILPs can take a lot (!) of space. But you can use this flag to keep them anyway. Use with caution!
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --outdir ~/ribap -w ribap-work --keepILPs
# Run with optional reference GenBank file to guide Prokka annotation.
# ATTENTION: this will use the additional reference file for every input genome!
nextflow run hoelzer-lab/ribap -r $REVISION --fasta '*.fasta' --reference GCF_000007205.1_ASM720v1_genomic.gbff --outdir ~/ribap -w work
# Use list parameter to provide genome FASTAs and corresponding reference GenBank files in CSV format.
nextflow run hoelzer-lab/ribap -r $REVISION --list --fasta genomes.csv --reference refs.csv --outdir ~/ribap -w work
# genomes.csv:
#
# genome1,genome1.fasta
# genome2,genome2.fasta
# genome3,genome3.fasta
# refs.csv
#
# genome1,ref.gbff
# genome2,ref.gbff
#
# Here, genome1 and genome2 will additionally use information from ref.gbff in Prokka annotation while genome3 will be annotated w/o additional reference information.
RIBAP stores all important output in --output results
. Besides CDS annotations, alignments, trees, and the core gene groups one of the main output files is an interactive HTML table that summarizes the identified RIBAP groups and can be explored by the user:
Further output examples for a small test set composed of five Chlamydia genomes can be found here.
nextflow
using docker
or singularity
or conda
docker
, singularity
, or conda
docker
or sinularity
or conda
and nextflow
need to be installed (per default conda
is used)Attention Use the below installation examples with causion and check you system requirements. Its best to refer to the specific manual of each software.
Needed in all cases (conda
, docker
, singularity
)
sudo apt-get update
sudo apt install -y default-jre
curl -s https://get.nextflow.io | bash
sudo mv nextflow /bin/
Just copy the commands and follow the installation instructions. Let the installer configure conda
for you. You need to specify -profile conda
to run the pipeline with conda
support. Note, internally the pipeline will switch to mamba
to faser resolve all environments.
cd
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
See here if you need a different installer besides Linux used above.
If you dont have experience with bioinformatic tools just copy the commands into your terminal to set everything up:
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -a -G docker $USER
Docker installation details here
Attention: RIBAP is currently not intended to be used with hundreds or thousands of input genomes (see also Limitations). Also for smaller input sets, you will need quite some disk space to store the ILPs and their results. Due to that, RIBAP is per default not storing the intermediate ILP files and solutions. You can use --keepILPs
to store them, if needed. Below, we report runtime and disk space usage on a regular Linux laptop (8 CPUs used) and a HPC (SLURM, pre-configured CPU and RAM usage, runtime may be biased due to HPC work load) using varying numbers of Chlamydia psittaci genomes (~1 Mbp, originally sampled from here and also available here) as input and with and without the --keepILPs
option. Please also not that we used the default --chunks 8
value for local and HPC execution. Especially on a HPC, the runtime can be increased drastically by increasing the --chunks
value (more parallel ILP solving).
CPUs | #genomes (1 Mbp) | --keepILPs |
time | work space |
output space |
---|---|---|---|---|---|
8 | 8 | NO | 24 min | 1.2 GB | 195 MB |
8 | 8 | YES | 23 min | 5.5 GB | 195 MB |
8 | 16 | NO | 1 h 4 min | 1.5 GB | 398 MB |
8 | 16 | YES | 1 h | 18 GB | 398 MB |
8 | 32 | NO | 3 h 10 min | 2.2 GB | 788 MB |
8 | 32 | YES | 3 h 16 min | 84 GB | 789 MB |
HPC | 8 | NO | 6 min | 1.2 GB | 196 MB |
HPC | 8 | YES | 6 min | 5.5 GB | 196 MB |
HPC | 16 | NO | 17 min | 1.5 GB | 400 MB |
HPC | 16 | YES | 16 min | 21 GB | 400 MB |
HPC | 32 | NO | 1 h 1 min | 2.2 GB | 794 MB |
HPC | 32 | YES | 1 h 11 min | 83 GB | 794 MB |
Commands used (in slight variations):
# laptop
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta '*.fasta' --cores 8 --max_cores 8 -profile local,docker -w work --output ribap-results --keepILPs
# HPC
nextflow run hoelzer-lab/ribap -r 1.0.0 --fasta '*.fasta' -profile slurm,singularity -w work --output ribap-results --keepILPs
--keepILPs
parameter to save space.If you find RIBAP useful, please cite it in your work:
However, RIBAP would not have been possible without the help of many great open source bioinformatics tools. Kudos to all the developers and maintainers! Please cite them in your work as well!
In particular, RIBAP takes advantage of and uses the following tools:
Page, Andrew J., et al. "Roary: rapid large-scale prokaryote pan genome analysis." Bioinformatics 31.22 (2015): 3691-3693.
Seemann, Torsten. "Prokka: rapid prokaryotic genome annotation." Bioinformatics 30.14 (2014): 2068-2069.
Steinegger, Martin, and Johannes Söding. "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets." Nature biotechnology 35.11 (2017): 1026-1028.
Martinez FV, Feijão P, Braga MD, Stoye J. "On the family-free DCJ distance and similarity" Algorithms Mol Biol. 2015 Apr 1;10(1):13
Makhorin, Andrew. "GLPK (GNU linear programming kit)." http://www. gnu. org/s/glpk/glpk. html (2008).
Katoh, Kazutaka, and Daron M. Standley. "MAFFT multiple sequence alignment software version 7: improvements in performance and usability." Molecular biology and evolution 30.4 (2013): 772-780.
Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. "FastTree 2 - approximately maximum-likelihood trees for large alignments." PloS one 5.3 (2010): e9490.
Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li. "CD-HIT: accelerated for clustering the next generation sequencing data." Bioinformatics, (2012), 28 (23): 3150-3152
Minh, Bui Quang, et al. "IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era." Molecular biology and evolution 37.5 (2020): 1530-1534.
Conway, Jake R., Alexander Lex, and Nils Gehlenborg. "UpSetR: an R package for the visualization of intersecting sets and their properties." Bioinformatics (2017).
1) Since Nextflow 23.07.0-edge, Nextflow no longer mounts the host’s home directory when using Apptainer or Singularity. This causes issues in some dependencies. As a workaround, you can revert to the old behavior by setting the environment variable NXF_APPTAINER_HOME_MOUNT
or NXF_SINGULARITY_HOME_MOUNT
to true
in the machine from which you launch the pipeline. From: https://www.nextflow.io/docs/edge/container.html.
2) RIBAP might fail due to RecursionError: maximum recursion depth exceeded in comparison
. This can happen in the current combining step where Roary initial clusters are combined with the ILP results. This is done in a Python script where the default recursion depth is 1000. If this happens, you can increase the recursion depth via the parameter --set_recursion_limit
. However, be careful not increasing this too much (we tested until 5000). See als https://github.com/hoelzer-lab/ribap/issues/66