Bacterial taxonomy construction and evaluation in R
bactaxR is an R package which contains functions to aid in average-nucleotide identity (ANI)-centric bacterial taxonomy construction and evaluation. Specific functions include:
Post issues at https://github.com/lmc297/bactaxR/issues
Carroll, Laura M., Martin Wiedmann, Jasna Kovac. 2020. "Proposal of a Taxonomic Nomenclature for the Bacillus cereus Group Which Reconciles Genomic Definitions of Bacterial Species with Clinical and Industrial Phenotypes." mBio 11(1): e00034-20; DOI: 10.1128/mBio.00034-20.
Download R, if necessary: https://www.r-project.org/
Dowload R Studio, if necessary: https://www.rstudio.com/products/rstudio/download/
Open R Studio, and install the devtools
package, if necessary, by typing the following command into R Studio's console:
install.packages("devtools")
ggtree
from Bioconductor, if necessary, by running the following commands from R Studio's console:if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ggtree")
devtools
by typing the following command into R Studio's console:library(devtools)
bactaxR
by typing the following command into R Studio's console:install_github("lmc297/bactaxR")
Note: Users who get an error when installing bactaxR
should run the following command before attempting to install bactaxR
again: Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS"=TRUE)
bactaxR
by typing the following command into R Studio's console:
library(bactaxR)
Click here to download the data set. This tab-separated file was produced by FastANI, with query genomes in the first column, reference genomes in the second column, and ANI values in the third column.
Feel free to save this file in the directory of your choice; if you would like to follow along with this tutorial exactly, using identical path/file names, save this file in your home directory as bactaxR_fastani_output.txt
.
Open RStudio; if you have not already installed bactaxR
, follow the installation instructions above.
If you have not already done so, load bactaxR
using the following command:
library(bactaxR)
bactaxRObject
; this will allow us to construct dendrograms and graphs and identify medoid genomes, exactly as was done in the paper. To do so, run the following command (replace ~/bactaxR_fastani_output.txt
with the path to your own file, if necessary):ani <- read.ANI(file = "~/bactaxR_fastani_output.txt")
This command:
read.ANI
function in bactaxR
to read bactaxR_fastani_output.txt
, an output file produed by FastANI bactaxRObject
, assigning it to the variable name ani
Note: bactaxR can use pairwise ANI values calculated using any ANI tool, not just FastANI; you can use read.ANI
with any headerless, tab-delimited file where query genome is in the first column, reference genome is in the second, and ANI values (ranging from 0 to 100) are in the third. Additionally, if you already have a data frame of query genomes/reference genomes/ANI values loaded into R, you can use the load.ANI
function to store it as a bactaxRObject
. Note that both of these functions check to make sure that these ANI values are pairwise all-vs-all ANI values (i.e., all query genome names must be identically present in the reference genome column, and vice-versa). Additionally, your ANI values should be between 0 and 100 (i.e., if they are between 0 and 1, multiply them by 100; for example, 0.95 ANI becomes 95 ANI, 0.971 ANI becomes 97.1 ANI). See ?read.ANI
and ?load.ANI
for more information.
We can obtain a summary of our bactaxR
object using the following command:
summary(ani)
This should tell us that our data set has 36 genomes and 1,296 total comparisons; this makes sense, because 36^2 = 1,296 (i.e., these are pairwise comparisons).
h
, run the following command:h <- ANI.histogram(bactaxRObject = ani, bindwidth = 0.1)
This command:
bactaxRObject
(here, we're using our bactaxRObject
which we named ani
)To view the histogram, just run:
h
For more options for annotating/displaying your histogram, see ?ANI.histogram
dend <- ANI.dendrogram(bactaxRObject = ani, ANI_threshold = 95, xline = c(4,5,6,7.5), xlinecol = c("#ffc425", "#f37735", "deeppink4", "black"), label_size = 0.5)
This command:
ANI_threshold
parameterxline
parameter for X-axis position and the xlinecol
parameter for color information (here, we have vertical lines at dissimilarity values of 4, 5, 6, and 7.5, which correspond to ANI values of 96, 95, 94, and 92.5, respectively; these parameters are just for annotating the dendrogram plot, and have no analytical value/effect on the identification of medoid genomes or dendrogram construction)label_size = 0.5
; by default, this is set to an arbitrarily small number so that tip labels are hidden)See ?ANI.dendrogram
for a complete list of options.
We can see the medoid genomes identified at our specified ANI threshold (i.e., 95) by running dend$medoid_genomes
We can see the clusters to which all of our genomes were assigned at our specified ANI threshold using dend$cluster_assignments
?ANI.graph
, we can see that we need to supply metadata (i.e., the discrete attributes which we will use to color our graph; in our case, cluster assignment) in the form of a named vector.To do this, we'll create a vector, metadata
, which contains our cluster assinments:
metadata <- dend$cluster_assignments$Cluster
names(metadata) <- dend$cluster_assignments$Genome
ANI.graph(bactaxRObject = ani, ANI_threshold = 95,
metadata = metadata,
legend_pos_x = -1.5, show_legend = T, graphout_niter = 1000000,
legend_ncol = 1, edge_color = "black")
This command:
ANI_threshold
(here, we set this to 95)metadata
(here, we used clusters identified in step 6 at a 95 ANI threshold)?ANI.graph
for more details)Click here to download the phylogeny (in Newick format).
Feel free to save this file in the directory of your choice; if you would like to follow along with this tutorial exactly, using identical path/file names, save this file in your home directory as bactaxR_phylogeny.nwk
.
Feel free to save this file in the directory of your choice; if you would like to follow along with this tutorial exactly, using identical path/file names, save this file in your home directory as sup_table_s1_genomes_ani.xlsx
(note that if you download the file as described here, it will likely be stored in your Downloads directory; please move it to your home directory if you would like to use exact path names).
Open RStudio; if you have not already installed bactaxR
, follow the installation instructions above.
If you have not already done so, load bactaxR
using the following command:
library(bactaxR)
phytools
package using the following command:library(phytools)
readxl
package using the following command:library(readxl)
read.newick
function in phytools
, via the following command (replace ~/bactaxR_phylogeny.nwk
with the path to your own file, if necessary):tree <- read.newick(file = "~/bactaxR_phylogeny.nwk")
If we run the command tree
, we can see that our phylogeny has 2,231 tips (i.e., genomes) and is unrooted.
read_excel
funtion in the readxl
package, via the following command (replace ~/sup_table_s1_genomes_ani.xlsx
with the path to your own file, if necessary):x <- read_excel(path = "~/sup_table_s1_genomes_ani.xlsx", skip = 1)
We can get a summary of our metadata by running summary(x)
table(tree$tip.label%in%x$`Study IDa,b`)
, we can see see that some of our tree tip labels cannot be found in StudyID column in our metadata table. This is because some of our tree tip labels have dashes instead of underscores. To replace all dashes in our tree tip labels with underscores, run the following command:tree$tip.label <- gsub(pattern = "-", replacement = "_", x = tree$tip.label)
If we type table(tree$tip.label%in%x$`Study IDa,b`)
, we can now see that all of our tip labels can be found in our metadata Study ID column.
tree <- root(tree, outgroup = "B_manliponensis_JCM_15802_TYPE_STRAIN.fna")
If we run tree
, we should see that our tree is now rooted.
phylo.discrete_trait_OTU
function in bactaxR
and genomospecies assignments produced at a 92.5 ANI threshold (see the column of our metadata titled "Closest Medoid Genome at 92.5 ANI Threshold (ANI)"). If we use the command ?phylo.discrete_trait_OTU
, we can see that we need to supply a named list to trait_list
, where the names correspond to the traits (i.e., genomospecies assignments), and the vectors under each name correspond to the taxa associated with each name (i.e., each genome assigned to a particular genomospecies).
Let's run the following command to create a vector named metadata.92_5
; this vector is identical to our column "Closest Medoid Genome at 92.5 ANI Threshold (ANI)", except we are discarding all of the information after a "(" character (i.e., we are removing the ANI values appended to each medoid genome):
metadata.92_5 <- unlist(lapply(strsplit(x = x$`Closest Medoid Genome at 92.5 ANI Threshold (ANI)`, split = "\\("), "[[", 1))
If we run table(metadata.92_5)
, we can see the number of genomes assigned to each genomospecies.
metadata.92_5
) using the corresponding Study IDs (which match tip labels found in the tree):names(metadata.92_5) <- x$`Study IDa,b`
metadata.92_5 <- gsub(pattern = "Bacillus_anthracis_GCF_001683155_1_ASM168315v1_genomic.fna", replacement = "B. mosaicus ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "B_bingmayongensis_FJAT-13831_TYPE_STRAIN.fna", replacement = "B. bingmayongensis ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002568015_1_ASM256801v1_genomic.fna", replacement = "B. cereus s.s. ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cytotoxicus_GCF_002251005_2_ASM225100v2_genomic.fna", replacement = "B. cytotoxicus ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "B_gaemokensis_KCTC_13318_TYPE_STRAIN.fna", replacement = "B. gaemokensis ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002559215_1_ASM255921v1_genomic.fna", replacement = "B. luti ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "B_manliponensis_JCM_15802_TYPE_STRAIN.fna", replacement = "B. manliponensis ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002550135_1_ASM255013v1_genomic.fna", replacement = "B. mycoides ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002576795_1_ASM257679v1_genomic.fna", replacement = "B. paramycoides ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_pseudomycoides_GCF_002588885_1_ASM258888v1_genomic.fna", replacement = "B. pseudomycoides ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "B_toyonensis_BCT_7112_TYPE_STRAIN.fna", replacement = "B. toyonensis ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002583515_1_ASM258351v1_genomic.fna", replacement = "Unknown B. cereus group Species 13", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002550855_1_ASM255085v1_genomic.fna", replacement = "Unknown B. cereus group Species 14 ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002559665_1_ASM255966v1_genomic.fna", replacement = "Unknown B. cereus group Species 15 ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002578045_1_ASM257804v1_genomic.fna", replacement = "Unknown B. cereus group Species 16 ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_cereus_GCF_002584535_1_ASM258453v1_genomic.fna", replacement = "Unknown B. cereus group Species 17 ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_mycoides_GCF_000746925_1_BHP_1_genomic.fna", replacement = "B. clarus ", x = metadata.92_5)
metadata.92_5 <- gsub(pattern = "Bacillus_pseudomycoides_GCF_002552625_1_ASM255262v1_genomic.fna", replacement = "Unknown B. cereus group Species 18", x = metadata.92_5)
metadata.92_5 <- na.omit(metadata.92_5)
If we run table(metadata.92_5)
, we should see that our genomospecies genome names have been repaced with species names.
trait_list
. To construct this list, we'll first initiate an empty list, and assign it to the variable name clades.92_5
:clades.92_5 <- list()
metadata.92_5
), identify all genomes which belong to that genomospecies, and store it in our list, clades.92_5
:for (i in 1:length(unique(metadata.92_5))){
myclust <- as.character(unique(metadata.92_5)[i])
tiplabs <- names(metadata.92_5)[which(metadata.92_5==myclust)]
clades.92_5[[myclust]] <- tiplabs
}
If we type names(clades.92_5)
, we should see that our list has named elements, with one per genomospecies.
viridis
palette to color known/proposed species, and colors the root of the tree and unknown species black:pal.tree_92_5 <- c("black", viridis(option = "viridis", n = 12), rep("black", 6))
phylo.discrete_trait_OTU
function in bactaxR
by running the following:tree_92_5 <- phylo.discrete_trait_OTU(phylo = tree, trait_list = clades.92_5,
color_palette = pal.tree_92_5,
phylo_layout = "circular", tip_label_size = 0.5)
This command:
phylo.discrete_trait_OTU
to color the phylogeny using a named list, clades.92_5
color_palette
To view the phylogeny, run:
tree_92_5
To see more phylogeny annotation options, see ?phylo.discrete_trait_OTU
?phylo.discrete_trait_heatmap
, we see that we need to supply a data frame to trait_data_frame
, where each row corresponds to an isolate and each column to a trait.To construct a trait data frame, with row names which match our Study ID/tree tip labels, run the following command:
traitmap <- data.frame(x$`Anthrax toxin-encoding cya, lef, and pagAc`,
x$`Cereulide synthetase-encoding cesABCDc`,
x$`Known Cry- and/or Cyt-encoding genesd`,
row.names = x$`Study IDa,b`)
colnames(traitmap) <- c("Anthrax", "Cereulide", "Thuringiensis")
traitmap$Anthrax <- ifelse(test = grepl(pattern = "Absent", x = traitmap$Anthrax), yes = "0", no = "Anthrax toxin genes present")
traitmap$Cereulide <- ifelse(test = grepl(pattern = "Absent", x = traitmap$Cereulide), yes = "1", no = "Cereulide synthetase genes present")
traitmap$Thuringiensis <- ifelse(test = grepl(pattern = "No Cry", x = traitmap$Thuringiensis), yes = "2", no = "Known Cry- and/or Cyt-encoding genes present")
tree_92_5.2 <- phylo.discrete_trait_heatmap(plot = tree_92_5, phylo = tree,
trait_data_frame = traitmap,
font_size = 0,
heatmap_width = 0.1,
heatmap_offset = 0.3,
color_palette = c("#DE49681A", "#8C29811A", "#22A8841A" , "#DE4968FF", "#8C2981FF", "#22A884FF"))
This command:
tree_92_5
, using our phylogeny (tree
) and our trait data frame (traitmap
)font_size = 0
)If we run the following, we can view our phylogeny with the heatmap added:
tree_92_5.2
Carroll, Laura M., Martin Wiedmann, Jasna Kovac. 2020. "Proposal of a Taxonomic Nomenclature for the Bacillus cereus Group Which Reconciles Genomic Definitions of Bacterial Species with Clinical and Industrial Phenotypes." mBio 11(1): e00034-20; DOI: 10.1128/mBio.00034-20.
Carroll, Laura M., Martin Wiedmann, Jasna Kovac. 2019. "Proposal of a taxonomic nomenclature for the Bacillus cereus group which reconciles genomic definitions of bacterial species with clinical and industrial phenotypes." bioRxiv 779199; doi: https://doi.org/10.1101/779199.
Jain, Chirag, et al. High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries. bioRxiv 225342; doi: https://doi.org/10.1101/225342.
Disclaimer: bactaxR is pretty neat! However, no tool is perfect. As always, interpret your results with caution. We are not responsible for taxonomic misclassifications, misclassifications of an isolate's pathogenic potential or industrial utility, and/or misinterpretations (biological, statistical, or otherwise) of bactaxR results.