ukaraoz / microtrait

Other
29 stars 6 forks source link

microTrait

microTrait is an R package that provides a workflow to extract fitness traits from microbial genome sequences. microTrait uses profile hidden Markov models (profile-HMM) and simple logical operations to predict and map protein family content represented in a genome sequence to fitness traits. The HMMs used in the microTrait framework (microtrait-hmms) represent protein family sequence diversity accumulated in genomes and metagenomes in IMG/M, have been curated, benchmarked using KEGG orthologs to establish trusted cutoff (TC) scores.

The genome inferred traits represented by microTrait are summarized in the figure below.

microTrait maps a genome sequence to a hierarchy of traits summarized in this figure. The genomic basis for each trait was compiled from the literature accessible in this file.

Installation

microTrait has the following software and data dependencies.

1. Software dependencies

The following binaries should be in your PATH so that they can be accessed regardless of location.

The following are required for calculating features used in calculating Optimal Growth Temperatures based on regression model from Sauer (2009) based on genomic, ORF, protein and tRNA features.

2. Data dependencies

See setup section below for deployment of these databases after installation.

3. R package dependencies

The developmental version of microtrait can be installed using the devtools package. First install and load devtools:

install.packages("devtools")
library(devtools)

Next install R package dependencies for microTrait:

4. Installation of microTrait:

Now install microTrait with devtools:

devtools::install_github("ukaraoz/microtrait")

5. Setup of HMM database:

Due to its large size (~170M), microtrait-hmm isn't packaged with microTrait. After installation, microtrait-hmm database has to be downloaded and deployed.

Load microTrait:

library(microtrait)

microTrait includes a function (prep.hmmmodels()) that downloads and deploys hmm models that underlie microTrait. This function should be run once after the installation.

microtrait::prep.hmmmodels()

The following HMM database files should now be available under microtrait/extdata/hmm in your default R library path (.libPaths())

list.files(file.path(.libPaths(), "microtrait/extdata/hmm/hmmpress"))
 [1] "arcbacribosomal.hmmdb.h3f" "arcbacribosomal.hmmdb.h3i"
 [3] "arcbacribosomal.hmmdb.h3m" "arcbacribosomal.hmmdb.h3p"
 [5] "dbcan.select.v8.hmmdb.h3f" "dbcan.select.v8.hmmdb.h3i"
 [7] "dbcan.select.v8.hmmdb.h3m" "dbcan.select.v8.hmmdb.h3p"
 [9] "microtrait.hmmdb.h3f"      "microtrait.hmmdb.h3i"     
[11] "microtrait.hmmdb.h3m"      "microtrait.hmmdb.h3p"     

Usage

Extracting traits from a genome sequence

Single genome

Once microTrait is correctly setup, you can extract traits from a genome sequence. For the following example, we use the genome sequence of "Thermus tenuipuniceus YIM 76954" (IMG Taxon ID 2896171935 associated with GOLD Analysis Ga0452602) included as part of microTrait package.

library(microtrait)
genome_file <- system.file("extdata/genomic/2896171935.fna", package="microtrait")
result = extract.traits(genome_file)

microTrait packages its results into an R list with the following named components. The information under the component "microtrait_result" is also saved permanently as an R object (.rds file) with path stored in result$rds_file.

Multiple (thousands) genomes (parallel workflow)

A simple data level parallelization of microTrait for multiple genomes is implemented using foreach and parallel R packages in the `microtrait::extract.traits.parallel`` function.

To exemplify the processing of multiple genomes with microTrait, we will process a test dataset of 100 genomes from IMG available here. Download and unpack it under your package install directory:

download.file("https://github.com/ukaraoz/microtrait-hmm/releases/download/latest/100genomes.tar.gz",
    system.file("extdata/genomic/100genomes.tar.gz", package = "microtrait"))
untar(system.file("extdata/genomic/100genomes.tar.gz", package = "microtrait"), 
      exdir = system.file("extdata/genomic", package = "microtrait"))

The downloaded files will be under the following genomes_dir directory:

genomes_dir = system.file("extdata/genomic/100genomes", package = "microtrait")
genomes_files = list.files(genomes_dir, full.names = T, recursive = T, pattern = ".fna$")

Determine the number of "cores", or computational units on your machine/server using parallel::detectCores() function:

message("Number of cores:", parallel::detectCores(), "\n")

Here, we use ~70% of the available cores to process 100 genomes:

library("tictoc")
tictoc::tic.clearlog()
tictoc::tic(paste0("Running microtrait for ", length(genomes_files)))
microtrait_results = extract.traits.parallel(genomes_files, dirname(genomes_files), ncores = floor(parallel::detectCores()*0.7)
tictoc::toc(log = "TRUE")

When completed, microtrait_results will hold a list of lists for microTrait results. We can pull the paths to the corresponding "rds" files as follows:

rds_files = unlist(parallel::mclapply(microtrait_results, "[[", "rds_file", mc.cores = ncores))

Next, using these rds_files we will combine these results from multiple genomes.

Building trait matrices from microTrait outputs

microtrait::make.genomeset.results() combines microtrait outputs for multiple genomes into trait matrices (genomes x traits). Additionally, the resulting object will hold rules and hmm matrices.

genomeset_results = make.genomeset.results(rds_files = rds_files,
                                           ids = sub(".microtrait.rds", "", basename(rds_files)),
                                           ncores = 1)

The output genomeset_results will be a list with 5 elements, each of which is data.frame corresponding to extracted traits at different granularities, rules, and microtrait-hmm hits.

names(genomeset_results)
[1] "trait_matrixatgranularity1" "trait_matrixatgranularity2"
[3] "trait_matrixatgranularity3" "hmm_matrix"                
[5] "rule_matrix"

Analyzing trait data from a set of genomes

To demonstrate analysis of trait data from a set of genomes, we will use a dataset consisting of all environmental isolate genomes from IMG database. The process to combine microTrait results for multiple genomes is as above. Here, we use the precomputed microTrait outputs for this set that are available as part of the microTrait package.

base_dir = system.file("extdata/precomputed",package = "microtrait")
dataset = "environmentalgenomes"
genomeset_results = readRDS(file.path(base_dir, paste0(dataset, ".microtraitresults.rds")))
lapply(genomeset_results, dim)

If you have genome metadata, you can append them to the resulting matrices as additional columns using microtrait::add.metadata. Here we use GOLD metadata for these genomes, also available as part of the package.

library(dplyr)
genome_metadata = readRDS(file.path(base_dir, paste0(dataset, ".metadata.rds")))
genomeset_results_wmetadata = microtrait::add.metadata(genomeset_results, genome_metadata, genome_metadata_idcol = "IMG Taxon ID") %>% convert_traitdatatype(binarytype = "logical")
saveRDS(genomeset_results_wmetadata, file.path(base_dir, paste0(dataset, ".microtraitresults.rds")))

Normalize count traits

The utility function microtrait::trait.normalize() will correct the count traits for a given metric from the metadata column. Here, we use genome size (Estimated Size) to normalize count traits.

genomeset_results_wmetadata_norm = genomeset_results_wmetadata %>% trait.normalize(normby = "Estimated Size")

Trait-to-trait associations

Given the trait values across a large number of genomes, we can explore the association structure between traits.

For this analysis, we use the trait measurements at the most granular level (trait_matrixatgranularity3) and focus on soil genomes.

traits = traits_listbygranularity[[3]] %>%
  dplyr::select(`microtrait_trait-name`) %>%
  dplyr::filter(`microtrait_trait-name` != "Resource Use:Chemotrophy:chemolithoautotrophy:anaerobic ammonia oxidation") %>%
  dplyr::pull(`microtrait_trait-name`)
trait_matrixatgranularity3 = genomeset_results_wmetadata_norm[["trait_matrixatgranularity3"]] %>%
  dplyr::filter(`Ecosystem_Category` %in% c("Terrestrial", "Plants")) %>%
  dplyr::filter(`Ecosystem_Type` %in% c("Soil", "Rhizoplane", "Rhizosphere", "Roots")) %>%
  dplyr::select(c("id", traits, "mingentime", "ogt")) %>%
  #dplyr::slice(1:100) %>%
  dplyr::filter(`mingentime` < 100) %>%   # max out mingentime at 25 days to avoid errors
  dplyr::filter(!is.na(`mingentime`) & !is.na(`ogt`))

We first transform continous traits into binary traits (with microtrait:: trait.continuous2binary()), treating them as presence/absence traits.

trait_matrixatgranularity3_binary = trait_matrixatgranularity3 %>% microtrait::trait.continuous2binary()

Use microtrait::trait2trait_corr() to plot trait correlation matrix and write correlations into a tab-delimited file.

trait2trait_corr(trait_matrixatgranularity3_binary, verbose = TRUE, idcol = "id", outdir = base_dir, dataset = "soilgenomes")

Defining guilds

Generate genome-to-genome distance matrix

Use microtrait::calc_mixeddist() to generate the genome to genome distance matrix based on the trait profiles. The default distance metric is Wishart (2003) distance for mixed data types.

genomeset_distances = trait_matrixatgranularity3 %>% microtrait::calc_mixeddist(idcol = "id", col2ignore = c("mingentime", "ogt"), method = "wishart", binarytype = "logical", byrow = 1, verbose = TRUE)

Also compute trait prevalence in the dataset with microtrait::compute.prevalence():

prevalence = compute.prevalence(trait_matrixatgranularity3_binary, type="trait_matrixatgranularity3")

Now, apply hierarchical clustering based on the distance matrix and plot the corresponding heatmap. We use unit function from grid package.

library(grid)
library(ComplexHeatmap)
library(tibble)
A4_ratio = 11.75/8.25
width = grid::unit(8.25*4, "inches")
height = grid::unit(8.25*4*A4_ratio, "inches")
cluster_traitmatrix(trait_matrix = trait_matrixatgranularity3_binary,
                    idcol = "id", annot_cols = c("mingentime", "ogt"), granularity = 3,
                    clustering_distance_rows = genomeset_distances, clustering_distance_cols = "binary",
                    width = width, height = height,
                    heatmap_width = width*0.8, heatmap_height = height*0.70,
                    row_dend_width = width*0.05, column_dend_height = height*0.02,
                    rightannotation_width = width*0.1, topannotation_height = height*0.02,
                    bottomannotation_height = height*0.005,
                    outdir = base_dir, dataset = "soilgenomes", pdf = TRUE)
Quantification of inter-guild variance

We quantify the variance in the distance matrix as a function of the number of guilds. This is compute-heavy, a compute environment with high parallelization capability is warrented. First generate a range for number of guilds across which we want to quantify variance. As an example, we go from 2 guilds to the total number of genomes with a step size of 2.

nguilds = seq(2, nrow(trait_matrixatgranularity3), 2)
hclust_rows = clusters_traitmatrix[[1]]

maov_results_list = parallel::mclapply(2:length(nguilds),
                     function(i) {
                       v = cutree(hclust_rows, nguilds[i])
                       genome2guild = data.frame(guild = factor(v))
                       rownames(genome2guild) = names(v)
                       adonis_results = vegan::adonis2(distance ~ guild, data = genome2guild, perm = 1)
                       adonis_results
                     },
                     mc.cores = floor(0.8*detectCores()))

We will read precomputed results:

maov_results_list = readRDS(system.file("extdata/precomputed/soilgenomes_maov_results_list.rds", package="microtrait"))

maov_results = matrix(nrow = 0, ncol = 4)
for(i in 1:length(maov_results_list)) {
  #cat(i, "\n")
  maov_results = rbind(maov_results,
                       c(nguilds[i], maov_results_list[[i]]$R2))
}

maov_results = data.frame(`number of guilds` = maov_results[,1],
                          `percent variance` = round(maov_results[,2]*100,2),
                          check.names = F)
p = ggplot2::ggplot(maov_results,aes(x=`number of guilds`, y=`percent variance explained`)) +
  geom_line() + geom_point(size=0.8) +
  #xlim(1, 500) +
  geom_hline(yintercept = 60, color = "red", size = 0.3) +
  geom_vline(xintercept = 100, color = "red", size = 0.3) +
  geom_hline(yintercept = 70, color = "red", size = 0.3) +
  geom_vline(xintercept = 194, color = "red", size = 0.3) +
  geom_hline(yintercept = 80.04, color = "red", size = 0.3) +
  geom_vline(xintercept = 400, color = "red", size = 0.3) +
  scale_x_continuous(breaks = c(seq(0, 3556, by = 200), 3556)) +
  scale_y_continuous(breaks = seq(0, 100, by = 10))
dataset = "soilgenomes"
ggsave(p, height = 6, width = 12,
       filename = file.path(base_dir, paste0(dataset, ".guilds.percentvariance.pdf")))

Use microtrait::defineguilds() to define guilds at a given percent variance explained. We will define guilds so as to capture 70% of inter-genome trait variation.

nguildsat70perc = maov_results[which.min(abs(maov_results[,2]-70)), "number of guilds"]
defined_guilds = microtrait::define_guilds(trait_matrixatgranularity3_binary,
                                        clusters_traitmatrix,
                                        nguilds = nguildsat70perc,
                                        guildsizecutoff = 50,
                                        outdir, dataset, verbose = TRUE)