microEcology / pime

PIME: A package for discovery of novel differences among microbial communities
GNU General Public License v3.0
5 stars 2 forks source link

PIME: Prevalence Interval for Microbiome Evaluation

PIME removes the within group variation found in metataxonomic surveys (16S rRNA datasets) by capturing only biological differences at high samples prevalence levels.

First steps: installing 'phyloseq' and creating a 'phyloseq' object

To install phyloseq start R and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("phyloseq")

Phyloseq aggregates, mainly, an otu_table (with OTUs abundances), tax_table (taxonomy), and a sample_data (sample information) objects.

Importing output from DADA2

seqtab = readRDS("path_to_file/sequence_table_final.rds")
tax= readRDS("path_to_file/tax_final.rds")
map <- "path_to_file/sample_data.txt"
ps <- phyloseq(otu_table(seqtab, taxa_are_rows=FALSE), 
               tax_table(taxa))
sample_metadata = import_qiime_sample_data(map)
physeq =merge_phyloseq(ps, sample_metadata)

Importing otu_table as a biom file with taxonomy

jsonbiomfile = "path_to_file/otu_table_fix.biom"
mapfile = "path_to_file/v35_map_uniquebyPSN.txt"
biom = import_biom(jsonbiomfile, mapfile, parseFunction=parse_taxonomy_greengenes)
map = import_qiime_sample_data(mapfile)
input = merge_phyloseq(biom,map)

For further data input methods and examples go to https://joey711.github.io/phyloseq/import-data.html

How to install PIME package

To install PIME first install the devtools package. Then load the library(devtools) and run install_github using the following commands.

#install.packages("devtools")
library(devtools)
install_github("microEcology/pime")

PIME uses a Phyloseq object as input. A description of the phyloseq object and a tutorial on how to create this file in R using OTU tables in many different formats is detailed into the Phyloseq website https://joey711.github.io/phyloseq/

Step-by-step example

The first step in PIME is to define if the microbial community presents a high relative abundance of taxa with low prevalence, which is considered as noise in PIME analysis. This is calculated by random forests analysis. In this example we run PIME using the restroom dataset (https://doi.org/10.1007%2Fs10482-017-0976-6) against the metadata variable called Environment (a variable with two categories: men’s and women’s restroom).

Prediction using random forests on full dataset. Results in Out of Bag error rate. The input file was rarefied at 500 sequences for the purpose of this example (speed up the analysis). Using a rarefied dataset is recommended at this step.

library(pime)
data("restroom")
pime.oob.error(restroom, "Environment")
#> [1] 0.4444444

The OOB error rate <=0.1, indicated the dataset present large differences, and pime might not remove much of the noise. Higher OOB error rate indicates that the next functions should be run to find the best prevalence interval for the dataset.

This function takes two parameters: The phyloseq object (restroom) and the predictor variable (Environment).

Split the dataset by predictor variable

Two parameters are required to run this function: The phyloseq object (restroom) and the predictor variable (Environment).

per_variable_obj= pime.split.by.variable(restroom, "Environment")
per_variable_obj
#> $Restroom_F
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 1858 taxa and 9 samples ]
#> sample_data() Sample Data:       [ 9 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 1858 taxa by 7 taxonomic ranks ]
#> 
#> $Restroom_M
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 1983 taxa and 9 samples ]
#> sample_data() Sample Data:       [ 9 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 1983 taxa by 7 taxonomic ranks ]

Calculate the highest possible prevalence intervals

This function calculates prevalence for different intervals by increments of 5. The input file is the output from the pime.split.by.variable(per_variable_obj)

prevalences=pime.prevalence(per_variable_obj)
head(prevalences)
#> $`5`
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 3253 taxa and 18 samples ]
#> sample_data() Sample Data:       [ 18 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 3253 taxa by 7 taxonomic ranks ]
#> 
#> $`10`
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 3253 taxa and 18 samples ]
#> sample_data() Sample Data:       [ 18 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 3253 taxa by 7 taxonomic ranks ]
#> 
#> $`15`
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 593 taxa and 18 samples ]
#> sample_data() Sample Data:       [ 18 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 593 taxa by 7 taxonomic ranks ]
#> 
#> $`20`
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 593 taxa and 18 samples ]
#> sample_data() Sample Data:       [ 18 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 593 taxa by 7 taxonomic ranks ]
#> 
#> $`25`
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 222 taxa and 18 samples ]
#> sample_data() Sample Data:       [ 18 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 222 taxa by 7 taxonomic ranks ]
#> 
#> $`30`
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 222 taxa and 18 samples ]
#> sample_data() Sample Data:       [ 18 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 222 taxa by 7 taxonomic ranks ]

Calculate the best prevalence interval for the dataset

This function will return a table with Out of Bag error from random forests for each prevalence interval. The number of taxa and the number of remaining sequences for each prevalence interval are also computed. The best prevalence interval value provides the clearest separation of communities while still including a majority of the taxa in the analysis. If true differences are present. It will be represented by the first interval in which the OOB error rate is zero or close to zero. The input file is the list of prevalences generated by the pime.prevalence (prevalences) and the predictor variable ("Environment"").

set.seed(42)
best.prev=pime.best.prevalence(prevalences, "Environment")
#>        Interval OOB error rate (%) OTUs Nseqs
#>   Prevalence 5%              55.56 3253  9000
#>  Prevalence 10%                 50 3253  9000
#>  Prevalence 15%              11.11  593  5438
#>  Prevalence 20%              11.11  593  5438
#>  Prevalence 25%              11.11  222  4370
#>  Prevalence 30%              11.11  222  4370
#>  Prevalence 35%               5.56  117  3835
#>  Prevalence 40%               5.56  117  3835
#>  Prevalence 45%                  0   77  3531
#>  Prevalence 50%                  0   77  3531
#>  Prevalence 55%               5.56   77  3531
#>  Prevalence 60%                  0   45  3088
#>  Prevalence 65%                  0   45  3088
#>  Prevalence 70%                  0   26  2617
#>  Prevalence 75%                  0   26  2617
#>  Prevalence 80%                  0   16  2084
#>  Prevalence 85%                  0   16  2084
#>  Prevalence 90%                  0    4  1489
#>  Prevalence 95%                  0    4  1489

In addition, it also returns the results from the random forests classification for each prevalence level. It includes SequenceID (OTU/ASV), Mean Decrease Accuracy (MDA) for each sample group, that is how much that SequenceID was important for classification of that group. The Mean Decrease Impurity (Gini Importance) and taxonomy are also included.

To get the table with OTU/ASV importance of the chosen prevalence interval. Pime keeps only the top 30 OTUs/ASVs, with highest MDA.

imp65=best.prev$`Importance`$`Prevalence 65`
knitr::kable(head(imp65), format="markdown")
SequenceID Restroom_F Restroom_M MeanDecreaseAccuracy MeanDecreaseGini Rank1 Rank2 Rank3 Rank4 Rank5 Rank6 Rank7
denovo87919 0.0909333 0.0883333 0.0804214 1.3798848 k__Bacteria p__Firmicutes c__Bacilli o__Lactobacillales f__Lactobacillaceae g__Lactobacillus s__iners
denovo22521 0.0420333 0.0257333 0.0312325 0.8396941 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Pseudomonadales f__Pseudomonadaceae g__Pseudomonas s__
denovo65044 0.0282000 0.0372857 0.0292992 0.6619939 Unassigned NA NA NA NA NA NA
denovo6450 0.0302667 0.0171000 0.0215468 0.4021410 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Sphingomonadales f__Sphingomonadaceae g__Sphingomonas s__
denovo1419 0.0098000 0.0197000 0.0127857 0.3298748 k__Bacteria p__Actinobacteria c__Actinobacteria o__Bifidobacteriales f__Bifidobacteriaceae NA NA
denovo88424 0.0098667 0.0168667 0.0122939 0.3698211 k__Bacteria p__Actinobacteria c__Actinobacteria o__Actinomycetales f__Corynebacteriaceae g__Corynebacterium s__

#To get the table with OOB error results.
#best.prev$`OOB error`

Within this dataset the best prevalence interval was 65%

To obtain the phyloseq object at this cutoff use the following command.

prevalence.65 = prevalences$`65`
prevalence.65
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 45 taxa and 18 samples ]
#> sample_data() Sample Data:       [ 18 samples by 6 sample variables ]
#> tax_table()   Taxonomy Table:    [ 45 taxa by 7 taxonomic ranks ]

To obtain the confusion matrix from random forests classification use the folowing:

best.prev$Confusion$`Prevalence 65`
#>            Restroom_F Restroom_M class.error
#> Restroom_F          9          0           0
#> Restroom_M          0          9           0

Estimating prediction error

To estimate error in prediction, we will use pime.error.prediction() to randomly assign treatments to samples and run random forests classification on each prevalence interval. The function returns a boxplot and a table with results of each classification error. For the purposes of this example we are running only 10 randomizations for saving time but we recommend at least 100 randomizations to obtain reliable results.

randomized=pime.error.prediction(restroom, "Environment", bootstrap = 10, parallel = TRUE, max.prev = 95)
randomized$Plot
randomized$'Table results'

It is also possible to estimate the variation of OOB error with each prevalence interval filtering. This is done by running the random forests classification for n times, determined by the user. This function will return a boxplot figure and a table for each classification error.

replicated.oob.error= pime.oob.replicate(prevalences, "Environment", bootstrap = 10, parallel = TRUE)

Getting Help

Please contact us if you need any help: contact@brmicrobiome.org

PIME Team

Luiz F. W. Roesch (Universidade Federal do Pampa - Brazil)

Priscila T. Dobbler (Universidade Federal do Pampa - Brazil)

Victor S. Pylro (Universidade Federal de Lavras - Brazil)

Bryan Kolaczkowski (University of Florida - United States of America)

Jennifer C. Drew (University of Florida - United States of America)

Eric W. Triplett (University of Florida - United States of America)

Citation

PIME: a package for discovery of novel differences among microbial communities Luiz Fernando W. Roesch, Priscila Thiago Dobbler, Victor Satler Pylro, Bryan Kolaczkowski, Jennifer C. Drew, Eric W. Triplett doi: https://doi.org/10.1111/1755-0998.13116