vallenderlab / MicrobiomeR

A comprehensive and customizable R package for microbiome analysis.
https://vallenderlab.github.io/MicrobiomeR/
MIT License
20 stars 4 forks source link

Discuss standard workflow paradigm (phyloseq/metacoder/visualization/etc). #42

Closed grabear closed 5 years ago

grabear commented 5 years ago

Intro

Currently, we have different ways of generating one plot or another. If we can standardize how we generate our plots, it will make the data much more credible, and it should eliminate any variations that we see.

Additionally, there may be better or programmatically quicker ways of doing our analysis. I don't think there have been any updates to phylsoeq, however, metacoder has had a few huge updates this year, and so has ampvis2.

General Workflow

In general all of our workflows are essentially doing the same things:

Phyloseq Worflow

Phyloseq is a great package and it has tons of useful plotting functions. But it also uses the base R filtering functions, which are less sexy. I personally think phyloseq is great for the initial import of data, and some initial preprocessing.

Metacoder Workflow

I think metacoder's use of dplyr is the main reason for my preference. I really like the taxmap object as well. It seems more descriptive and maybe we can use it to our advantage someone (that would be in a later release though..). While metacoder also has really cool functions for analyzing taxmap data, it does make it a bit harder to use the data in other plots. In an issue that I made on metacoders repo, Zachary foster showed me a cool way to make a data table with the "diff table" you can generate for statistical analysis:

library(phyloseq)
library(metacoder)
library(dplyr)

# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Convert counts to proportions
x$data$otu_table <- calc_obs_props(x, dataset = "tax_data",
                                   cols = hmp_samples$sample_id)

# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, dataset = "otu_table",
                                     cols = hmp_samples$sample_id)

# Get per-taxon body site mean proportions 
x$data$site_means <- calc_group_mean(x, dataset = "tax_table",
                                     cols = hmp_samples$sample_id,
                                     groups = hmp_samples$body_site)

# Calculate difference between groups
x$data$diff_table <- compare_groups(x, dataset = "tax_table",
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$body_site)

# Make table of most different classes between nose an saliva
filter_taxa(x, taxon_ranks == "c", reassign_obs = FALSE) %>%
  mutate_obs("diff_table", taxon_name = taxon_names, mean_nose = Nose, mean_saliva = Saliva) %>%
  get_dataset("diff_table") %>% # New function I just added to `taxa`
  filter(wilcox_p_value <= 0.05, treatment_1 == "Nose", treatment_2 == "Saliva") %>%
  arrange(mean_diff) %>%
  transmute(Class = taxon_name, "Mean nose proportion" = mean_nose,  "Mean saliva proportion" = mean_saliva,  "Mean difference" = mean_diff, "P value" = wilcox_p_value) %>%
  knitr::kable()
Class Mean nose proportion Mean saliva proportion Mean difference P value
Bacteroidia 0.0062143 0.2488121 -0.2425979 0.0000108
Clostridia 0.0058709 0.2016887 -0.1958178 0.0000108
Gammaproteobacteria 0.0365834 0.1138908 -0.0773074 0.0051960
Betaproteobacteria 0.0312206 0.0900281 -0.0588075 0.0112070
Fusobacteria 0.0004220 0.0452540 -0.0448320 0.0001494
Flavobacteria 0.0046150 0.0242023 -0.0195873 0.0012604
Epsilonproteobacteria 0.0175916 0.0089710 0.0086206 0.0116113
Actinobacteria 0.6039006 0.0229178 0.5809828 0.0000108

Ampvis2 Workflow

Ampvis2 has a ton of really cool plots and they look great! But i haven't really used it much. We need to look more into this and see if it's worth exploring. I think the ordination plots from this package will be an excellent thing to use, but I haven't compared any of the other ordination workflows.

Ggplot2 Workflow

The correlation plots are obviously custom ggplot2 visualizations, and so are some of the other ones. Luckily all of the other packages use ggplot2 for the visualziations so this section is pretty self explainatory.

Specific Workflow Preferences

This is based off of what we currently do and what I think we should work towards in general. It may not happen this way, but it's something to work towards. As long as we can standardize the way we are producing the plots (aka we are using the same data for each ggplot), then that's what matters.

Import Data into R

Preprocess Data

This includes:

Analyze Data

This includes:

Plot data with ggplot

This includes:

grabear commented 5 years ago

Generally speaking, I like the Phyloseq Manipulation-> Metacoder Analysis -> Multi-Package Visualization workflow.

grabear commented 5 years ago

I like metacoder for the anlaysis part especially. I think the way Zachary Foster has it set up to analyze data by treatment group is great. metacoder::compare_groups also does pairwise comparisons for experiments where you have more than 2 Treatment Groups. I think that's been a concern moving forward with the MicrobiomeR package. I was trying to figure out how to reverse engineer metacoder::compare_groups for simple dataframes, but it would be much easier if we could simply utilize what's already there.

sdhutchins commented 5 years ago

As far as the alpha diversity plot, both metacoder and microbiome use the vegan package for this, and after a quick look, I can see it's being done the same way in both packages except they're using different graphical ways to represent the data which is worth exploring.

As far as stacked bar charts, I'm not doing much or any stats - simply transforming the relative abundance after using the preprocess phyloseq function.

I do see that metacoder has a way to do this...

obj$data$type_abund <- calc_taxon_abund(obj, "otu_rarefied",
                                        cols = sample_data$SampleID,
                                        groups = sample_data$Type)

I can certainly look into adopting that.

The below is pretty much all I do stats wise with phyloseq.

physeq <- get_phyloseq_obj(rdata_file = "data/silva_phyloseq_obj.RData")
physeq <- preprocess_phyloseq(physeq)

physeq_glommed_phylum <- tax_glom(physeq, "Phylum")
physeq_glommed_order <- tax_glom(physeq, "Order")
physeq_transformed_phylum <- transform_sample_counts(physeq_glommed_phylum, function(x) 100 * x / sum(x))

# Merge groups for first stacked bar chart
merged_groups <- merge_by_group(physeq)

normalized_merged_groups <- transform_sample_counts(merged_groups, function(x) 100 * x / sum(x))

Given that the heat trees are basically a refined and extended version of the stacked bar charts - we should definitely ensure that statistical analyses are the same.

grabear commented 5 years ago

we should definitely ensure that statistical analyses are the same.

@sdhutchins This! I just want to make sure we are using the same data for our visualizations. That way we can rule out the different packages as the source of any variations we might see in our analysis.

For the correlation plots, I'm making a custom dataframe from the phyloseq object. I'm going to look into making a custom dataframe from a metacoder object instead.

grabear commented 5 years ago

Joining Metacoder Observation Data with Taxonomy Data by taxon_ids

I forget

So here's what I explained earlier for our reference. I don't want to forget. @sdhutchins

> phy <- get_phyloseq_obj(".RDATA")
> proc_phy <- preprocess_phyloseq(phy)
> xx <- get_metacoder_obj(proc_phy)
> obj <- xx
> x1 <- obj %>% filter_taxa(n_supertaxa < 2, reassign_obs = FALSE)
...
> dplyr::right_join(x1$taxonomy_table(subset = taxon_ids(x1), add_id_col = TRUE), x1$data$diff_table, by = c("taxon_ids" = "taxon_id"))
# A tibble: 16 x 16
   taxon_ids Kingdom Phylum treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2 wilcox_p_value hartigan_dip_tr~ hartigan_dip_tr~ bimodality_coef~
   <chr>     <fct>   <fct>  <chr>       <chr>                  <dbl>           <dbl>       <dbl>     <dbl>       <dbl>       <dbl>          <dbl>            <dbl>            <dbl>            <dbl>
 1 ab        Bacter~ NA     Stressed    Control            -0.000165        -0.00122 -0.000114    -8.44e-4    0.999      1.000          0.0170               0.996           0.554             0.879
 2 ac        Archaea NA     Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
 3 ad        Bacter~ Firmi~ Stressed    Control             0.266            0.357    0.0737       1.01e-1    0.460      0.359          0.000170             0.501           0.169             0.473
 4 ae        Bacter~ Prote~ Stressed    Control            -0.901           -0.298   -0.00748     -6.14e-3    0.0267     0.0329         0.141                0.847           0.656             0.912
 5 af        Bacter~ Spiro~ Stressed    Control             0.830            0.607    0.00885      1.21e-2    0.0352     0.0231         0.135                0.222           0.961             0.646
 6 ag        Bacter~ Verru~ Stressed    Control            -2.05             0.416   -0.000824     5.13e-4    0.00205    0.00154        0.00852              0.814           0.797             0.946
 7 ah        Bacter~ Bacte~ Stressed    Control            -0.180           -0.303   -0.0660      -1.09e-1    0.468      0.577          0.00355              0.377           0.699             0.598
 8 ai        Bacter~ Fibro~ Stressed    Control             1.15             1.65     0.000466     1.65e-3    0.00242    0.000773       0.0728               0.871           0.929             0.796
 9 aj        Bacter~ Actin~ Stressed    Control            -2.67            -2.26    -0.000591    -6.42e-4    0.000169   0.000811       0.0000153            0.918           0.954             0.832
10 ak        Bacter~ Cyano~ Stressed    Control             2.10             1.53     0.000473     7.57e-4    0.00116    0.000402       0.0279               0.577           0.906             0.783
11 al        Bacter~ Elusi~ Stressed    Control             0.706           -1.71     0.000105    -2.33e-3    0.00103    0.00336        0.584                0.822           0.756             0.853
12 am        Bacter~ Lenti~ Stressed    Control            -0.420            0.450   -0.0000118    3.67e-5    0.000137   0.000100       0.868                0.890           0.604             0.891
13 an        Bacter~ Sacch~ Stressed    Control             3.60             2.74     0.000736     1.27e-3    0.00149    0.000224       0.000227             0.564           0.662             0.830
14 ao        Archaea Eurya~ Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
15 ap        Bacter~ Planc~ Stressed    Control             0                2.82     0.000178     5.50e-4    0.000641   0.0000911      0.00206              0.991           0.917             0.780
16 aq        Bacter~ Tener~ Stressed    Control            -0.0916          -0.169   -0.00000927  -5.17e-5    0.000416   0.000468       0.638                0.855           0.0624            0.804
# ... with 1 more variable: bimodality_coeff_treat2 <dbl>
grabear commented 5 years ago

Metacoder Style Agglomeration Example

I've also confirmed that the explanation given in the Issue I linked to above (https://github.com/grunwaldlab/metacoder/issues/234) works like phyloseq::tax_glom()

It might be cooler to add an agglomeration function like in phyloseq::tax_glom() as well.

Im not sure this is needed since this behavior is handled well by other functions. For filtering data, filter_taxa can remove ranks and reassign everything to the remaining ranks automatically. Also, calc_taxon_abund sums the counts for every taxon at every rank, so you can just subset the result of that to get the counts for a specific rank. However, since the info for every rank is there, subsetting the whole object to a specific rank is not really needed. I guess the equivalent of phyloseq::tax_glom() is something like:

library(metacoder)

# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Get per-taxon counts
x$data$tax_counts <- calc_taxon_abund(x, dataset = "tax_data",
                                      cols = hmp_samples$sample_id)

# Subset 
filter_taxa(x, taxon_ranks == "p", supertaxa = TRUE, reassign_obs = c(tax_counts = FALSE))
grabear commented 5 years ago

Metacoder Style Taxonomic Prevalence Filtering

metacoder::calc_prop_samples is a super useful function for prevalence filtering. Here I've mimicked the phyloseq workflow of:

pp_p <- "Phylsoeq_obj"
> tf_glom <- tax_glom(pp_p, "Phylum", NArm = FALSE)
> tf_unfiltered <- genefilter_sample(tf_glom, filterfun_sample(function(x) x > 10), A = 0.6*nsamples(tf_glom)) 
> p_filter <- prune_taxa(tf_unfiltered, tf_glom)
# Formating for standard evaluation
> r_filter <- get_taxa_unique(p_filter, taxonomic.rank = "Phylum")
> pp_p <- subset_taxa(pp_p, Kingdom %in% r_filter)

By using metacoder and some dplyr:

# Get Metacoder object and then get per taxon counts
> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>% calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> removed_ids <- raw_m %>% 
filter_taxa(taxon_ranks == "Phylum", reassign_obs = c(tax_counts = FALSE)) %>% 
calc_prop_samples("tax_counts", more_than = 10) %>% 
filter(n_samples < 0.6)
> new_metacoder_obj <- raw_m %>% filter_taxa(!taxon_ids %in% removed_ids$taxon_id, reassign_obs = FALSE)

(Note: These percentages (0.6) and minimum count numbers (10) are random.)

grabear commented 5 years ago

Metacoder Style OTU Prevalence Filtering

This is almost exactly the same as above, except you drop the "agglomeration" part:

> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>% calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> new_rem <- raw_m  %>% 
    calc_prop_samples("tax_counts", more_than = 10) %>% 
    filter(n_samples < 0.6)
> new_metacoder_obj <- raw_m %>% filter_taxa(!taxon_ids %in% removed_ids$taxon_id, reassign_obs = FALSE)
sdhutchins commented 5 years ago

Well that's super succint.

grabear commented 5 years ago

Metacoder Style Filtering by Coefficient of Variation with Help from Phyloseq

Here is the way I tried to do CoV filtering. I had to coerce the data into a phyloseq::otu_table and do the filtering that way. I couldn't find a function in metacoder that allows you to apply a function across the columns.. This is a bit messy...

> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>% 
calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> total <- median(as.numeric(raw_m$data$otu_table %>%
 summarise_if(is.numeric, sum)))
> standf <- function(x, t = total) round(t * (x / sum(x)))
> raw_m_otu_table <- otu_table(data.matrix(column_to_rownames(raw_m$data$otu_table[2:length(raw_m$data$otu_table)], var="otu_id")), taxa_are_rows = TRUE)
> raw_m_trans <- phyloseq::transform_sample_counts(raw_m_otu_table, standf)
> f_list <- phyloseq::filter_taxa(raw_m_trans, flist = function(x) sd(x) / mean(x) > 0.9)
> remove_otu <- f_list[f_list == FALSE]
> removed_ids <- dplyr::filter(raw_m$data$otu_table, otu_id %in% names(remove_otu))
> new_metacoder_obj <- filter_taxa(raw_m, !taxon_ids %in% remove_ids$taxon_id, reassign_obs = FALSE)
grabear commented 5 years ago

Metacoder Style Master Threshold filtering

This is not the proper way to do this, however it is a good step in the right direction.

> raw_m <- parse_phyloseq(phy)
> otu_data <- raw_m$data$otu_table
> otu_mean <- rowMeans(select(otu_data, raw_m$data$sample_data$sample_id))
> keep_these_data <- otu_data %>% mutate(OTU_mean = otu_mean) %>% select(taxon_id, otu_id, OTU_mean) %>% filter(OTU_mean > 0.00005)

Filtering steps like the master threshold filtering and the coeffiecing of variation filtering have to be done using the otu_proportions table in the following way:

This way the otu tables and taxa tables reflect each other (the observation data) as well as the taxmap object (the taxon data).

grabear commented 5 years ago

New Metacoder Format for the Package

I've come up with a new metacoder format for the package. It's really just a new naming convention for the data tables (aka observations). I find this to be a bit more intutive. Let me know what you think! @sdhutchins

# The metacoder_obj$data will contain the following tables.
# The keys here are the old phyloseq table names, and the values are the new table names
...
otu_table: "otu_abundance"
tax_data: "otu_annotations"
sample_data: "sample_data"
phy_tree: "phy_tree"
...
# These new tables are created by metacoder's calculation functions
...
calc_taxon_abund(otu_abundance): "taxa_abundance"
calc_obs_props(otu_abundance): "otu_proportions"
calc_obs_props(otu_proportions): "taxa_proportions"
...
grabear commented 5 years ago

@sdhutchins In order to get OTU ids in the otu_annotations table, I had to make a PR on the metacoder package. It's only like 5 lines, so maybe he will accept it soon. https://github.com/grabear/metacoder

grabear commented 5 years ago

Merge Functions

New Functionality

> phy <- get_phyloseq_obj(".RDATA")
> proc_phy <- preprocess_phyloseq(phy)
> xx <- get_metacoder_obj(proc_phy)
> obj <- xx
> x1 <- obj %>% filter_taxa(n_supertaxa < 2, reassign_obs = FALSE)
...
> dplyr::right_join(x1$taxonomy_table(subset = taxon_ids(x1), add_id_col = TRUE), x1$data$diff_table, by = c("taxon_ids" = "taxon_id"))
# A tibble: 16 x 16
   taxon_ids Kingdom Phylum treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2 wilcox_p_value hartigan_dip_tr~ hartigan_dip_tr~ bimodality_coef~
   <chr>     <fct>   <fct>  <chr>       <chr>                  <dbl>           <dbl>       <dbl>     <dbl>       <dbl>       <dbl>          <dbl>            <dbl>            <dbl>            <dbl>
 1 ab        Bacter~ NA     Stressed    Control            -0.000165        -0.00122 -0.000114    -8.44e-4    0.999      1.000          0.0170               0.996           0.554             0.879
 2 ac        Archaea NA     Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
 3 ad        Bacter~ Firmi~ Stressed    Control             0.266            0.357    0.0737       1.01e-1    0.460      0.359          0.000170             0.501           0.169             0.473
 4 ae        Bacter~ Prote~ Stressed    Control            -0.901           -0.298   -0.00748     -6.14e-3    0.0267     0.0329         0.141                0.847           0.656             0.912
 5 af        Bacter~ Spiro~ Stressed    Control             0.830            0.607    0.00885      1.21e-2    0.0352     0.0231         0.135                0.222           0.961             0.646
 6 ag        Bacter~ Verru~ Stressed    Control            -2.05             0.416   -0.000824     5.13e-4    0.00205    0.00154        0.00852              0.814           0.797             0.946
 7 ah        Bacter~ Bacte~ Stressed    Control            -0.180           -0.303   -0.0660      -1.09e-1    0.468      0.577          0.00355              0.377           0.699             0.598
 8 ai        Bacter~ Fibro~ Stressed    Control             1.15             1.65     0.000466     1.65e-3    0.00242    0.000773       0.0728               0.871           0.929             0.796
 9 aj        Bacter~ Actin~ Stressed    Control            -2.67            -2.26    -0.000591    -6.42e-4    0.000169   0.000811       0.0000153            0.918           0.954             0.832
10 ak        Bacter~ Cyano~ Stressed    Control             2.10             1.53     0.000473     7.57e-4    0.00116    0.000402       0.0279               0.577           0.906             0.783
11 al        Bacter~ Elusi~ Stressed    Control             0.706           -1.71     0.000105    -2.33e-3    0.00103    0.00336        0.584                0.822           0.756             0.853
12 am        Bacter~ Lenti~ Stressed    Control            -0.420            0.450   -0.0000118    3.67e-5    0.000137   0.000100       0.868                0.890           0.604             0.891
13 an        Bacter~ Sacch~ Stressed    Control             3.60             2.74     0.000736     1.27e-3    0.00149    0.000224       0.000227             0.564           0.662             0.830
14 ao        Archaea Eurya~ Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
15 ap        Bacter~ Planc~ Stressed    Control             0                2.82     0.000178     5.50e-4    0.000641   0.0000911      0.00206              0.991           0.917             0.780
16 aq        Bacter~ Tener~ Stressed    Control            -0.0916          -0.169   -0.00000927  -5.17e-5    0.000416   0.000468       0.638                0.855           0.0624            0.804
# ... with 1 more variable: bimodality_coeff_treat2 <dbl>
sdhutchins commented 5 years ago

@sdhutchins In order to get OTU ids in the otu_annotations table, I had to make a PR on the metacoder package. It's only like 5 lines, so maybe he will accept it soon. https://github.com/grabear/metacoder

Sounds good! I'll keep an eye on that.

grabear commented 5 years ago

@sdhutchins it's merged https://github.com/grunwaldlab/metacoder/pull/253

sdhutchins commented 5 years ago

Awesome, @grabear!

grabear commented 5 years ago

This has been resolved in the package: https://github.com/vallenderlab/MicrobiomeR/issues