grabear commented 5 years ago

Intro

Currently, we have different ways of generating one plot or another. If we can standardize how we generate our plots, it will make the data much more credible, and it should eliminate any variations that we see.

Additionally, there may be better or programmatically quicker ways of doing our analysis. I don't think there have been any updates to phylsoeq, however, metacoder has had a few huge updates this year, and so has ampvis2.

General Workflow

In general all of our workflows are essentially doing the same things:

Import Data into R
Preprocess Data
- Remove unnecessary data
- Filter data
  - taxon filter
  - prevalence filter
  - coefficient of variation
  - agglomeration by rank
- transform data
Analyze Data
- Treatment 1 vs Treatment 2
  - log2_median_ratio
  - log2_mean_ratio
  - median_diff
  - mean_diff
  - wilcox_p_value
- mean_treat1
- mean_treat2
- hartigan_dip_treat1
- hartigan_dip_treat2
Plot data with ggplot
- heat tree
- stacked bar
- correlation
- violin
- ordination (new)
- scatter (old)

Phyloseq Worflow

Phyloseq is a great package and it has tons of useful plotting functions. But it also uses the base R filtering functions, which are less sexy. I personally think phyloseq is great for the initial import of data, and some initial preprocessing.

Metacoder Workflow

I think metacoder's use of dplyr is the main reason for my preference. I really like the taxmap object as well. It seems more descriptive and maybe we can use it to our advantage someone (that would be in a later release though..). While metacoder also has really cool functions for analyzing taxmap data, it does make it a bit harder to use the data in other plots. In an issue that I made on metacoders repo, Zachary foster showed me a cool way to make a data table with the "diff table" you can generate for statistical analysis:

library(phyloseq)
library(metacoder)
library(dplyr)

# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Convert counts to proportions
x$data$otu_table <- calc_obs_props(x, dataset = "tax_data",
                                   cols = hmp_samples$sample_id)

# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, dataset = "otu_table",
                                     cols = hmp_samples$sample_id)

# Get per-taxon body site mean proportions 
x$data$site_means <- calc_group_mean(x, dataset = "tax_table",
                                     cols = hmp_samples$sample_id,
                                     groups = hmp_samples$body_site)

# Calculate difference between groups
x$data$diff_table <- compare_groups(x, dataset = "tax_table",
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$body_site)

# Make table of most different classes between nose an saliva
filter_taxa(x, taxon_ranks == "c", reassign_obs = FALSE) %>%
  mutate_obs("diff_table", taxon_name = taxon_names, mean_nose = Nose, mean_saliva = Saliva) %>%
  get_dataset("diff_table") %>% # New function I just added to `taxa`
  filter(wilcox_p_value <= 0.05, treatment_1 == "Nose", treatment_2 == "Saliva") %>%
  arrange(mean_diff) %>%
  transmute(Class = taxon_name, "Mean nose proportion" = mean_nose,  "Mean saliva proportion" = mean_saliva,  "Mean difference" = mean_diff, "P value" = wilcox_p_value) %>%
  knitr::kable()

Class	Mean nose proportion	Mean saliva proportion	Mean difference	P value
Bacteroidia	0.0062143	0.2488121	-0.2425979	0.0000108
Clostridia	0.0058709	0.2016887	-0.1958178	0.0000108
Gammaproteobacteria	0.0365834	0.1138908	-0.0773074	0.0051960
Betaproteobacteria	0.0312206	0.0900281	-0.0588075	0.0112070
Fusobacteria	0.0004220	0.0452540	-0.0448320	0.0001494
Flavobacteria	0.0046150	0.0242023	-0.0195873	0.0012604
Epsilonproteobacteria	0.0175916	0.0089710	0.0086206	0.0116113
Actinobacteria	0.6039006	0.0229178	0.5809828	0.0000108

Ampvis2 Workflow

Ampvis2 has a ton of really cool plots and they look great! But i haven't really used it much. We need to look more into this and see if it's worth exploring. I think the ordination plots from this package will be an excellent thing to use, but I haven't compared any of the other ordination workflows.

Ggplot2 Workflow

The correlation plots are obviously custom ggplot2 visualizations, and so are some of the other ones. Luckily all of the other packages use ggplot2 for the visualziations so this section is pretty self explainatory.

Specific Workflow Preferences

This is based off of what we currently do and what I think we should work towards in general. It may not happen this way, but it's something to work towards. As long as we can standardize the way we are producing the plots (aka we are using the same data for each ggplot), then that's what matters.

Import Data into R

[x] Phyloseq

Preprocess Data

[X] Phyloseq (phyloseq is ONLY being used for importing data)
[x] Metacoder

This includes:

Remove unnecessary data
Filter data
- taxon filter
- prevalence filter
- coefficient of variation
- agglomeration by rank
transform data

Analyze Data

[x] Metacoder

This includes:

Treatment 1 vs Treatment 2
- log2_median_ratio
- log2_mean_ratio
- median_diff
- mean_diff
- wilcox_p_value
mean_treat1
mean_treat2
hartigan_dip_treat1
hartigan_dip_treat2

Plot data with ggplot

[x] Metacoder
[x] Microbiome pkg
[x] Ggplot2
[X] Ampvis2 (Not currently using ampvis2)

This includes:

heat tree
stacked bar
correlation
violin
ordination (new)
scatter (old)

grabear commented 5 years ago

Generally speaking, I like the Phyloseq Manipulation-> Metacoder Analysis -> Multi-Package Visualization workflow.

grabear commented 5 years ago

I like metacoder for the anlaysis part especially. I think the way Zachary Foster has it set up to analyze data by treatment group is great. metacoder::compare_groups also does pairwise comparisons for experiments where you have more than 2 Treatment Groups. I think that's been a concern moving forward with the MicrobiomeR package. I was trying to figure out how to reverse engineer metacoder::compare_groups for simple dataframes, but it would be much easier if we could simply utilize what's already there.

sdhutchins commented 5 years ago

As far as the alpha diversity plot, both metacoder and microbiome use the vegan package for this, and after a quick look, I can see it's being done the same way in both packages except they're using different graphical ways to represent the data which is worth exploring.

As far as stacked bar charts, I'm not doing much or any stats - simply transforming the relative abundance after using the preprocess phyloseq function.

I do see that metacoder has a way to do this...

obj$data$type_abund <- calc_taxon_abund(obj, "otu_rarefied",
                                        cols = sample_data$SampleID,
                                        groups = sample_data$Type)

I can certainly look into adopting that.

The below is pretty much all I do stats wise with phyloseq.

physeq <- get_phyloseq_obj(rdata_file = "data/silva_phyloseq_obj.RData")
physeq <- preprocess_phyloseq(physeq)

physeq_glommed_phylum <- tax_glom(physeq, "Phylum")
physeq_glommed_order <- tax_glom(physeq, "Order")
physeq_transformed_phylum <- transform_sample_counts(physeq_glommed_phylum, function(x) 100 * x / sum(x))

# Merge groups for first stacked bar chart
merged_groups <- merge_by_group(physeq)

normalized_merged_groups <- transform_sample_counts(merged_groups, function(x) 100 * x / sum(x))

Given that the heat trees are basically a refined and extended version of the stacked bar charts - we should definitely ensure that statistical analyses are the same.

grabear commented 5 years ago

we should definitely ensure that statistical analyses are the same.

@sdhutchins This! I just want to make sure we are using the same data for our visualizations. That way we can rule out the different packages as the source of any variations we might see in our analysis.

For the correlation plots, I'm making a custom dataframe from the phyloseq object. I'm going to look into making a custom dataframe from a metacoder object instead.

grabear commented 5 years ago

Joining Metacoder Observation Data with Taxonomy Data by taxon_ids

I forget

So here's what I explained earlier for our reference. I don't want to forget. @sdhutchins

> phy <- get_phyloseq_obj(".RDATA")
> proc_phy <- preprocess_phyloseq(phy)
> xx <- get_metacoder_obj(proc_phy)
> obj <- xx
> x1 <- obj %>% filter_taxa(n_supertaxa < 2, reassign_obs = FALSE)
...
> dplyr::right_join(x1$taxonomy_table(subset = taxon_ids(x1), add_id_col = TRUE), x1$data$diff_table, by = c("taxon_ids" = "taxon_id"))
# A tibble: 16 x 16
   taxon_ids Kingdom Phylum treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2 wilcox_p_value hartigan_dip_tr~ hartigan_dip_tr~ bimodality_coef~
   <chr>     <fct>   <fct>  <chr>       <chr>                  <dbl>           <dbl>       <dbl>     <dbl>       <dbl>       <dbl>          <dbl>            <dbl>            <dbl>            <dbl>
 1 ab        Bacter~ NA     Stressed    Control            -0.000165        -0.00122 -0.000114    -8.44e-4    0.999      1.000          0.0170               0.996           0.554             0.879
 2 ac        Archaea NA     Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
 3 ad        Bacter~ Firmi~ Stressed    Control             0.266            0.357    0.0737       1.01e-1    0.460      0.359          0.000170             0.501           0.169             0.473
 4 ae        Bacter~ Prote~ Stressed    Control            -0.901           -0.298   -0.00748     -6.14e-3    0.0267     0.0329         0.141                0.847           0.656             0.912
 5 af        Bacter~ Spiro~ Stressed    Control             0.830            0.607    0.00885      1.21e-2    0.0352     0.0231         0.135                0.222           0.961             0.646
 6 ag        Bacter~ Verru~ Stressed    Control            -2.05             0.416   -0.000824     5.13e-4    0.00205    0.00154        0.00852              0.814           0.797             0.946
 7 ah        Bacter~ Bacte~ Stressed    Control            -0.180           -0.303   -0.0660      -1.09e-1    0.468      0.577          0.00355              0.377           0.699             0.598
 8 ai        Bacter~ Fibro~ Stressed    Control             1.15             1.65     0.000466     1.65e-3    0.00242    0.000773       0.0728               0.871           0.929             0.796
 9 aj        Bacter~ Actin~ Stressed    Control            -2.67            -2.26    -0.000591    -6.42e-4    0.000169   0.000811       0.0000153            0.918           0.954             0.832
10 ak        Bacter~ Cyano~ Stressed    Control             2.10             1.53     0.000473     7.57e-4    0.00116    0.000402       0.0279               0.577           0.906             0.783
11 al        Bacter~ Elusi~ Stressed    Control             0.706           -1.71     0.000105    -2.33e-3    0.00103    0.00336        0.584                0.822           0.756             0.853
12 am        Bacter~ Lenti~ Stressed    Control            -0.420            0.450   -0.0000118    3.67e-5    0.000137   0.000100       0.868                0.890           0.604             0.891
13 an        Bacter~ Sacch~ Stressed    Control             3.60             2.74     0.000736     1.27e-3    0.00149    0.000224       0.000227             0.564           0.662             0.830
14 ao        Archaea Eurya~ Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
15 ap        Bacter~ Planc~ Stressed    Control             0                2.82     0.000178     5.50e-4    0.000641   0.0000911      0.00206              0.991           0.917             0.780
16 aq        Bacter~ Tener~ Stressed    Control            -0.0916          -0.169   -0.00000927  -5.17e-5    0.000416   0.000468       0.638                0.855           0.0624            0.804
# ... with 1 more variable: bimodality_coeff_treat2 <dbl>

grabear commented 5 years ago

Metacoder Style Agglomeration Example

I've also confirmed that the explanation given in the Issue I linked to above (https://github.com/grunwaldlab/metacoder/issues/234) works like phyloseq::tax_glom()

It might be cooler to add an agglomeration function like in phyloseq::tax_glom() as well.

Im not sure this is needed since this behavior is handled well by other functions. For filtering data, filter_taxa can remove ranks and reassign everything to the remaining ranks automatically. Also, calc_taxon_abund sums the counts for every taxon at every rank, so you can just subset the result of that to get the counts for a specific rank. However, since the info for every rank is there, subsetting the whole object to a specific rank is not really needed. I guess the equivalent of phyloseq::tax_glom() is something like:

library(metacoder)

# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Get per-taxon counts
x$data$tax_counts <- calc_taxon_abund(x, dataset = "tax_data",
                                      cols = hmp_samples$sample_id)

# Subset 
filter_taxa(x, taxon_ranks == "p", supertaxa = TRUE, reassign_obs = c(tax_counts = FALSE))

grabear commented 5 years ago

Metacoder Style Taxonomic Prevalence Filtering

metacoder::calc_prop_samples is a super useful function for prevalence filtering. Here I've mimicked the phyloseq workflow of:

pp_p <- "Phylsoeq_obj"
> tf_glom <- tax_glom(pp_p, "Phylum", NArm = FALSE)
> tf_unfiltered <- genefilter_sample(tf_glom, filterfun_sample(function(x) x > 10), A = 0.6*nsamples(tf_glom)) 
> p_filter <- prune_taxa(tf_unfiltered, tf_glom)
# Formating for standard evaluation
> r_filter <- get_taxa_unique(p_filter, taxonomic.rank = "Phylum")
> pp_p <- subset_taxa(pp_p, Kingdom %in% r_filter)

By using metacoder and some dplyr:

# Get Metacoder object and then get per taxon counts
> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>% calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> removed_ids <- raw_m %>% 
filter_taxa(taxon_ranks == "Phylum", reassign_obs = c(tax_counts = FALSE)) %>% 
calc_prop_samples("tax_counts", more_than = 10) %>% 
filter(n_samples < 0.6)
> new_metacoder_obj <- raw_m %>% filter_taxa(!taxon_ids %in% removed_ids$taxon_id, reassign_obs = FALSE)

(Note: These percentages (0.6) and minimum count numbers (10) are random.)

grabear commented 5 years ago

Metacoder Style OTU Prevalence Filtering

This is almost exactly the same as above, except you drop the "agglomeration" part:

> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>% calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> new_rem <- raw_m  %>% 
    calc_prop_samples("tax_counts", more_than = 10) %>% 
    filter(n_samples < 0.6)
> new_metacoder_obj <- raw_m %>% filter_taxa(!taxon_ids %in% removed_ids$taxon_id, reassign_obs = FALSE)

sdhutchins commented 5 years ago

Well that's super succint.

grabear commented 5 years ago

Metacoder Style Filtering by Coefficient of Variation with Help from Phyloseq

Here is the way I tried to do CoV filtering. I had to coerce the data into a phyloseq::otu_table and do the filtering that way. I couldn't find a function in metacoder that allows you to apply a function across the columns.. This is a bit messy...

> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>% 
calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> total <- median(as.numeric(raw_m$data$otu_table %>%
 summarise_if(is.numeric, sum)))
> standf <- function(x, t = total) round(t * (x / sum(x)))
> raw_m_otu_table <- otu_table(data.matrix(column_to_rownames(raw_m$data$otu_table[2:length(raw_m$data$otu_table)], var="otu_id")), taxa_are_rows = TRUE)
> raw_m_trans <- phyloseq::transform_sample_counts(raw_m_otu_table, standf)
> f_list <- phyloseq::filter_taxa(raw_m_trans, flist = function(x) sd(x) / mean(x) > 0.9)
> remove_otu <- f_list[f_list == FALSE]
> removed_ids <- dplyr::filter(raw_m$data$otu_table, otu_id %in% names(remove_otu))
> new_metacoder_obj <- filter_taxa(raw_m, !taxon_ids %in% remove_ids$taxon_id, reassign_obs = FALSE)

grabear commented 5 years ago

Metacoder Style Master Threshold filtering

This is not the proper way to do this, however it is a good step in the right direction.

> raw_m <- parse_phyloseq(phy)
> otu_data <- raw_m$data$otu_table
> otu_mean <- rowMeans(select(otu_data, raw_m$data$sample_data$sample_id))
> keep_these_data <- otu_data %>% mutate(OTU_mean = otu_mean) %>% select(taxon_id, otu_id, OTU_mean) %>% filter(OTU_mean > 0.00005)

Filtering steps like the master threshold filtering and the coeffiecing of variation filtering have to be done using the otu_proportions table in the following way:

Convert otu_abundance to otu_proportions
Get list of OTU ids to filter
Filter Ids from the otu_abundance and otu_proportions
Create taxa_abundance and taxa_proporitons
use the metacoder::filter_taxa(obj, taxon_ids %in% obj$data$taxa_proportions%taxon_id)

This way the otu tables and taxa tables reflect each other (the observation data) as well as the taxmap object (the taxon data).

grabear commented 5 years ago

New Metacoder Format for the Package

I've come up with a new metacoder format for the package. It's really just a new naming convention for the data tables (aka observations). I find this to be a bit more intutive. Let me know what you think! @sdhutchins

# The metacoder_obj$data will contain the following tables.
# The keys here are the old phyloseq table names, and the values are the new table names
...
otu_table: "otu_abundance"
tax_data: "otu_annotations"
sample_data: "sample_data"
phy_tree: "phy_tree"
...
# These new tables are created by metacoder's calculation functions
...
calc_taxon_abund(otu_abundance): "taxa_abundance"
calc_obs_props(otu_abundance): "otu_proportions"
calc_obs_props(otu_proportions): "taxa_proportions"
...

grabear commented 5 years ago

@sdhutchins In order to get OTU ids in the otu_annotations table, I had to make a PR on the metacoder package. It's only like 5 lines, so maybe he will accept it soon. https://github.com/grabear/metacoder

grabear commented 5 years ago

Merge Functions

The zero_samples_filter needs to be changed into a more generalistic function called sample_proportions_filter (colMeans). From here it can be merged with otu_proportions_filter (rowsMeans) and the named changed to proportions_filter. #48
The taxa_prevalence_filter and otu_prevalence_filter could be combined by using some logic with rank = NULL in the parameters. #44

New Functionality

Add a function for creating the diff tables joined with taxa_annotations.
- This will also require updates to the formatting/validation functions for diff tables #42

> phy <- get_phyloseq_obj(".RDATA")
> proc_phy <- preprocess_phyloseq(phy)
> xx <- get_metacoder_obj(proc_phy)
> obj <- xx
> x1 <- obj %>% filter_taxa(n_supertaxa < 2, reassign_obs = FALSE)
...
> dplyr::right_join(x1$taxonomy_table(subset = taxon_ids(x1), add_id_col = TRUE), x1$data$diff_table, by = c("taxon_ids" = "taxon_id"))
# A tibble: 16 x 16
   taxon_ids Kingdom Phylum treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2 wilcox_p_value hartigan_dip_tr~ hartigan_dip_tr~ bimodality_coef~
   <chr>     <fct>   <fct>  <chr>       <chr>                  <dbl>           <dbl>       <dbl>     <dbl>       <dbl>       <dbl>          <dbl>            <dbl>            <dbl>            <dbl>
 1 ab        Bacter~ NA     Stressed    Control            -0.000165        -0.00122 -0.000114    -8.44e-4    0.999      1.000          0.0170               0.996           0.554             0.879
 2 ac        Archaea NA     Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
 3 ad        Bacter~ Firmi~ Stressed    Control             0.266            0.357    0.0737       1.01e-1    0.460      0.359          0.000170             0.501           0.169             0.473
 4 ae        Bacter~ Prote~ Stressed    Control            -0.901           -0.298   -0.00748     -6.14e-3    0.0267     0.0329         0.141                0.847           0.656             0.912
 5 af        Bacter~ Spiro~ Stressed    Control             0.830            0.607    0.00885      1.21e-2    0.0352     0.0231         0.135                0.222           0.961             0.646
 6 ag        Bacter~ Verru~ Stressed    Control            -2.05             0.416   -0.000824     5.13e-4    0.00205    0.00154        0.00852              0.814           0.797             0.946
 7 ah        Bacter~ Bacte~ Stressed    Control            -0.180           -0.303   -0.0660      -1.09e-1    0.468      0.577          0.00355              0.377           0.699             0.598
 8 ai        Bacter~ Fibro~ Stressed    Control             1.15             1.65     0.000466     1.65e-3    0.00242    0.000773       0.0728               0.871           0.929             0.796
 9 aj        Bacter~ Actin~ Stressed    Control            -2.67            -2.26    -0.000591    -6.42e-4    0.000169   0.000811       0.0000153            0.918           0.954             0.832
10 ak        Bacter~ Cyano~ Stressed    Control             2.10             1.53     0.000473     7.57e-4    0.00116    0.000402       0.0279               0.577           0.906             0.783
11 al        Bacter~ Elusi~ Stressed    Control             0.706           -1.71     0.000105    -2.33e-3    0.00103    0.00336        0.584                0.822           0.756             0.853
12 am        Bacter~ Lenti~ Stressed    Control            -0.420            0.450   -0.0000118    3.67e-5    0.000137   0.000100       0.868                0.890           0.604             0.891
13 an        Bacter~ Sacch~ Stressed    Control             3.60             2.74     0.000736     1.27e-3    0.00149    0.000224       0.000227             0.564           0.662             0.830
14 ao        Archaea Eurya~ Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
15 ap        Bacter~ Planc~ Stressed    Control             0                2.82     0.000178     5.50e-4    0.000641   0.0000911      0.00206              0.991           0.917             0.780
16 aq        Bacter~ Tener~ Stressed    Control            -0.0916          -0.169   -0.00000927  -5.17e-5    0.000416   0.000468       0.638                0.855           0.0624            0.804
# ... with 1 more variable: bimodality_coeff_treat2 <dbl>

Update the metacoder object to contain various functions we need in met_obj$funcs
Update the metacoder object to contain the original data in met_obj$archive$phyloseq and met_obj$archive$metacoder
Update the metacoder object to contain variables that are needed in met_obj$variables

sdhutchins commented 5 years ago

@sdhutchins In order to get OTU ids in the otu_annotations table, I had to make a PR on the metacoder package. It's only like 5 lines, so maybe he will accept it soon. https://github.com/grabear/metacoder

Sounds good! I'll keep an eye on that.

grabear commented 5 years ago

@sdhutchins it's merged https://github.com/grunwaldlab/metacoder/pull/253

sdhutchins commented 5 years ago

Awesome, @grabear!

grabear commented 5 years ago

This has been resolved in the package: https://github.com/vallenderlab/MicrobiomeR/issues

vallenderlab / MicrobiomeR

Discuss standard workflow paradigm (phyloseq/metacoder/visualization/etc). #42

Intro

General Workflow

Phyloseq Worflow

Metacoder Workflow

Ampvis2 Workflow

Ggplot2 Workflow

Specific Workflow Preferences

Import Data into R

Preprocess Data

Analyze Data

Plot data with ggplot

Joining Metacoder Observation Data with Taxonomy Data by taxon_ids

Metacoder Style Agglomeration Example

Metacoder Style Taxonomic Prevalence Filtering

Metacoder Style OTU Prevalence Filtering

Metacoder Style Filtering by Coefficient of Variation with Help from Phyloseq

Metacoder Style Master Threshold filtering

New Metacoder Format for the Package

Merge Functions

New Functionality