Closed grabear closed 5 years ago
Generally speaking, I like the Phyloseq Manipulation-> Metacoder Analysis -> Multi-Package Visualization
workflow.
I like metacoder for the anlaysis part especially. I think the way Zachary Foster has it set up to analyze data by treatment group is great. metacoder::compare_groups
also does pairwise comparisons for experiments where you have more than 2 Treatment Groups. I think that's been a concern moving forward with the MicrobiomeR package. I was trying to figure out how to reverse engineer metacoder::compare_groups
for simple dataframes, but it would be much easier if we could simply utilize what's already there.
As far as the alpha diversity plot, both metacoder and microbiome use the vegan
package for this, and after a quick look, I can see it's being done the same way in both packages except they're using different graphical ways to represent the data which is worth exploring.
As far as stacked bar charts, I'm not doing much or any stats - simply transforming the relative abundance after using the preprocess phyloseq function.
I do see that metacoder has a way to do this...
obj$data$type_abund <- calc_taxon_abund(obj, "otu_rarefied",
cols = sample_data$SampleID,
groups = sample_data$Type)
I can certainly look into adopting that.
The below is pretty much all I do stats wise with phyloseq
.
physeq <- get_phyloseq_obj(rdata_file = "data/silva_phyloseq_obj.RData")
physeq <- preprocess_phyloseq(physeq)
physeq_glommed_phylum <- tax_glom(physeq, "Phylum")
physeq_glommed_order <- tax_glom(physeq, "Order")
physeq_transformed_phylum <- transform_sample_counts(physeq_glommed_phylum, function(x) 100 * x / sum(x))
# Merge groups for first stacked bar chart
merged_groups <- merge_by_group(physeq)
normalized_merged_groups <- transform_sample_counts(merged_groups, function(x) 100 * x / sum(x))
Given that the heat trees are basically a refined and extended version of the stacked bar charts - we should definitely ensure that statistical analyses are the same.
we should definitely ensure that statistical analyses are the same.
@sdhutchins This! I just want to make sure we are using the same data for our visualizations. That way we can rule out the different packages as the source of any variations we might see in our analysis.
For the correlation plots, I'm making a custom dataframe from the phyloseq object. I'm going to look into making a custom dataframe from a metacoder object instead.
So here's what I explained earlier for our reference. I don't want to forget. @sdhutchins
> phy <- get_phyloseq_obj(".RDATA")
> proc_phy <- preprocess_phyloseq(phy)
> xx <- get_metacoder_obj(proc_phy)
> obj <- xx
> x1 <- obj %>% filter_taxa(n_supertaxa < 2, reassign_obs = FALSE)
...
> dplyr::right_join(x1$taxonomy_table(subset = taxon_ids(x1), add_id_col = TRUE), x1$data$diff_table, by = c("taxon_ids" = "taxon_id"))
# A tibble: 16 x 16
taxon_ids Kingdom Phylum treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2 wilcox_p_value hartigan_dip_tr~ hartigan_dip_tr~ bimodality_coef~
<chr> <fct> <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ab Bacter~ NA Stressed Control -0.000165 -0.00122 -0.000114 -8.44e-4 0.999 1.000 0.0170 0.996 0.554 0.879
2 ac Archaea NA Stressed Control 3.48 2.53 0.000114 8.44e-4 0.00102 0.000176 0.0200 0.996 0.554 0.879
3 ad Bacter~ Firmi~ Stressed Control 0.266 0.357 0.0737 1.01e-1 0.460 0.359 0.000170 0.501 0.169 0.473
4 ae Bacter~ Prote~ Stressed Control -0.901 -0.298 -0.00748 -6.14e-3 0.0267 0.0329 0.141 0.847 0.656 0.912
5 af Bacter~ Spiro~ Stressed Control 0.830 0.607 0.00885 1.21e-2 0.0352 0.0231 0.135 0.222 0.961 0.646
6 ag Bacter~ Verru~ Stressed Control -2.05 0.416 -0.000824 5.13e-4 0.00205 0.00154 0.00852 0.814 0.797 0.946
7 ah Bacter~ Bacte~ Stressed Control -0.180 -0.303 -0.0660 -1.09e-1 0.468 0.577 0.00355 0.377 0.699 0.598
8 ai Bacter~ Fibro~ Stressed Control 1.15 1.65 0.000466 1.65e-3 0.00242 0.000773 0.0728 0.871 0.929 0.796
9 aj Bacter~ Actin~ Stressed Control -2.67 -2.26 -0.000591 -6.42e-4 0.000169 0.000811 0.0000153 0.918 0.954 0.832
10 ak Bacter~ Cyano~ Stressed Control 2.10 1.53 0.000473 7.57e-4 0.00116 0.000402 0.0279 0.577 0.906 0.783
11 al Bacter~ Elusi~ Stressed Control 0.706 -1.71 0.000105 -2.33e-3 0.00103 0.00336 0.584 0.822 0.756 0.853
12 am Bacter~ Lenti~ Stressed Control -0.420 0.450 -0.0000118 3.67e-5 0.000137 0.000100 0.868 0.890 0.604 0.891
13 an Bacter~ Sacch~ Stressed Control 3.60 2.74 0.000736 1.27e-3 0.00149 0.000224 0.000227 0.564 0.662 0.830
14 ao Archaea Eurya~ Stressed Control 3.48 2.53 0.000114 8.44e-4 0.00102 0.000176 0.0200 0.996 0.554 0.879
15 ap Bacter~ Planc~ Stressed Control 0 2.82 0.000178 5.50e-4 0.000641 0.0000911 0.00206 0.991 0.917 0.780
16 aq Bacter~ Tener~ Stressed Control -0.0916 -0.169 -0.00000927 -5.17e-5 0.000416 0.000468 0.638 0.855 0.0624 0.804
# ... with 1 more variable: bimodality_coeff_treat2 <dbl>
I've also confirmed that the explanation given in the Issue I linked to above (https://github.com/grunwaldlab/metacoder/issues/234) works like phyloseq::tax_glom()
It might be cooler to add an agglomeration function like in phyloseq::tax_glom() as well.
Im not sure this is needed since this behavior is handled well by other functions. For filtering data, filter_taxa can remove ranks and reassign everything to the remaining ranks automatically. Also, calc_taxon_abund sums the counts for every taxon at every rank, so you can just subset the result of that to get the counts for a specific rank. However, since the info for every rank is there, subsetting the whole object to a specific rank is not really needed. I guess the equivalent of phyloseq::tax_glom() is something like:
library(metacoder)
# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
class_regex = "^(.+)__(.+)$")
# Get per-taxon counts
x$data$tax_counts <- calc_taxon_abund(x, dataset = "tax_data",
cols = hmp_samples$sample_id)
# Subset
filter_taxa(x, taxon_ranks == "p", supertaxa = TRUE, reassign_obs = c(tax_counts = FALSE))
metacoder::calc_prop_samples
is a super useful function for prevalence filtering. Here I've mimicked the phyloseq workflow of:
pp_p <- "Phylsoeq_obj"
> tf_glom <- tax_glom(pp_p, "Phylum", NArm = FALSE)
> tf_unfiltered <- genefilter_sample(tf_glom, filterfun_sample(function(x) x > 10), A = 0.6*nsamples(tf_glom))
> p_filter <- prune_taxa(tf_unfiltered, tf_glom)
# Formating for standard evaluation
> r_filter <- get_taxa_unique(p_filter, taxonomic.rank = "Phylum")
> pp_p <- subset_taxa(pp_p, Kingdom %in% r_filter)
By using metacoder and some dplyr:
# Get Metacoder object and then get per taxon counts
> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>% calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> removed_ids <- raw_m %>%
filter_taxa(taxon_ranks == "Phylum", reassign_obs = c(tax_counts = FALSE)) %>%
calc_prop_samples("tax_counts", more_than = 10) %>%
filter(n_samples < 0.6)
> new_metacoder_obj <- raw_m %>% filter_taxa(!taxon_ids %in% removed_ids$taxon_id, reassign_obs = FALSE)
(Note: These percentages (0.6) and minimum count numbers (10) are random.)
This is almost exactly the same as above, except you drop the "agglomeration" part:
> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>% calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> new_rem <- raw_m %>%
calc_prop_samples("tax_counts", more_than = 10) %>%
filter(n_samples < 0.6)
> new_metacoder_obj <- raw_m %>% filter_taxa(!taxon_ids %in% removed_ids$taxon_id, reassign_obs = FALSE)
Well that's super succint.
Here is the way I tried to do CoV filtering. I had to coerce the data into a phyloseq::otu_table
and do the filtering that way. I couldn't find a function in metacoder that allows you to apply a function across the columns.. This is a bit messy...
> raw_m <- parse_phyloseq(phy)
> raw_m$data$tax_counts <- raw_m %>%
calc_taxon_abund("otu_table", cols = raw_m$data$sample_data$sample_id)
> total <- median(as.numeric(raw_m$data$otu_table %>%
summarise_if(is.numeric, sum)))
> standf <- function(x, t = total) round(t * (x / sum(x)))
> raw_m_otu_table <- otu_table(data.matrix(column_to_rownames(raw_m$data$otu_table[2:length(raw_m$data$otu_table)], var="otu_id")), taxa_are_rows = TRUE)
> raw_m_trans <- phyloseq::transform_sample_counts(raw_m_otu_table, standf)
> f_list <- phyloseq::filter_taxa(raw_m_trans, flist = function(x) sd(x) / mean(x) > 0.9)
> remove_otu <- f_list[f_list == FALSE]
> removed_ids <- dplyr::filter(raw_m$data$otu_table, otu_id %in% names(remove_otu))
> new_metacoder_obj <- filter_taxa(raw_m, !taxon_ids %in% remove_ids$taxon_id, reassign_obs = FALSE)
This is not the proper way to do this, however it is a good step in the right direction.
> raw_m <- parse_phyloseq(phy)
> otu_data <- raw_m$data$otu_table
> otu_mean <- rowMeans(select(otu_data, raw_m$data$sample_data$sample_id))
> keep_these_data <- otu_data %>% mutate(OTU_mean = otu_mean) %>% select(taxon_id, otu_id, OTU_mean) %>% filter(OTU_mean > 0.00005)
Filtering steps like the master threshold filtering and the coeffiecing of variation filtering have to be done using the otu_proportions table in the following way:
metacoder::filter_taxa(obj, taxon_ids %in% obj$data$taxa_proportions%taxon_id)
This way the otu tables and taxa tables reflect each other (the observation data) as well as the taxmap object (the taxon data).
I've come up with a new metacoder format for the package. It's really just a new naming convention for the data tables (aka observations). I find this to be a bit more intutive. Let me know what you think! @sdhutchins
# The metacoder_obj$data will contain the following tables.
# The keys here are the old phyloseq table names, and the values are the new table names
...
otu_table: "otu_abundance"
tax_data: "otu_annotations"
sample_data: "sample_data"
phy_tree: "phy_tree"
...
# These new tables are created by metacoder's calculation functions
...
calc_taxon_abund(otu_abundance): "taxa_abundance"
calc_obs_props(otu_abundance): "otu_proportions"
calc_obs_props(otu_proportions): "taxa_proportions"
...
@sdhutchins In order to get OTU ids in the otu_annotations table, I had to make a PR on the metacoder package. It's only like 5 lines, so maybe he will accept it soon. https://github.com/grabear/metacoder
The zero_samples_filter
needs to be changed into a more generalistic function called sample_proportions_filter
(colMeans). From here it can be merged with otu_proportions_filter
(rowsMeans) and the named changed to proportions_filter
. #48
The taxa_prevalence_filter
and otu_prevalence_filter
could be combined by using some logic with rank = NULL
in the parameters. #44
formatting/validation
functions for diff tables #42> phy <- get_phyloseq_obj(".RDATA")
> proc_phy <- preprocess_phyloseq(phy)
> xx <- get_metacoder_obj(proc_phy)
> obj <- xx
> x1 <- obj %>% filter_taxa(n_supertaxa < 2, reassign_obs = FALSE)
...
> dplyr::right_join(x1$taxonomy_table(subset = taxon_ids(x1), add_id_col = TRUE), x1$data$diff_table, by = c("taxon_ids" = "taxon_id"))
# A tibble: 16 x 16
taxon_ids Kingdom Phylum treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2 wilcox_p_value hartigan_dip_tr~ hartigan_dip_tr~ bimodality_coef~
<chr> <fct> <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ab Bacter~ NA Stressed Control -0.000165 -0.00122 -0.000114 -8.44e-4 0.999 1.000 0.0170 0.996 0.554 0.879
2 ac Archaea NA Stressed Control 3.48 2.53 0.000114 8.44e-4 0.00102 0.000176 0.0200 0.996 0.554 0.879
3 ad Bacter~ Firmi~ Stressed Control 0.266 0.357 0.0737 1.01e-1 0.460 0.359 0.000170 0.501 0.169 0.473
4 ae Bacter~ Prote~ Stressed Control -0.901 -0.298 -0.00748 -6.14e-3 0.0267 0.0329 0.141 0.847 0.656 0.912
5 af Bacter~ Spiro~ Stressed Control 0.830 0.607 0.00885 1.21e-2 0.0352 0.0231 0.135 0.222 0.961 0.646
6 ag Bacter~ Verru~ Stressed Control -2.05 0.416 -0.000824 5.13e-4 0.00205 0.00154 0.00852 0.814 0.797 0.946
7 ah Bacter~ Bacte~ Stressed Control -0.180 -0.303 -0.0660 -1.09e-1 0.468 0.577 0.00355 0.377 0.699 0.598
8 ai Bacter~ Fibro~ Stressed Control 1.15 1.65 0.000466 1.65e-3 0.00242 0.000773 0.0728 0.871 0.929 0.796
9 aj Bacter~ Actin~ Stressed Control -2.67 -2.26 -0.000591 -6.42e-4 0.000169 0.000811 0.0000153 0.918 0.954 0.832
10 ak Bacter~ Cyano~ Stressed Control 2.10 1.53 0.000473 7.57e-4 0.00116 0.000402 0.0279 0.577 0.906 0.783
11 al Bacter~ Elusi~ Stressed Control 0.706 -1.71 0.000105 -2.33e-3 0.00103 0.00336 0.584 0.822 0.756 0.853
12 am Bacter~ Lenti~ Stressed Control -0.420 0.450 -0.0000118 3.67e-5 0.000137 0.000100 0.868 0.890 0.604 0.891
13 an Bacter~ Sacch~ Stressed Control 3.60 2.74 0.000736 1.27e-3 0.00149 0.000224 0.000227 0.564 0.662 0.830
14 ao Archaea Eurya~ Stressed Control 3.48 2.53 0.000114 8.44e-4 0.00102 0.000176 0.0200 0.996 0.554 0.879
15 ap Bacter~ Planc~ Stressed Control 0 2.82 0.000178 5.50e-4 0.000641 0.0000911 0.00206 0.991 0.917 0.780
16 aq Bacter~ Tener~ Stressed Control -0.0916 -0.169 -0.00000927 -5.17e-5 0.000416 0.000468 0.638 0.855 0.0624 0.804
# ... with 1 more variable: bimodality_coeff_treat2 <dbl>
met_obj$funcs
met_obj$archive$phyloseq
and met_obj$archive$metacoder
met_obj$variables
@sdhutchins In order to get OTU ids in the otu_annotations table, I had to make a PR on the metacoder package. It's only like 5 lines, so maybe he will accept it soon. https://github.com/grabear/metacoder
Sounds good! I'll keep an eye on that.
@sdhutchins it's merged https://github.com/grunwaldlab/metacoder/pull/253
Awesome, @grabear!
This has been resolved in the package: https://github.com/vallenderlab/MicrobiomeR/issues
Intro
Currently, we have different ways of generating one plot or another. If we can standardize how we generate our plots, it will make the data much more credible, and it should eliminate any variations that we see.
Additionally, there may be better or programmatically quicker ways of doing our analysis. I don't think there have been any updates to phylsoeq, however, metacoder has had a few huge updates this year, and so has ampvis2.
General Workflow
In general all of our workflows are essentially doing the same things:
Phyloseq Worflow
Phyloseq is a great package and it has tons of useful plotting functions. But it also uses the base R filtering functions, which are less sexy. I personally think phyloseq is great for the initial import of data, and some initial preprocessing.
Metacoder Workflow
I think metacoder's use of dplyr is the main reason for my preference. I really like the taxmap object as well. It seems more descriptive and maybe we can use it to our advantage someone (that would be in a later release though..). While metacoder also has really cool functions for analyzing taxmap data, it does make it a bit harder to use the data in other plots. In an issue that I made on metacoders repo, Zachary foster showed me a cool way to make a data table with the "diff table" you can generate for statistical analysis:
Ampvis2 Workflow
Ampvis2 has a ton of really cool plots and they look great! But i haven't really used it much. We need to look more into this and see if it's worth exploring. I think the ordination plots from this package will be an excellent thing to use, but I haven't compared any of the other ordination workflows.
Ggplot2 Workflow
The correlation plots are obviously custom ggplot2 visualizations, and so are some of the other ones. Luckily all of the other packages use ggplot2 for the visualziations so this section is pretty self explainatory.
Specific Workflow Preferences
This is based off of what we currently do and what I think we should work towards in general. It may not happen this way, but it's something to work towards. As long as we can standardize the way we are producing the plots (aka we are using the same data for each ggplot), then that's what matters.
Import Data into R
Preprocess Data
This includes:
Analyze Data
This includes:
Plot data with ggplot
This includes: