Open grabear opened 6 years ago
Cool, I like it! It seems like the output of compare_groups
would work well for this. A few thoughts:
filter_taxa
before plotting and %>%
the result in. There would need to be examples of this if that is the strategy.compare_groups
is used and there are more than two things being compared, how to select which comparison to plot? Should this be extended to multiple comparisons by faceting? If so, should the facets be wrapped with the titles saying what is compared, or in a matrix like heat_tree_matrix
. Perhaps there can be an as_matrix = FALSE
option, if multiple comparisons are supported. taxon_names
.n_obs
)? If the function returns a ggplot2 obejct, can this be done after by the user?It seems like the output of _comparegroups would work well for this.
@zachary-foster That's exactly what I did with the correlation plots lol. I'll type up a gist tomorrow and show you how I did it. Pretty simple stuff.
Also, I really need to go back through all of the functions again with the test data. I feel like I haven't used some of the functions or don't understand a few of them. I really explored your functions pretty hard though, so I think I have a handle on it. I just don't want to leave anything undone.
What do you think about a table containing some key statistical output from the _comparegroups function? Maybe unnecessary, but it was another thing I did for the same project:
Like a table ready to used in Rmarkdown? How would this differ from the current output of compare_groups?
Like a table ready to used in Rmarkdown?
Yes, this would be a table that's markdown ready.
How would this differ from the current output of compare_groups?
The current tibble returned by compare_groups looks a bit different. Below I've added the diff_table I created using a custom function for generating other statistics:
> silva_smb_data$metacoder_data$data$diff_table
# A tibble: 420 x 14
taxon_id treatment_1 treatment_2 log2_median_ratio log2_mean_ratio median_diff mean_diff mean_treat1
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ac Stressed Control 0.00272 0.00155 0.00186 0.00106 0.986
2 af Stressed Control 0.322 0.377 0.0850 0.102 0.442
3 ag Stressed Control -0.542 -0.228 -0.00454 -0.00455 0.0265
4 ah Stressed Control 0.804 0.622 0.00733 0.0109 0.0310
5 ai Stressed Control -2.07 0.493 -0.000742 0.000546 0.00189
6 aj Stressed Control -0.260 -0.296 -0.0955 -0.109 0.477
7 ak Stressed Control 1.07 1.64 0.000365 0.00143 0.00210
8 al Stressed Control -2.65 -2.25 -0.000880 -0.000953 0.000253
9 an Stressed Control 1.21 1.16 0.000441 0.000764 0.00138
10 ao Stressed Control 0.653 -1.63 0.0000817 -0.00190 0.000903
# ... with 410 more rows, and 6 more variables: mean_treat2 <dbl>, wilcox_p_value <dbl>,
# hartigan_dip_treat1 <dbl>, hartigan_dip_treat2 <dbl>, bimodality_coeff_treat1 <dbl>,
# bimodality_coeff_treat2 <dbl>
The difference is that there are only select variables (columns) in the publication ready table, and the data is agglomerated by a specific rank (Phylum in the example). Other subtaxa columns could be added if desired. While this could be useful, the user could also do this themselves, and I might have missed an important concept in one of the other metacoder/taxmap functions.
It might be cooler to add an agglomeration function like in phyloseq::tax_glom() as well.
In my project I've done this with phyloseq. I'm not always a super efficient R programmer so some of it might be redundant or coded better with functions like lapply, cbind, etc.:
phylum_phy <- phyloseq::tax_glom(phyloseq_object, taxrank=rank)
# Get data frames
df_tax <- data.frame(tax_table(phylum_phy))
df_otu <- data.frame(otu_table(phylum_phy))
df_sam <- data.frame(sample_data(phylum_phy))
# Get various variables for filtering
otu_names <- rownames(df_tax)
stressed_samples <- rownames(df_sam[df_sam$TreatmentGroup == "Stressed",])
control_samples <- rownames(df_sam[df_sam$TreatmentGroup == "Control",])
# Add a OTU column
df_tax <- mutate(df_tax, OTU=rownames(df_tax))
df_otu <- mutate(df_otu, OTU=rownames(df_otu))
# Join the tax table and otu table (phyloseq)
df_tax_otu <- left_join(x=df_otu, y=df_tax, by="OTU")
rownames(df_tax_otu) <- df_tax_otu$OTU
temp <- df_tax_otu[,!(colnames(df_tax_otu) %in% c("OTU"))][c(stressed_samples, control_samples)]
# Get statistical vectors for the main table
wilcox_p_value <- apply(as.matrix(temp), 1, function(x) wilcox.test(x[stressed_samples], x[control_samples])$p.value)
mean_stressed <- apply(as.matrix(temp), 1, function(x) mean(x[stressed_samples]))
mean_control <- apply(as.matrix(temp), 1, function(x) mean(x[control_samples]))
log2_mean_ratio <- apply(as.matrix(temp), 1, function(x) { log_ratio <- log2(mean(x[stressed_samples]) / mean(x[control_samples]))
if (is.nan(log_ratio)) {
log_ratio <- 0
}
if (is.infinite(log_ratio)) {
log_ratio <- 0
}
log_ratio
})
stats <- data.frame(wilcox_p_value=wilcox_p_value, mean_stressed = mean_stressed, mean_control = mean_control, log2_mean_ratio = log2_mean_ratio)
stats$OTU <- names(wilcox_p_value)
df_tax_otu <- left_join(df_tax_otu, stats, by="OTU")
# Curate the main data table
df_tax_otu <- mutate(df_tax_otu, Significance=wilcox_p_value < 0.05)
rownames(df_tax_otu) <- df_tax_otu$OTU
df_tax_otu$OTU <- NULL
main_df <- df_tax_otu[49:62][,c(rank, "mean_stressed", "mean_control", "wilcox_p_value", "log2_mean_ratio", "Significance", unlist(ranks[ranks != rank]))]
main_df <- main_df %>% select_if(~sum(!is.na(.)) > 0)
The metacoder workflow would conceptually look something like this (don't focus on the data types or anything. this is just a rough draft):
...
metacoder_obj <- metacoder::agglomerate(metacoder_obj, rank="Phylum")
# Calc functions before or after agglomeration??
# Comp group??
# Merge the diff table and tax table, filter by p-value, and keep specific columns
metacoder_obj$data$phylum_table <- metacoder::merge_and_filter(tables=c(diff_table, tax_table), filter = wilcox_p_value <= 0.05,
columns=c(Phylum, Class, Order, wilcox_p_value, log2_mean_ratio, mean_stressed, mean_control))
Ok, I think I get it now. Yea, I think this could be useful! It could be abstracted to a function that combines per-taxon info from multiple tables (anything in obj$all_names()
or any external variable named by taxon ids) into a single table. It would error if told to use data with multiple values per taxon. There could be a TRUE/FALSE option to return a markdown-ready table using knitr::kable
. This was kindof the goal of taxa::get_data_frame
, but it is not very developed/useful yet.
It might be cooler to add an agglomeration function like in phyloseq::tax_glom() as well.
Im not sure this is needed since this behavior is handled well by other functions. For filtering data, filter_taxa
can remove ranks and reassign everything to the remaining ranks automatically. Also, calc_taxon_abund
sums the counts for every taxon at every rank, so you can just subset the result of that to get the counts for a specific rank. However, since the info for every rank is there, subsetting the whole object to a specific rank is not really needed. I guess the equivalent of phyloseq::tax_glom()
is something like:
library(metacoder)
# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
class_regex = "^(.+)__(.+)$")
# Get per-taxon counts
x$data$tax_counts <- calc_taxon_abund(x, dataset = "tax_data",
cols = hmp_samples$sample_id)
# Subset
filter_taxa(x, taxon_ranks == "p", supertaxa = TRUE, reassign_obs = c(tax_counts = FALSE))
<Taxmap>
8 taxa: ab. Root, ac. Proteobacteria, ad. Bacteroidetes ... ag. Fusobacteria, ah. Tenericutes, ai. Spirochaetes
8 edges: NA->ab, ab->ac, ab->ad, ab->ae, ab->af, ab->ag, ab->ah, ab->ai
3 data sets:
tax_data:
# A tibble: 1,000 x 53
taxon_id otu_id lineage `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044`
<chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
1 ac OTU_97… r__Root;p… 0 2 1 0 0 0 0
2 ad OTU_97… r__Root;p… 0 0 0 0 0 0 0
3 ad OTU_97… r__Root;p… 0 1 0 0 0 0 0
# ... with 997 more rows, and 43 more variables: `700101365` <int>, `700100431` <int>, `700016050` <int>,
# `700032425` <int>, `700024855` <int>, `700103488` <int>, `700096869` <int>, `700107379` <int>,
# `700096422` <int>, `700102417` <int>, …
class_data:
# A tibble: 5,922 x 5
taxon_id input_index tax_rank tax_name regex_match
<chr> <int> <chr> <chr> <chr>
1 ab 1 r Root r__Root
2 ac 1 p Proteobacteria p__Proteobacteria
3 ac 1 c Gammaproteobacteria c__Gammaproteobacteria
# ... with 5,919 more rows
tax_counts:
# A tibble: 8 x 51
taxon_id `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044` `700101365`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ab 955. 4330. 887. 598. 6303. 3379. 1219. 829.
2 ac 193. 144. 17. 132. 1388. 30. 0. 21.
3 ad 5. 43. 6. 21. 20. 15. 7. 5.
# ... with 5 more rows, and 42 more variables: `700100431` <dbl>, `700016050` <dbl>, `700032425` <dbl>,
# `700024855` <dbl>, `700103488` <dbl>, `700096869` <dbl>, `700107379` <dbl>, `700096422` <dbl>,
# `700102417` <dbl>, `700114168` <dbl>, …
0 functions:
Phyloseq workflow
Thanks for the phyoseq workflow! Its nice to be able to compare the methods.
Metacoder workflow
It can be done something like this:
library(phyloseq)
library(metacoder)
library(dplyr)
# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
class_regex = "^(.+)__(.+)$")
# Convert counts to proportions
x$data$otu_table <- calc_obs_props(x, dataset = "tax_data",
cols = hmp_samples$sample_id)
# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, dataset = "otu_table",
cols = hmp_samples$sample_id)
# Get per-taxon body site mean proportions
x$data$site_means <- calc_group_mean(x, dataset = "tax_table",
cols = hmp_samples$sample_id,
groups = hmp_samples$body_site)
# Calculate difference between groups
x$data$diff_table <- compare_groups(x, dataset = "tax_table",
cols = hmp_samples$sample_id,
groups = hmp_samples$body_site)
# Make table of most different classes between nose an saliva
filter_taxa(x, taxon_ranks == "c", reassign_obs = FALSE) %>%
mutate_obs("diff_table", taxon_name = taxon_names, mean_nose = Nose, mean_saliva = Saliva) %>%
get_dataset("diff_table") %>% # New function I just added to `taxa`
filter(wilcox_p_value <= 0.05, treatment_1 == "Nose", treatment_2 == "Saliva") %>%
arrange(mean_diff) %>%
transmute(Class = taxon_name, "Mean nose proportion" = mean_nose, "Mean saliva proportion" = mean_saliva, "Mean difference" = mean_diff, "P value" = wilcox_p_value) %>%
knitr::kable()
Class | Mean nose proportion | Mean saliva proportion | Mean difference | P value |
---|---|---|---|---|
Bacteroidia | 0.0062143 | 0.2488121 | -0.2425979 | 0.0000108 |
Clostridia | 0.0058709 | 0.2016887 | -0.1958178 | 0.0000108 |
Gammaproteobacteria | 0.0365834 | 0.1138908 | -0.0773074 | 0.0051960 |
Betaproteobacteria | 0.0312206 | 0.0900281 | -0.0588075 | 0.0112070 |
Fusobacteria | 0.0004220 | 0.0452540 | -0.0448320 | 0.0001494 |
Flavobacteria | 0.0046150 | 0.0242023 | -0.0195873 | 0.0012604 |
Epsilonproteobacteria | 0.0175916 | 0.0089710 | 0.0086206 | 0.0116113 |
Actinobacteria | 0.6039006 | 0.0229178 | 0.5809828 | 0.0000108 |
In this example you gave me:
I guess the equivalent of
phyloseq::tax_glom()
is something like:library(metacoder) # Parse dataset x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";", class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"), class_regex = "^(.+)__(.+)$") # Get per-taxon counts x$data$tax_counts <- calc_taxon_abund(x, dataset = "tax_data", cols = hmp_samples$sample_id) # Subset filter_taxa(x, taxon_ranks == "p", supertaxa = TRUE, reassign_obs = c(tax_counts = FALSE))
Does reassign_obs need to be false
? I would think that it would need to be true
so that all of the indexes would change with the data.
This issue is also old so things may have changed between now and then.
This issue is also old so things may have changed between now and then.
I think filter_taxa
behaves mostly the same, at least in this regard.
Does reassign_obs need to be false? I would think that it would need to be true so that all of the indexes would change with the data.
Was the goal similar to #255 ?
If it was false, "tax_counts" would not be subset, just reassigned to the remaining 8 taxa. For example the row with info for Clostridia would be reassinged to Firmicutes, even though a row for Firmicutes already exists, so now you have two rows assigned to Firmicutes, one for Firmicutes and one for Clostridia, which is confusing.
One source of confusion for a lot of people is that this is using filter_taxa
for its side effect (filtering a table), rather than its main goal of filtering the taxonomy.
Perhaps I need to think of a way to make filter_obs
use per-taxon info to filter, so something like this is possible:
filter_obs(x, "tax_counts", taxon_ranks == "p")
That way just the table is filtered, not the taxonomy. It actually already works in this case probably, but only because the rows of "tax_counts" happen to correspond to taxa 1 to 1. It would have to work in other cases too, which gets more tricky. Also, if I do:
filter_obs(x, "tax_counts", taxon_names == "Actinobacteria")
I probably want the subtaxa as well, but the "subtaxa" option in filter_obs
relates to how taxa are removed from the taxonomy if drop_taxa
is TRUE
, so the following would not work currently:
filter_obs(x, "tax_counts", taxon_names == "Actinobacteria", subtaxa = TRUE)
Perhaps I could rename that option "drop_subtaxa" and make the above work.
Just brainstorming, not sure if that is relevant to your question.
Did I answer your question?
If it was false, "tax_counts" would not be subset, just reassigned to the remaining 8 taxa.
Yes, this answers my question and more!
Just brainstorming, not sure if that is relevant to your question.
I have some more ideas for you to brew over, but I'll make another issue (https://github.com/grunwaldlab/metacoder/issues/254) as I did recently :rofl:
By correlation plots (#229) I just mean a scatter plot comparing two variables. Below I've given a use case example from one of our labs projects vallenderlab also datasnakes.
Vallender Lab Microbiome Project
Interactive Plotly Example
Static plot: