Open M-Ourry opened 5 years ago
Hi @M-Ourry,
Thanks, I am glad it was helpful!
I was wondering if it was possible to use presence/absence data instead of proportions in order to perform a generalized linear model (glm) using the binomial family (instead of the Wilcoxon test).
I have not done that, but you can plot anything if you can get the results into the right format. In this case, you need per-taxon values of whatever you are plotting in a table with a taxon_id column.
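As a rough illustration of that table format, here is a minimal base-R sketch. The table name `my_stats`, the columns `prop_present` and `p_value`, and all of the values are made up for illustration; only the requirement of a `taxon_id` column comes from the reply above.

```r
# Hypothetical per-taxon results. Only the `taxon_id` column is required by
# the description above; every other name and value here is invented.
my_stats <- data.frame(
  taxon_id = c("dm", "dn", "do"),
  prop_present = c(0.8, 0.1, 0.4),
  p_value = c(0.01, 0.60, 0.20),
  stringsAsFactors = FALSE
)

# One row per taxon, keyed by taxon_id
my_stats
```

The examples later in this thread store tables of this kind in the parsed object (for example `x$data$diff_table`), which is what makes their columns available when plotting.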
How do I obtain presence/absence data? Should the transformation occur before or after the parse function?
You can do it with counts_to_presence after reading in your data:
library(metacoder)
#> Loading required package: taxa
#> This is metacoder verison 0.3.2.9001 (development version)
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")
# Convert count to presence/absence
counts_to_presence(x, "tax_data")
#> # A tibble: 1,000 x 51
#> taxon_id `700035949` `700097855` `700100489` `700111314` `700033744`
#> <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 dm FALSE TRUE TRUE FALSE FALSE
#> 2 dn FALSE FALSE FALSE FALSE FALSE
#> 3 do FALSE TRUE FALSE FALSE FALSE
#> 4 dp TRUE TRUE TRUE TRUE TRUE
#> 5 dq TRUE TRUE FALSE FALSE FALSE
#> 6 dp TRUE TRUE TRUE TRUE TRUE
#> 7 dr TRUE TRUE TRUE TRUE TRUE
#> 8 ds FALSE FALSE FALSE FALSE FALSE
#> 9 dt FALSE FALSE FALSE FALSE FALSE
#> 10 du FALSE FALSE FALSE FALSE TRUE
#> # … with 990 more rows, and 45 more variables: `700109581` <lgl>,
#> # `700111044` <lgl>, `700101365` <lgl>, `700100431` <lgl>,
#> # `700016050` <lgl>, ...
# Check if there are any reads in each group of samples
counts_to_presence(x, "tax_data", groups = hmp_samples$body_site)
#> # A tibble: 1,000 x 6
#> taxon_id Nose Saliva Skin Stool Throat
#> <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 dm TRUE TRUE FALSE FALSE TRUE
#> 2 dn FALSE TRUE FALSE FALSE TRUE
#> 3 do TRUE TRUE TRUE FALSE TRUE
#> 4 dp TRUE FALSE TRUE FALSE FALSE
#> 5 dq TRUE TRUE TRUE FALSE TRUE
#> 6 dp TRUE TRUE TRUE FALSE FALSE
#> 7 dr TRUE FALSE TRUE FALSE FALSE
#> 8 ds TRUE FALSE TRUE TRUE FALSE
#> 9 dt FALSE TRUE FALSE FALSE TRUE
#> 10 du TRUE FALSE TRUE TRUE FALSE
#> # … with 990 more rows
Created on 2019-06-20 by the reprex package (v0.3.0)
How do I change the Wilcoxon test into the binomial glm?
I assume you are talking about using compare_groups with heat_tree_matrix? If so, you can make compare_groups use your own custom function with the func option. Your function is called for every comparison of a taxon between two groups and is given the two groups of counts (TRUE/FALSE in your case). Unfortunately, I don't know how to use GLM, so you'll have to figure out that part, but you can see the format of the function to use in the help page for ?compare_groups.
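For the GLM part, a minimal sketch of what such a custom function could look like is below. It assumes the function receives two vectors for one taxon (one per group) and returns a named list, which is the shape of the presence_func example later in this thread. The name `binomial_glm_func` and the output names `prop_diff` and `glm_p_value` are made up. It uses base R's `glm()` with a likelihood-ratio test and has not been checked against compare_groups itself, so treat it as a starting point only.

```r
# Hypothetical custom function for the `func` option of compare_groups().
# `pres_1` and `pres_2` hold one taxon's values in the samples of each group
# (TRUE/FALSE, or counts where anything above 0 is treated as present).
binomial_glm_func <- function(pres_1, pres_2) {
  presence <- as.integer(c(pres_1, pres_2) > 0)
  group <- factor(rep(c("group_1", "group_2"),
                      times = c(length(pres_1), length(pres_2))))
  prop_diff <- mean(pres_1 > 0) - mean(pres_2 > 0)

  # Present in every sample or in none: there is nothing to test
  if (length(unique(presence)) < 2) {
    return(list(prop_diff = prop_diff, glm_p_value = NA_real_))
  }

  # Binomial GLM of presence on group. Warnings about perfectly separated
  # groups are silenced; the likelihood-ratio test below is still computed.
  fit <- suppressWarnings(glm(presence ~ group, family = binomial))
  lrt <- anova(fit, test = "Chisq")

  list(prop_diff = prop_diff, glm_p_value = lrt[["Pr(>Chi)"]][2])
}

# Taxon detected in 7 of 8 samples of one group and 1 of 8 of the other
binomial_glm_func(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE),
                  c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE))
```

It would presumably be passed as `func = binomial_glm_func` in the same way presence_func is passed to compare_groups further down, with the resulting p-values adjusted using `p.adjust()` as in the Wilcoxon example.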
Finally, I am not interested in all the obtained comparisons, is it possible to select the comparisons of interest when plotting the heat trees?
Yeah, that's also an option in ?compare_groups, called combinations. Or you can subset the data to just the groups you want to compare before using compare_groups, but you will get all pairwise combinations of those groups. If you don't want all pairwise combinations, but just want specific pairs, then you need to use combinations. I'm not sure you would want to plot those results with heat_tree_matrix, but you can subset the results of each comparison and plot them on individual trees.
library(metacoder)
#> Loading required package: taxa
#> This is metacoder verison 0.3.2.9002 (development version)
# Parse data for plotting
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")
# Convert counts to proportions
x$data$otu_table <- calc_obs_props(x, data = "tax_data", cols = hmp_samples$sample_id)
#> Calculating proportions from counts for 50 columns for 1000 observations.
# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "otu_table", cols = hmp_samples$sample_id)
#> Summing per-taxon counts from 50 columns for 174 taxa
# Calculate difference between groups
x$data$diff_table <- compare_groups(x, data = "tax_table",
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$body_site,
                                    combinations = list(c('Nose', 'Saliva'),
                                                        c('Skin', 'Throat')))
x$data$diff_table$wilcox_p_value <- p.adjust(x$data$diff_table$wilcox_p_value,
                                             method = "fdr")
# Plot subset
x %>%
  filter_obs(data = "diff_table", treatment_1 == "Nose", treatment_2 == "Saliva") %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs,
            node_color = ifelse(wilcox_p_value > 0.05 | is.nan(wilcox_p_value),
                                0, log2_median_ratio),
            node_color_interval = c(-2, 2), # The range of `log2_median_ratio` to display
            node_color_range = c("cyan", "gray", "tan"),
            node_size_axis_label = "OTU count",
            node_color_axis_label = "Log 2 ratio of median proportions",
            layout = "davidson-harel",
            initial_layout = "reingold-tilford")
Created on 2019-06-20 by the reprex package (v0.3.0)
If you are concerned about using a Wilcoxon test (a valid concern) and are looking for an alternative, not specifically a GLM, you could look into my recent attempts to use DESeq2 for differential abundance testing. It uses a negative binomial distribution to model read counts. As far as I know, it is one of the best methods out there for this. Check it out here if you are interested:
Hi @zachary-foster
Thank you for your answer! Actually, I was more interested in taxa presence/absence than their abundance, which is why I wanted to use a binomial glm (on presence/absence data) and then make a heat tree based on that analysis. I tried to change it in the compare_groups function, but it is definitely out of my league (for now, and I am running out of time). So I think I will stick to the Wilcoxon test for now.
However, I am not too sure about how to interpret the log2 ratio. If I take your example (heat tree comparing nose and saliva), is it correct to say that Staphylococcus are more frequent/abundant(?) in nose than in saliva samples? Or is there more to say, and how do you use the notion of log2 ratio for description? I think it is just an interpretation issue, as I don't often encounter log2 ratios in analyses and don't know how to use them to describe the heat tree (when comparing different treatments).
Lastly, I was wondering whether it is possible to highlight a branch of interest in the heat tree. Let's say I want to highlight the whole branch leading to Bacillus, because it is known in the literature for having X functions.
Thanks again for your help!
Actually, I was more interested in taxa presence/absence than their abundance, which is why I wanted to use a binomial glm (on presence/absence data)
Do you need a test for this? What if you just picked a minimum read count for a taxon to be considered present and colored the taxa that were present one color and taxa not present in grey? You could have a set of trees, one for each experimental factor. Or if you wanted to do comparisons between factors, you could have 4 colors: present in both, present in factor A, present in factor B, or absent in both. Like this:
library(metacoder)
#> Loading required package: taxa
#> This is metacoder verison 0.3.2.9001 (development version)
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")
# Set read counts to 0 below a minimum threshold
x$data$tax_data <- zero_low_counts(x, data = "tax_data", min_count = 10, cols = hmp_samples$sample_id)
#> Zeroing 5422 of 50000 counts less than 10.
# Convert counts to proportions
x$data$otu_table <- calc_obs_props(x, data = "tax_data", cols = hmp_samples$sample_id)
#> Calculating proportions from counts for 50 columns for 1000 observations.
# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "otu_table", cols = hmp_samples$sample_id)
#> Summing per-taxon counts from 50 columns for 174 taxa
# Make new func for compare_groups for presence/absence
presence_func <- function(abund_1, abund_2) {
  abund_1 <- sum(abund_1)
  abund_2 <- sum(abund_2)
  if (abund_1 > 0 && abund_2 > 0) {
    out <- 'both'
  } else if (abund_1 > 0) {
    out <- 'only 1'
  } else if (abund_2 > 0) {
    out <- 'only 2'
  } else {
    out <- 'neither'
  }
  return(list(presence = out))
}
# Calculate difference between groups
x$data$diff_table <- compare_groups(x, data = "tax_table", func = presence_func,
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$body_site)
# Set colors to use
color_key <- c('both' = 'purple', 'only 1' = 'blue', 'only 2' = 'red', 'neither' = 'grey')
x$data$diff_table$color <- color_key[as.character(x$data$diff_table$presence)]
# Plot results (might take a few minutes)
heat_tree_matrix(x,
                 row_label_color = color_key['only 1'],
                 col_label_color = color_key['only 2'],
                 data = "diff_table",
                 node_size = n_obs,
                 node_label = taxon_names,
                 node_color = color,
                 make_node_legend = FALSE,
                 make_edge_legend = FALSE,
                 node_color_trans = "linear",
                 node_color_interval = c(-3, 3),
                 edge_color_interval = c(-3, 3),
                 node_size_axis_label = "Number of OTUs",
                 node_color_axis_label = "Log2 ratio median proportions")
Created on 2019-06-27 by the reprex package (v0.3.0)
However, I am not too sure about how to interpret the log2 ratio.
It is the same information as a difference in abundance, but better for plotting. It's a ratio instead of a difference so that differences in taxa with a small proportion of the reads are visible (e.g. two taxa with 1% and 2% of reads have the same ratio as two with 10% and 20%). The ratio is log transformed so that it is centered on 0 and symmetric.
log2(1/1)
#> [1] 0
log2(1/2)
#> [1] -1
log2(10/20)
#> [1] -1
log2(2/1)
#> [1] 1
log2(20/10)
#> [1] 1
Created on 2019-06-27 by the reprex package (v0.3.0)
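To read values off the color scale, it can help to convert a log2 ratio back into a fold change with `2^x`. A small base-R illustration follows; the values are arbitrary examples, not results from the data above.

```r
# Arbitrary log2 ratios, e.g. the ends and middle of a c(-2, 2) color interval
log2_ratio <- c(-2, -1, 0, 1, 2)

# Back-transform to a fold change: 2 means twice as abundant in the group
# used as the numerator of the ratio, 0.5 means half as abundant there
fold_change <- 2^log2_ratio
fold_change
```

So with `node_color_interval = c(-2, 2)`, as in the Nose vs. Saliva plot earlier in the thread, the two ends of the color scale correspond to a four-fold difference in either direction.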
Look at this FAQ for how to tell which color is which treatment and let me know if you have questions.
Lastly, I was wondering whether it is possible to highlight a branch of interest in the heat tree.
Not right now, but it's something I want to add. You could save the plot as a PDF or SVG and edit it by hand with the free program Inkscape.
Again, thanks a lot for your answer! This kind of heat tree reminds me of UpSet plots (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/) but with the taxonomy info, which is great! This worked with my data, but I added a threshold so that taxa are highlighted only if they are present in at least 50% of my replicates; otherwise a taxon is highlighted when present in only one replicate of treatment 1 compared to treatment 2. Thank you!!!
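The code for that 50% threshold was not posted, but one way it could be done is sketched below as a variation on presence_func from above. The name `presence_func_min_prop`, the `min_prop` argument, and the choice to treat any value above 0 as "present" are all assumptions, not the code that was actually used.

```r
# Hypothetical variant of presence_func: a taxon counts as present in a group
# only if it is detected in at least `min_prop` of that group's samples.
presence_func_min_prop <- function(abund_1, abund_2, min_prop = 0.5) {
  in_1 <- mean(abund_1 > 0) >= min_prop
  in_2 <- mean(abund_2 > 0) >= min_prop
  if (in_1 && in_2) {
    out <- "both"
  } else if (in_1) {
    out <- "only 1"
  } else if (in_2) {
    out <- "only 2"
  } else {
    out <- "neither"
  }
  list(presence = out)
}

# Detected in 3 of 4 replicates of group 1 but only 1 of 4 of group 2
presence_func_min_prop(c(12, 0, 30, 5), c(0, 0, 0, 8))
```

Because it returns the same labels as presence_func, the color_key lookup from the earlier example should work unchanged.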
No problem!
Hello @zachary-foster ,
I have to start by thanking you for such a great package and the support found in these pages!
I am reviving this thread because what I am trying to do is close to this! I have two sampling strategies (metabarcoding and traditional ID), and what I am trying to show is something akin to the section 'Comparing two treatments/groups' of this example https://grunwaldlab.github.io/metacoder_documentation/example.html but using presence/absence data.
I only have 2 groups that I wish to compare in the same tree, with the colours ranging from group1 = blue through both = gray to group2 = tan.
I was able to run the compare_groups command with the suggested presence_func, and my diff_table now includes the both/group1/group2 labels, but I can't quite get the heat_tree function to colour the tree in a gradient. I was only able to get solid colours corresponding to the groups, and these change according to the order in which I write them (e.g. c(both=gray, grp1=blue, grp2=tan) vs. c(grp1=blue, grp2=tan, both=gray)). I think the ideal would be a gradient, as in the heat_tree function using node_color = log2_median_ratio, but I am having a hard time justifying this, since we have a presence/absence matrix and are testing the differences between them separately. I just want this as a neat figure to show (in a gradient) the taxa identified with either strategy.
Sorry if this is long-winded and unclear. In short, is there a way to use the heat_tree function with a colour gradient based on presence/absence counts? Does this make sense? Thanks a lot!
Thanks!
I'm not sure I understand. Do you want to use those colors as 3 categories in all taxa, or have them blended for internal taxa (e.g., if half the species in a genus are present in both)? Both can be done.
Hi Zachary, Thanks a lot for your reply!
I can do the three colors as categories like this: node_color = c('only 1' = 'blue', 'only 2' = 'tan', 'both' = 'gray')
but ideally I want to have a blending of the colors, as when using node_color_range = c("blue", "gray", "tan") with node_color = log2_median_ratio.
What I would like is that, for instance, when a larger proportion of metabarcoding samples found Formicidae, then yes, it is colored blue, but when it's only found, say, 10% more often in metabarcoding than in traditional ID, then it's more grayish-blue than just blue. Does this make sense?
I'm attaching the figures to make it clearer. differential_heat_tree is what I mean when I used presence/absence data and it treats the colors just as categories.
What I'm trying to get, is the differential_heat_tree_meandiff where the colors are better blended. I hope this makes sense!
differential_heat_treepresence.pdf (https://github.com/grunwaldlab/metacoder/files/7557600/differential_heat_treepresence.pdf), differential_heat_tree_meandiff.pdf
How about something like this?
library(metacoder)
#> This is metacoder verison 0.3.5 (stable)
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")
# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = hmp_samples$sample_id)
#> Summing per-taxon counts from 50 columns for 174 taxa
# Convert taxon counts to 0/1
x$data$tax_table[-1] <- lapply(x$data$tax_table[-1], function(y) ifelse(y > 10, 1, 0))
x$data$diff_table <- compare_groups(x, data = "tax_table",
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$sex)
heat_tree(x,
          node_label = taxon_names,
          node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
          node_color = mean_diff, # A column from `x$data$diff_table`
          node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
          node_color_range = c("cyan", "gray", "tan"), # The color palette used
          node_color_digits = 1,
          node_size_axis_label = "OTU count",
          node_color_axis_label = "Mean difference in sample proportion",
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
Created on 2021-11-18 by the reprex package (v2.0.1)
Note that the taxon counts from summing up the presence/absence data have to be converted back to 0/1. This means the mean_diff output of compare_groups is the difference between the two groups in the proportion of samples in which a taxon was detected. For example, if group 1 for taxon A had (1, 1, 1, 0, 0) and group 2 had (0, 0, 0, 0, 1) for the 5 replicates, the mean_diff would be 0.6 - 0.2 = 0.4. Make sense?
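The arithmetic in this example can be checked directly in base R. The vector names below are made up, and the sign of the result simply depends on which group is subtracted from which.

```r
# 0/1 detection of one taxon in 5 replicates of each of two groups
detected_a <- c(1, 1, 1, 0, 0)
detected_b <- c(0, 0, 0, 0, 1)

# Proportion of replicates in which the taxon was detected
prop_a <- mean(detected_a)  # 3 of 5
prop_b <- mean(detected_b)  # 1 of 5

# Difference in detection proportion between the two groups
prop_a - prop_b
```

A value near 0 therefore means the taxon was detected about equally often in both groups, which is what the gray middle of the color range represents in the plot above.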
Hi Zachary! Yes, perfect sense! This is exactly what I needed! Really, thank you for the support!
Great, no problem!
Hello @zachary-foster
I attended your workshop at the Phytobiome Conference in Montpellier last December and I managed to make several comparative heat trees (using a phyloseq object). Again, thanks for the workshop, as it has been very beneficial for my current analyses!
I was wondering if it was possible to use presence/absence data instead of proportions in order to perform a generalized linear model (glm) using the binomial family (instead of the Wilcoxon test). How do I obtain presence/absence data? Should the transformation occur before or after the parse function? How do I change the Wilcoxon test into the binomial glm? Finally, I am not interested in all the obtained comparisons, is it possible to select the comparisons of interest when plotting the heat trees?
Thank you in advance for your help.