Comparing heat trees using presence/absence data and binomial glm

M-Ourry commented 5 years ago

Hello @zachary-foster

I attended your workshop at the Phytobiome Conference in Montpellier last December and I managed to realize several comparing heat trees (using a phyloseq object). Again, thanks for the workshop as it is very beneficial for my current analyses!

I was wondering if it was possible to use presence/absence data instead of proportions in order to perform a generalized linear model (glm) using the binomial family (instead of the wilcoxon test). How do I obtain presence/absence data? Should the transformation occur before or after the parse function? How do I change the wilcoxon test into the binomial glm? Finally, I am not interested in all the obtained comparisons, is it possible to select the comparisons of interest when plotting the heat trees?

Thank you in advance for your help.

zachary-foster commented 5 years ago

Hi @M-Ourry,

Thanks, I am glad it was helpful!

I was wondering if it was possible to use presence/absence data instead of proportions in order to perform a generalized linear model (glm) using the binomial family (instead of the wilcoxon test).

I have not done that, but you can plot anything if you can get the results into the right format. In this case, you need per-taxon values of what ever you are plotting in a table with a taxon_id column.

How do I obtain presence/absence data? Should the transformation occur before or after the parse function?

You can do it with counts_to_presence after reading in your data:

library(metacoder)
#> Loading required package: taxa
#> This is metacoder verison 0.3.2.9001 (development version)
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Convert count to presence/absence
counts_to_presence(x, "tax_data")
#> # A tibble: 1,000 x 51
#>    taxon_id `700035949` `700097855` `700100489` `700111314` `700033744`
#>    <chr>    <lgl>       <lgl>       <lgl>       <lgl>       <lgl>      
#>  1 dm       FALSE       TRUE        TRUE        FALSE       FALSE      
#>  2 dn       FALSE       FALSE       FALSE       FALSE       FALSE      
#>  3 do       FALSE       TRUE        FALSE       FALSE       FALSE      
#>  4 dp       TRUE        TRUE        TRUE        TRUE        TRUE       
#>  5 dq       TRUE        TRUE        FALSE       FALSE       FALSE      
#>  6 dp       TRUE        TRUE        TRUE        TRUE        TRUE       
#>  7 dr       TRUE        TRUE        TRUE        TRUE        TRUE       
#>  8 ds       FALSE       FALSE       FALSE       FALSE       FALSE      
#>  9 dt       FALSE       FALSE       FALSE       FALSE       FALSE      
#> 10 du       FALSE       FALSE       FALSE       FALSE       TRUE       
#> # … with 990 more rows, and 45 more variables: `700109581` <lgl>,
#> #   `700111044` <lgl>, `700101365` <lgl>, `700100431` <lgl>,
#> #   `700016050` <lgl>, ...

# Check if there are any reads in each group of samples
counts_to_presence(x, "tax_data", groups = hmp_samples$body_site)
#> # A tibble: 1,000 x 6
#>    taxon_id Nose  Saliva Skin  Stool Throat
#>    <chr>    <lgl> <lgl>  <lgl> <lgl> <lgl> 
#>  1 dm       TRUE  TRUE   FALSE FALSE TRUE  
#>  2 dn       FALSE TRUE   FALSE FALSE TRUE  
#>  3 do       TRUE  TRUE   TRUE  FALSE TRUE  
#>  4 dp       TRUE  FALSE  TRUE  FALSE FALSE 
#>  5 dq       TRUE  TRUE   TRUE  FALSE TRUE  
#>  6 dp       TRUE  TRUE   TRUE  FALSE FALSE 
#>  7 dr       TRUE  FALSE  TRUE  FALSE FALSE 
#>  8 ds       TRUE  FALSE  TRUE  TRUE  FALSE 
#>  9 dt       FALSE TRUE   FALSE FALSE TRUE  
#> 10 du       TRUE  FALSE  TRUE  TRUE  FALSE 
#> # … with 990 more rows

^{Created on 2019-06-20 by the reprex package (v0.3.0)}

How do I change the wilcoxon test into the binomial glm?

I assume you are talking about using compare_groups with heat_tree_matrix? If so, you can make compare_groups use your own custom function with the func option. Your function is with every comparison of a taxon between two groups of counts (TRUE/FALSE in your case). Unfortunately, I don't know how to use GLM, so you'll have to figure out that part, but you can see the format of the function to use in the help page for ?compare_groups.

Finally, I am not interested in all the obtained comparisons, is it possible to select the comparisons of interest when plotting the heat trees?

Yea, that's also an option in ?compare_groups called combinations. Or you can subset the data to just the groups you want to compare before using compare_groups, but you will get all pairwise combinations. If you don't want all pairwise combinations, but just want specific pairs, then you need to usecombinations. Im not sure you would want to plot those results with heat_tree_matrix, but you can subset the results of each comparison and plot them on individual trees.

library(metacoder)
#> Loading required package: taxa
#> This is metacoder verison 0.3.2.9002 (development version)

# Parse data for plotting
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Convert counts to proportions
x$data$otu_table <- calc_obs_props(x, data = "tax_data", cols = hmp_samples$sample_id)
#> Calculating proportions from counts for 50 columns for 1000 observations.

# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "otu_table", cols = hmp_samples$sample_id)
#> Summing per-taxon counts from 50 columns for 174 taxa

# Calculate difference between groups
x$data$diff_table <- compare_groups(x, data = "tax_table",
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$body_site,
                                    combinations = list(c('Nose', 'Saliva'),
                                                        c('Skin', 'Throat')))
x$data$diff_table$wilcox_p_value <- p.adjust(x$data$diff_table$wilcox_p_value,
                                             method = "fdr")

# Plot subset
x %>%
  filter_obs(data = "diff_table", treatment_1 == "Nose", treatment_2 == "Saliva") %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, 
            node_color = ifelse(wilcox_p_value > 0.05 | is.nan(wilcox_p_value), 
                                0, log2_median_ratio),
            node_color_interval = c(-2, 2), # The range of `log2_median_ratio` to display
            node_color_range = c("cyan", "gray", "tan"), 
            node_size_axis_label = "OTU count",
            node_color_axis_label = "Log 2 ratio of median proportions",
            layout = "davidson-harel",
            initial_layout = "reingold-tilford")

^{Created on 2019-06-20 by the reprex package (v0.3.0)}

If you are concerned about using a Wilcox test (a valid concern) and looking for an alternative, not specifically GLM, you could look into my recent attempts to use DESeq2 to do differential abundance testing. It uses a negative binomial distribution to model read counts. As far as I know, it is one of the best methods out there for this. Check it out here if you are interested:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/metacoder-discussions/xqzsD592QGg/6b7Bw13QAAAJ

M-Ourry commented 5 years ago

Hi @zachary-foster

Thank you for your answer! Actually, I was more interested in taxa presence/absence than their abundance, which is why I wanted to use a binomial glm (on presence/absence data) and then realize a heat tree using this statistical analysis. I tried to change it in the compare_group function but it is definitely out of my league (for now and I am running out of time). So I think I will stick to the wilcoxon test for now.

However, I am not too sure about how to interprete the log2 ratio. If I take your example (heat tree comparing nose and saliva), is it correct to say that Staphylococcus are more frequent/abundant(?) in nose than saliva samples? Or are there more things to say and how do you use the notion of log2 ratio for description? I think it is just an interpretation issue as I don't often encounter log2 ratio in analyses and don't know how to use it to describe the heat tree (when comparing different treatments).

Lastly, I was wondering whether it is possible to highlight a branch of interest in the heat tree. Let's say I want to highlight the whole branch leading to Bacillus, because it is known in the litterature for having X functions.

Thanks again for your help!

zachary-foster commented 5 years ago

Actually, I was more interested in taxa presence/absence than their abundance, which is why I wanted to use a binomial glm (on presence/absence data)

Do you need a test for this? What if you just picked a minimum read count for a taxon to be considered present and colored the taxa that were present one color and taxa not present in grey? You could have a set of trees, one for each experimental factor. Or if you wanted to do comparisons between factors, you could have 4 colors: present in both, present in factor A, present in factor B, or absent in both. Like this:

library(metacoder)
#> Loading required package: taxa
#> This is metacoder verison 0.3.2.9001 (development version)

x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Set read counts to 0 below a minimum threshold
x$data$tax_data <- zero_low_counts(x, data = "tax_data", min_count = 10, cols = hmp_samples$sample_id)
#> Zeroing 5422 of 50000 counts less than 10.

# Convert counts to proportions
x$data$otu_table <- calc_obs_props(x, data = "tax_data", cols = hmp_samples$sample_id)
#> Calculating proportions from counts for 50 columns for 1000 observations.

# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "otu_table", cols = hmp_samples$sample_id)
#> Summing per-taxon counts from 50 columns for 174 taxa

# Make new func for compare_groups for presence/absence
presence_func <- function(abund_1, abund_2) {
  abund_1 <- sum(abund_1)
  abund_2 <- sum(abund_2)
  if (abund_1 > 0 && abund_2 > 0) {
    out <- 'both'
  } else if (abund_1 > 0) {
    out <- 'only 1'
  } else if (abund_2 > 0) {
    out <- 'only 2'
  } else {
    out <- 'neither'
  }

  return(list(presence = out))
}

# Calculate difference between groups
x$data$diff_table <- compare_groups(x, data = "tax_table", func = presence_func, 
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$body_site)

# Set colors to use
color_key <- c('both' = 'purple', 'only 1' = 'blue', 'only 2' = 'red', 'neither' = 'grey')
x$data$diff_table$color <- color_key[as.character(x$data$diff_table$presence)]

# Plot results (might take a few minutes)
heat_tree_matrix(x,
                 row_label_color = color_key['only 1'],
                 col_label_color =  color_key['only 2'],
                 data = "diff_table",
                 node_size = n_obs,
                 node_label = taxon_names,
                 node_color = color,
                 make_node_legend = FALSE, 
                 make_edge_legend = FALSE,
                 node_color_trans = "linear",
                 node_color_interval = c(-3, 3),
                 edge_color_interval = c(-3, 3),
                 node_size_axis_label = "Number of OTUs",
                 node_color_axis_label = "Log2 ratio median proportions")

^{Created on 2019-06-27 by the reprex package (v0.3.0)}

However, I am not too sure about how to interprete the log2 ratio.

It is the same information as a difference in abundance, but better for plotting. Its a ratio instead of a difference so that differences in taxa with a small proportion of the reads are visible (e.g. two taxa with 1% and 2% of reads is the same ratio as two with 10% and 20%). The ratio is log transformed so that it is centered on 0 and symmetric.

log2(1/1)
#> [1] 0
log2(1/2)
#> [1] -1
log2(10/20)
#> [1] -1
log2(2/1)
#> [1] 1
log2(20/10)
#> [1] 1

^{Created on 2019-06-27 by the reprex package (v0.3.0)}

Look at this FAQ for how too tell which color is which treatment and let me know if you have questions.

Lastly, I was wondering whether it is possible to highlight a branch of interest in the heat tree.

Not right now, but its something I want to add. You could save the plot as a pdf or svg and edit it by hand with the free program inkscape.

M-Ourry commented 5 years ago

Again, thanks a lot for your answer! This kind of heat tree reminds me of the UpSet plots ([https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/]) but with the taxonomy info, which is great! This worked with my data but I added a threshold so that taxa are highlighted only if there are present in at least 50% of my replicates, otherwise a taxon is highlighted when present in only one replicate of treatment 1 compared to treatment 2. Thank you!!!

zachary-foster commented 5 years ago

No problem!

DanielSoutoV commented 2 years ago

Hello @zachary-foster ,

I have to start by thanking for such a great package and the support found in these pages!

I resuscitate this thread because what I am trying to do is close to this! I have two sampling strategies (metabarcoding and traditional ID) and what I am trying to show is something akin to the section 'Comparing two treatments/groups' of this example https://grunwaldlab.github.io/metacoder_documentation/example.html but using presence/absence data.

I only have 2 groups I wish to compare in the same tree and the colours ranging from group1 = blue, both = gray, group2=tan

I was able to run the compare_groups command with the suggested presence_function and my diff_table now includes the both/group1/group2 labels but I can't quite get the heat_tree function to colour the tree in a gradient. I was only able to have solid colours corresponding to the groups, but these change according to the order in which I write them (eg. c(both=gray, grp1=blue, grp2=tan) VS c(grp1=blue, grp2=tan, both=gray). I think the ideal would be to have a gradient as in the heat_tree function using node_color = log2_median_ratio but I am having a hard time justifying this since with the presence/absence matrix and we are testing the differences between them separately. I just want this as a neat figure to show (in a gradient) the taxa identified with either strategy.

Sorry if this is longwinded and unclear. In short, is there a way to use the heat_tree function with a colour gradient based on presence/absence counts! Does this make sense? Thanks a lot!

zachary-foster commented 2 years ago

Thanks!

I not sure I understand. Do you want to use those colors as 3 categories in all taxa, or have them blended for internal taxa (e.g., if half the species in a genus are present in both)? Both can be done

DanielSoutoV commented 2 years ago

Hi Zachary, Thanks a lot for your reply!

I can do the three colors as categories as such: node_color = c('only 1' = 'blue','only2' = 'tan', 'both'='gray')

but ideally I want to have a blending of the colors as when using node_color_range = c("blue", "gray", "tan") when node_color = log2_median_ratio

What I would like is that, for instance, when a larger proportion of metabarcoding samples found formicidae then yes, it is colored blue, but when its only found say 10% more times in the metabarcoding than when in traditional ID, then its more grayish-blue than just blue. Does this make sense?

I'm attaching the figures to make it clearer. differential_heat_tree is what I mean when I used presence/absence data and it treats the colors just as categories.

What I'm trying to get, is the differential_heat_tree_meandiff where the colors are better blended. I hope this makes sense!

[differential_heat_treepresence.pdf] differential_heat_tree_meandiff.pdf (https://github.com/grunwaldlab/metacoder/files/7557600/differential_heat_treepresence.pdf) l_heat_tree_method_ocurrence_separate_meandiff.pdf)

zachary-foster commented 2 years ago

How about something like this?

library(metacoder)
#> This is metacoder verison 0.3.5 (stable)

x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = hmp_samples$sample_id)
#> Summing per-taxon counts from 50 columns for 174 taxa

# Convert taxon counts to 0/1
x$data$tax_table[-1] <- lapply(x$data$tax_table[-1], function(y) ifelse(y > 10, 1, 0))

x$data$diff_table <- compare_groups(x, data = "tax_table",
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$sex)

heat_tree(x,
         node_label = taxon_names,
         node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
         node_color = mean_diff, # A column from `x$data$diff_table`
         node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
         node_color_range = c("cyan", "gray", "tan"), # The color palette used
         node_color_digits = 1,
         node_size_axis_label = "OTU count",
         node_color_axis_label = "Mean difference in sample proportion",
         layout = "davidson-harel", # The primary layout algorithm
         initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations

^{Created on 2021-11-18 by the reprex package (v2.0.1)}

Note that the taxon counts from summing up the presence/absence data has to be converted back to 0/1. This means the "mean_diff" output of compare_groups returns the difference in the proportion of samples detected in each group. For example if group 1 for taxon A had (0, 0, 0, 0, 1) and group 2 had (1,1,1,0,0) for the 5 replicates, the mean_diff would be 0.6 - 0.2 = 0.4. Make sense?

DanielSoutoV commented 2 years ago

Hi Zachary! Yes, perfect sense! This is exactly what I needed! Really, thank you for the support!

zachary-foster commented 2 years ago

Great, no problem!

grunwaldlab / metacoder

Comparing heat trees using presence/absence data and binomial glm #266