grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation
Other
134 stars 28 forks source link

Run heat_tree with result of differential abundance analysis #355

Closed Paoluxxe closed 6 months ago

Paoluxxe commented 9 months ago

Hello,

Thanks for developing this very interesting package! I'm facing a problem because I'd like to illustrate the result of a differential abundance analysis with the heat_tree function, i.e. I'd like to color distinctly taxa which are preferentially associated with one or another modality of a group and make the node size dependent on logfold change. This differntial abundance analysis was carried out independently of the metacoder package (I used Maaslin2). So I have now a table with a log fold change for each taxa of my table. I've spent all day looking for a code example for this on the internet but haven't found a solution...

Is it possible to do this? If so, can you explain?

Thank you for your attention.

zachary-foster commented 8 months ago

Hello,

Its probably doable, just need to get the table in the right format and put it in a taxmap object. What is the format you have from Maaslin2? Can you share an example of data in that format?

Paoluxxe commented 8 months ago

That would be great! Here is a sample of the file, it contains the taxon name (Cluster), the coefficient (coef) and the taxonomy (Domain, Phylum, Class, Order Family, Genus, Species).

Results_Maaslin2_Github.xlsx

zachary-foster commented 8 months ago

You can get the data into taxmap format without much problem, but it seems you only have differential abundance data for the species-level taxa? If so, you need to get some info for the higher level taxa (e.g., Orders) for what you want to plot. I show in the code below one way that uses the obs/obs_apply functions to average the coef column for all the subtaxa in each taxon, resulting in a vector with one value per taxon. In order to plot information with heat_tree you need one value per taxon, named by taxon IDs.

library(metacoder)
library(readxl)

# Read input data
raw_data <- read_xlsx('~/Downloads/Results_Maaslin2_Github.xlsx')
x <- parse_tax_data(raw_data, class_cols = 6:12)

# Averge coef for each taxon (this could also be columns in a table if you did this to multiple values)
x$data$mean_coef <- unlist(obs_apply(x, data = "tax_data", func = mean, value = "coef"))

# Plot to test
heat_tree(x, node_label = taxon_names, node_size = n_obs, node_color = mean_coef)

Created on 2023-11-02 with reprex v2.0.2

vyom84 commented 6 months ago

Hi, Thanks for developing the package and providing the above command. I am also in the same boat and can reproduce the above command, but in my case, I also obtained the value at different taxon levels from phylum - species.

I am unable to generate the heat tree using the values from Group A (column B) and taxa information from the rest of the columns.

Update: I was able to get it to work by replacing the empty values with Unclassified and got the figure. The figure does consider the data from all of the taxa data (phylum - species) and not only where I have data up to the species level. Please confirm.

Also, how can I plot only data using a specific taxa level (filter_taxa(Taxon == "Genus," supertaxa = TRUE) %>%) - This isn't working)?

Thank you.

example.xlsx example_Unclassified.xlsx

zachary-foster commented 6 months ago

It looks like your data has one value per taxon (mostly, look at the note in the code example), so you do not need to infer the values for the coarser taxa like Paoluxxe did. However, you might want to add a row at the top for the Kingdom level (with a value of 0), so that it ends up as one tree.

library(metacoder)
#> This is metacoder version 0.3.6 (stable)
library(readxl)

# Read input data
raw_data <- read_xlsx('~/Downloads/example_Unclassified.xlsx')
colnames(raw_data)[2] <- 'difference'
x <- parse_tax_data(raw_data, class_cols = 4:10)

# Remove unclassified taxa 
x <- filter_taxa(x, taxon_names != "Unclassified")

# Remove taxa not represented as rows in the input data
#   NOTE: It looks like your format should have 1 value per taxon, but
#         when I parse it, it I get 85 taxa from 75 rows. I did not look into 
#         why that is, but it might be worth looking into why that is. I 
#         would not expect this filtering step to be needed given you input format
x <- filter_taxa(x, taxon_ids %in% x$data$tax_data$taxon_id)

# Plot to test
heat_tree(x,
          node_label = taxon_names,
          node_size = n_obs,
          node_color = difference,
          node_size_range = c(0.01, 0.04))

Created on 2024-01-02 with reprex v2.0.2

vyom84 commented 6 months ago

Thanks for the input. I was able to generate the tree, but my main issue was plotting names. My input table has ~1000 entries ranging from phylum to species level, as I did the differential abundance test at Phylum level up to Species levels. Thus, I added the 3rd column (Taxon) so that I can only label up to the family or genus levels. Can I do that?

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

ID | Group A | Taxon | Kingdom | Phylum | Class | Order | Family | Genus | Species -- | -- | -- | -- | -- | -- | -- | -- | -- | -- PSLC-6 | -1.18 | Phylum | Bacteria | Firmicutes_A | Unclassified | Unclassified | Unclassified | Unclassified | Unclassified PSLC-102 | -0.27 | Phylum | Bacteria | Bacteroidota | Unclassified | Unclassified | Unclassified | Unclassified | Unclassified PSLC-102 | -0.42 | Class | Bacteria | Bacteroidota | Bacteroidia | Unclassified | Unclassified | Unclassified | Unclassified PSLC-57 | -0.49 | Class | Bacteria | Firmicutes | Bacilli | Unclassified | Unclassified | Unclassified | Unclassified PSLC-57 | -0.25 | Order | Bacteria | Firmicutes | Bacilli | Erysipelotrichales | Unclassified | Unclassified | Unclassified PSLC-10 | -2.46 | Order | Bacteria | Actinobacteriota | Actinomycetia | Actinomycetales | Unclassified | Unclassified | Unclassified PSLC-74 | -0.26 | Family | Bacteria | Firmicutes_A | Clostridia | Oscillospirales | Oscillospiraceae | Unclassified | Unclassified PSLC-6 | -1.37 | Family | Bacteria | Firmicutes_A | Clostridia | Lachnospirales | Lachnospiraceae | Unclassified | Unclassified PSLC-5 | 1.06 | Genus | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | UBA932 | Cryptobacteroides | Unclassified PSLC-36 | -1.52 | Genus | Bacteria | Firmicutes_A | Clostridia | Lachnospirales | Lachnospiraceae | Anaerobutyricum | Unclassified PSLC-36 | -1.52 | Species | Bacteria | Firmicutes_A | Clostridia | Lachnospirales | Lachnospiraceae | Anaerobutyricum | Anaerobutyricum_hallii PSLC-16 | -1.29 | Species | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides | Bacteroides_uniformis

Thanks

zachary-foster commented 6 months ago

You can do that without the extra "Taxon" column, since you can filter by the number of supertaxa each taxon has (n_supertaxa) or the name of the ranks (taxon_ranks if you add named_by_rank = TRUE to parse_tax_data). Does this do what you want?

library(metacoder)
#> This is metacoder version 0.3.6 (stable)
library(readxl)

# Read input data
raw_data <- read_xlsx('~/Downloads/example_Unclassified.xlsx')
colnames(raw_data)[2] <- 'difference'
x <- parse_tax_data(raw_data, class_cols = 4:10, named_by_rank = TRUE)

# Remove unclassified taxa 
x <- filter_taxa(x, taxon_names != "Unclassified")

# Remove taxa not represented as rows in the input data
#   NOTE: It looks like your format should have 1 value per taxon, but
#         when I parse it, it I get 85 taxa from 75 rows. I did not look into 
#         why that is, but it might be worth looking into why that is. I 
#         would not expect this filtering step to be needed given your input format
x <- filter_taxa(x, taxon_ids %in% x$data$tax_data$taxon_id)

# Plot up to family
x %>%
  filter_taxa(taxon_ranks == "Family", supertaxa = TRUE, reassign_obs = FALSE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs,
            node_color = difference,
            node_size_range = c(0.01, 0.04))


# Only label up to family
x %>%
  heat_tree(node_label = ifelse(n_supertaxa <= 3, taxon_names, ''),
            node_size = n_obs,
            node_color = difference,
            node_size_range = c(0.01, 0.04))

Created on 2024-01-02 with reprex v2.0.2

vyom84 commented 6 months ago

Thank you for the code. That works perfectly.

zachary-foster commented 6 months ago

Great! No problem!