Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
Color nodes according to correlation between relative abundance and external variables #340

Hi there, first of all, thanks for taxa and metacoder. These are extremely useful packages.

I was wondering if it would be possible to color the nodes and edges of a taxonomic tree according to the correlation between the relative abundance of each taxon and an external variable (e.g. an environmental variable such as pH or salinity). Thanks!

Yeah, it should be, but there is no function to do that for you yet. You would have to fit a model of each taxon's abundance against that external variable and add the correlation and p-value as columns in a table of per-taxon data in the taxmap object. You could then color by the correlation column, setting all correlations with a p-value > 0.05 to 0. Make sense?

Thanks for your fast response! That is what I had in mind, but I am not familiar enough with the taxmap object to do it. Let me know if you ever give it a try. I think this would be a cool addition to future versions of metacoder.

Here is a prototype of the technique that could be used to make a function in the future.

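Load libraries

# Packages needed by the code below (readr is called with readr:: and only needs to be installed)
library(metacoder)
library(readxl)
library(dplyr)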

I will use the TARA expedition dataset, since it is the first one that comes to mind that has sample data with a continuous variable (latitude).

Parsing taxonomic data

The data set at the below URL was downloaded and uncompressed:

raw_data <- readr::read_tsv("data/Database_W5_OTU_occurences.tsv")
obj <- parse_tax_data(raw_data, class_cols = "lineage", class_sep = "\\|", sep_is_regex = TRUE)

Getting sample data

The sample data was downloaded from the URL below:

sample_data <- read_excel("data/Database_W1_Sample_parameters.xls")

Calculate read abundance per taxon

The input data includes read abundance for each sample-OTU combination, but we need the abundances associated with each taxon for graphing and regression. There will usually be multiple OTUs assigned to the same taxon, especially for coarse taxonomic ranks (e.g. the root will have all OTU indexes), so the abundances at those indexes are summed to provide the total abundance for each taxon.

obj$data$otu_prop <- calc_obs_props(obj, data = "tax_data", cols = sample_data[["PANGAEA ACCESSION NUMBER"]])
obj$data$tax_abund <- calc_taxon_abund(obj, data = "otu_prop",
                                       cols = sample_data[["PANGAEA ACCESSION NUMBER"]])
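To make the summing step concrete, here is a minimal base R sketch on a made-up three-OTU table. It is not how metacoder implements calc_obs_props or calc_taxon_abund; the table, the column names, and the helper expand_lineage are all hypothetical and exist only for illustration.

```r
# Toy OTU table: three OTUs, two samples (all names and values are made up)
otu_counts <- data.frame(
  lineage = c("Bacteria|Proteobacteria", "Bacteria|Proteobacteria", "Bacteria|Firmicutes"),
  s1 = c(10, 30, 60),
  s2 = c(5, 5, 90),
  stringsAsFactors = FALSE
)
sample_cols <- c("s1", "s2")

# Convert counts to per-sample proportions (each sample column sums to 1)
otu_prop <- otu_counts
otu_prop[sample_cols] <- lapply(otu_counts[sample_cols], function(x) x / sum(x))

# Hypothetical helper: expand a lineage into the taxon itself plus all of its supertaxa
expand_lineage <- function(lineage) {
  ranks <- strsplit(lineage, "|", fixed = TRUE)[[1]]
  vapply(seq_along(ranks),
         function(i) paste(ranks[seq_len(i)], collapse = "|"),
         character(1))
}
taxa_per_otu <- lapply(otu_prop$lineage, expand_lineage)

# Sum the proportions of every OTU that falls within each taxon
all_taxa <- unique(unlist(taxa_per_otu))
toy_tax_abund <- t(vapply(all_taxa, function(taxon) {
  in_taxon <- vapply(taxa_per_otu, function(x) taxon %in% x, logical(1))
  colSums(otu_prop[in_taxon, sample_cols, drop = FALSE])
}, numeric(length(sample_cols))))

print(toy_tax_abund)
```

The root taxon ("Bacteria") contains every OTU, so its row sums to 1 in each sample, while the two phyla split that total between them.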

Looking for correlations between latitude and taxon abundance

I will be using simple linear regression to demonstrate how this might work for plotting with metacoder, but linear regression is probably not the right method for this kind of proportional data. This code does the bare minimum as an example and should not be used as is for research. Once I learn what an appropriate method is, I will try to remember to update this post.
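For comparison only, one option that avoids assuming a straight-line relationship is a rank-based test such as Spearman's correlation, available in base R through cor.test. This is a swapped-in technique shown on made-up numbers, not the method used in the rest of this post, and it is not necessarily the right choice for compositional data either.

```r
# Made-up example: one taxon's proportion in six samples and their latitudes
toy <- data.frame(
  latitude = c(-60, -30, -10, 5, 25, 50),
  prop     = c(0.02, 0.05, 0.04, 0.09, 0.12, 0.20)
)

# Spearman's rank correlation tests for a monotonic trend using ranks only
# (shown for illustration; the rest of this post uses lm)
sp <- cor.test(toy$prop, toy$latitude, method = "spearman")

rho    <- unname(sp$estimate)  # correlation, between -1 and 1
pvalue <- sp$p.value

print(c(rho = rho, pvalue = pvalue))
```

A value like rho could take the place of the regression coefficient when coloring the tree, with the advantage that it is bounded between -1 and 1.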

The first step is to build, for each taxon, a table in the format that typical regression functions like lm expect: one row per sample, with one column for abundance and another for the continuous variable of interest from the sample metadata. Since the taxon abundance table in the taxmap object has taxa in rows and samples in columns, the single row for each taxon must be transposed and combined with the sample IDs from the column names. The result can then be joined with the sample metadata by sample ID to generate the input for lm or similar functions. The output of lm can then be reduced, for each taxon, to a one-row table with columns for taxon ID, regression coefficient, and p-value. Here is a function to do the test for each taxon (row):

run_one_test <- function(tax_prop_row) {
  # Reshape the single taxon row into one row per sample
  sample_ids <- sample_data[["PANGAEA ACCESSION NUMBER"]]
  props <- unlist(tax_prop_row[1, sample_ids])
  test_data <- tibble(sample_id = names(props), prop = props) %>%
    left_join(sample_data, by = c("sample_id" = "PANGAEA ACCESSION NUMBER"))
  # Regress proportion on latitude and keep the slope and its p-value
  lm_result <- summary(lm(prop ~ `LATITUDE (Decimal Degrees)`, data = test_data))
  output <- tibble(
    taxon_id = tax_prop_row$taxon_id,
    coeff = lm_result$coefficients[2, 1],
    pvalue = lm_result$coefficients[2, 4]
  )
  return(output)
}

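The reshaping and model fitting described above can be tried on a tiny made-up example using base R only. The tables toy_samples and toy_row and all of their values are hypothetical stand-ins for the real sample metadata and a single row of the taxon abundance table.

```r
# Made-up stand-ins for the real tables (all names and values are hypothetical)
toy_samples <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4"),
  latitude  = c(-40, -10, 20, 50)
)
toy_row <- data.frame(taxon_id = "ab", s1 = 0.10, s2 = 0.26, s3 = 0.38, s4 = 0.55)

# Transpose the single taxon row into one row per sample
props <- unlist(toy_row[1, toy_samples$sample_id])
long <- data.frame(sample_id = names(props), prop = unname(props))

# Attach the sample metadata by sample ID
long <- merge(long, toy_samples, by = "sample_id")

# Fit the regression and pull the slope and its p-value from the summary
fit <- summary(lm(prop ~ latitude, data = long))
toy_result <- data.frame(
  taxon_id = toy_row$taxon_id,
  coeff    = fit$coefficients["latitude", "Estimate"],
  pvalue   = fit$coefficients["latitude", "Pr(>|t|)"]
)

print(toy_result)
```

Indexing the coefficient matrix by name rather than by position, as done here, is a small design choice that makes it harder to pick up the intercept row by mistake.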
And here is how to run that function for each row and format the results as a table:

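obj$data$tax_lm <- obj$data$tax_abund %>%
  group_by(taxon_id) %>%
  group_split() %>%           # list with one single-row table per taxon
  lapply(run_one_test) %>%    # run the regression for each taxon
  bind_rows()                 # combine the results into one table

It would also be useful to have the per-taxon mean proportion for plotting and filtering, so we can add that as another table: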

obj$data$tax_mean_prop <- calc_group_mean(obj, data = "tax_abund",
                                          cols = sample_data[["PANGAEA ACCESSION NUMBER"]],
                                          groups = rep("mean_prop", nrow(sample_data)))

Now we have the results of the per-taxon regression in a format that can be plotted.

obj %>%
  filter_taxa(taxon_names == "Bacteria", subtaxa = TRUE, reassign_obs = FALSE) %>%
  filter_taxa(n_supertaxa < 3, taxon_names != "NA", reassign_obs = FALSE) %>%
  # filter_taxa(mean_prop > 0.00001, reassign_obs = FALSE) %>%
  filter_taxa(pvalue < 0.05, supertaxa = TRUE, reassign_obs = FALSE) %>%
  heat_tree(node_label = taxon_names,
            node_size = mean_prop, 
            node_color = ifelse(pvalue < 0.05, coeff, 0), 
            node_color_interval = c(-0.00001, 0.00001), 
            node_color_range = c("cyan", "gray", "tan"), 
            node_size_range = c(0.01, 0.04),
            node_size_axis_label = "Mean taxon read proportion",
            node_color_axis_label = "Regression coefficient",
            layout = "davidson-harel",
            initial_layout = "reingold-tilford")
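
The masking used for node_color in the plot above can be checked on its own with a small made-up table. The sketch below also derives a symmetric color interval from the data instead of hard-coding it; that is an optional variation rather than something the plot requires, and toy_lm and its values are hypothetical.

```r
# Made-up regression results for five taxa (values are hypothetical)
toy_lm <- data.frame(
  taxon_id = c("ab", "ac", "ad", "ae", "af"),
  coeff    = c(4e-6, -9e-6, 2e-6, -1e-6, 7e-6),
  pvalue   = c(0.01, 0.03, 0.40, 0.80, 0.04)
)

# Zero out slopes that are not significant so they get the neutral middle color
masked_coeff <- ifelse(toy_lm$pvalue < 0.05, toy_lm$coeff, 0)

# A symmetric interval around zero keeps 0 mapped to the middle of the color range
limit <- max(abs(masked_coeff))
color_interval <- c(-limit, limit)

print(masked_coeff)
print(color_interval)
```

A vector like color_interval could be passed as node_color_interval so that the gray midpoint always corresponds to a coefficient of zero, whatever the scale of the coefficients.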


It's clearly not the best dataset/variable as far as interesting results go, but it still demonstrates the idea. Once I have a chance to look into which statistical methods would be most appropriate, I might turn this process into a function to make it easier.