Color nodes according to correlation between relative abundance and external variables

jpcasasruiz commented 2 years ago

Hi there, first of all, thanks for taxa and metacoder. These are extremely useful packages.

I was wondering if it would be possible to color the nodes and edges of a taxonomic tree according the correlation between the relative abundance of each taxa and an external variable (e.g. any environmental variable such as pH or Salinity). Thanks!

zachary-foster commented 2 years ago

Thanks!

Yea, it should be, but there is no function to do that for you yet. You would have to fit a model for each taxon's abundance vs that external variable and add the correlation and p-value as columns in a table with per-taxon data in the taxmap object and then color by the correlation column, while setting all correlations with a p value > 0.05 to 0. Make sense?

zachary-foster commented 2 years ago

oops, did not mean to close.

zachary-foster commented 2 years ago

Apparently Ctrl + Enter closes issues now, haha

jpcasasruiz commented 2 years ago

Thanks for your fast response! That is what I had in mind, but I am not familiar enough with the taxmap object as to do it. Let me know if you ever give it a try. I think this would be a cool addition to future versions of metacoder.

zachary-foster commented 1 year ago

Here is a prototype of the technique that could be used to make a function in the future.

Load libraries

library(metacoder)

## This is metacoder version 0.3.6 (stable)

library(tidyr)
library(readr)
library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(purrr)

I will use the TARA expedition dataset since that is the first that comes to mind that had sample data with a continuous variable (latitude).

Parsing taxonomic data

The data set at the below URL was downloaded and uncompressed:

http://taraoceans.sb-roscoff.fr/EukDiv/data/Database_W5_OTU_occurences.tsv.zip

raw_data <- readr::read_tsv("data/Database_W5_OTU_occurences.tsv")
obj <- parse_tax_data(raw_data, class_cols = "lineage", class_sep = "\\|", sep_is_regex = TRUE)

Getting sample data

The sample data was downloaded from the URL below:

http://taraoceans.sb-roscoff.fr/EukDiv/data/Database_W1_Sample_parameters.xls

sample_data <- read_excel("data/Database_W1_Sample_parameters.xls")

Caluculate read abundance per taxon

The input data included read abundance for each sample-OTU combination, but we need the abundances associated with each taxon for graphing and regression. There will usually be multiple OTUs assigned to the same taxon, especially for coarse taxonomic ranks (e.g. the root will have all OTU indexes), so the abundances at those indexes are are summed to provide the total abundance for each taxon.

obj$data$otu_prop <- calc_obs_props(obj, data = "tax_data", cols = sample_data[["PANGAEA ACCESSION NUMBER"]])

## Calculating proportions from counts for 334 columns for 140707 observations.

obj$data$tax_abund <- calc_taxon_abund(obj, data = "otu_prop",
                                       cols = sample_data[["PANGAEA ACCESSION NUMBER"]])

## Summing per-taxon counts from 334 columns for 9512 taxa

Looking for correlations between latitude and taxon abundance

I will be using simple linear regression to demonstrate how this might work for plotting with metacoder, but it is likely the linear regression is not the correct method for this kind of proportional data. This code does the bare minimum as an example and should not be used as is for research. Once I learn what an appropriate method is I will try to remember to update this post.

The first step is to get a table for each taxon in a format that typical regression functions like lm expect. This would be a table with one row per sample and columns for abundance and another for the continuous variable of interest from the sample metadata. Since the taxonomic abundance matrix in the taxmap object has the abundance data in rows and the samples in columns, a single row corresponding to each taxon must be transposed and combined with the sample IDs from the column names. This can then be joined with the sample metadata based on sample IDs to generate the input for lm or similar functions. The output of lm can then be returned for each taxon and reformatted into a table with one row per taxon that includes columns for taxon ID, p-value, and correlation strength. Here is a function to do the test for each taxon (row):

run_one_test <- function(tax_prop_row) {
  sample_ids <- sample_data[["PANGAEA ACCESSION NUMBER"]]
  props <- unlist(tax_prop_row[1, sample_ids])
  test_data <- tibble(sample_id = names(props), prop = props) %>%
    left_join(sample_data, c("sample_id" = "PANGAEA ACCESSION NUMBER"))
  lm_result = summary(lm(prop ~ `LATITUDE (Decimal Degrees)`, data = test_data))
  output <- tibble(
    taxon_id = tax_prop_row$taxon_id,
    coeff = lm_result$coefficients[2, 1],
    pvalue = lm_result$coefficients[2, 4]
  )
  return(output)
}

And here is how to run that function for each row and format the results as a table:

obj$data$tax_lm <- obj$data$tax_abund %>%
  group_by(taxon_id) %>%
  group_split() %>%
  map_dfr(run_one_test)
obj

## <Taxmap>
##   9512 taxa: aab. Eukaryota ... obw. Metridia+gerlachei
##   9512 edges: NA->aab, NA->aac, NA->aad ... nxm->obv, nom->obw
##   4 data sets:
##     tax_data:
##       # A tibble: 140,707 × 349
##         taxon_id md5sum     cid  totab TARA_…¹ TARA_…² TARA_…³ TARA_…⁴
##         <chr>    <chr>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 isw      43fc0f… 127368 1.44e8  258322   73563  262269  331687
##       2 cpo      7e5e35…  86387 6.12e7    6476  141987  122993  379254
##       3 deo      fe8c24… 122743 3.56e7  643876      30  877579     355
##       # … with 140,704 more rows, 341 more variables:
##       #   TARA_N000000839 <dbl>, TARA_N000000845 <dbl>,
##       #   TARA_N000000837 <dbl>, TARA_N000000829 <dbl>,
##       #   TARA_N000000833 <dbl>, TARA_N000000841 <dbl>,
##       #   TARA_A100000402 <dbl>, TARA_A100001646 <dbl>,
##       #   TARA_A100000394 <dbl>, TARA_A100000393 <dbl>, …, and
##       #   abbreviated variable names ¹TARA_N000000077, …
##     otu_prop:
##       # A tibble: 140,707 × 335
##         taxon_id TARA_X00000…¹ TARA_…² TARA_…³ TARA_…⁴ TARA_…⁵ TARA_…⁶
##         <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 isw           0.123    3.22e-1 0.0209  0.400   5.89e-1 0.0327 
##       2 cpo           0.00735  1.33e-1 0.210   0.00913 9.90e-2 0.216  
##       3 deo           0.000325 3.07e-4 0.00137 0.175   5.33e-4 0.00230
##       # … with 140,704 more rows, 328 more variables:
##       #   TARA_X000000741 <dbl>, TARA_A200000148 <dbl>,
##       #   TARA_A200000158 <dbl>, TARA_A200000102 <dbl>,
##       #   TARA_A200000170 <dbl>, TARA_A200000110 <dbl>,
##       #   TARA_A200000123 <dbl>, TARA_A200000139 <dbl>,
##       #   TARA_X000001056 <dbl>, TARA_X000001010 <dbl>, …, and
##       #   abbreviated variable names ¹TARA_X000000361, …
##     tax_abund:
##       # A tibble: 9,512 × 335
##         taxon_id TARA_X00000…¹ TARA_…² TARA_…³ TARA_…⁴ TARA_…⁵ TARA_…⁶
##         <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 aab          0.996     9.68e-1  0.733  0.984   9.76e-1 0.811  
##       2 aac          0.00360   3.21e-2  0.222  0.0140  1.13e-2 0.176  
##       3 aad          0.0000433 6.11e-5  0.0298 0.00112 5.63e-4 0.00811
##       # … with 9,509 more rows, 328 more variables:
##       #   TARA_X000000741 <dbl>, TARA_A200000148 <dbl>,
##       #   TARA_A200000158 <dbl>, TARA_A200000102 <dbl>,
##       #   TARA_A200000170 <dbl>, TARA_A200000110 <dbl>,
##       #   TARA_A200000123 <dbl>, TARA_A200000139 <dbl>,
##       #   TARA_X000001056 <dbl>, TARA_X000001010 <dbl>, …, and
##       #   abbreviated variable names ¹TARA_X000000361, …
##     tax_lm:
##       # A tibble: 9,512 × 3
##         taxon_id       coeff pvalue
##         <chr>          <dbl>  <dbl>
##       1 aab      -0.00000805  0.971
##       2 aac       0.0000573   0.782
##       3 aad      -0.0000372   0.260
##       # … with 9,509 more rows
##   0 functions:

It would also be useful to have the per-taxon mean proportion for plotting and filtering, so we can add that as another table:

obj$data$tax_mean_prop <- calc_group_mean(obj, data = "tax_abund",
                                          cols = sample_data[["PANGAEA ACCESSION NUMBER"]],
                                          groups = "mean_prop")

Now we have the results of the per-taxon regression in a format that can be plotted.

obj %>%
  filter_taxa(taxon_names == "Bacteria", subtaxa = TRUE, reassign_obs = FALSE) %>%
  filter_taxa(n_supertaxa < 3, taxon_names != "NA", reassign_obs = FALSE) %>%
  # filter_taxa(mean_prop > 0.00001, reassign_obs = FALSE) %>%
  filter_taxa(pvalue < 0.05, supertaxa = TRUE, reassign_obs = FALSE) %>%
  heat_tree(node_label = taxon_names,
            node_size = mean_prop, 
            node_color = ifelse(pvalue < 0.05, coeff, 0), 
            node_color_interval = c(-0.00001, 0.00001), 
            node_color_range = c("cyan", "gray", "tan"), 
            node_size_range = c(0.01, 0.04),
            node_size_axis_label = "Mean taxon read proportion",
            node_color_axis_label = "Regression coeffecient",
            layout = "davidson-harel",
            initial_layout = "reingold-tilford")

unnamed-chunk-6-1

Its clearly not the best dataset/variable as far as interesting results go, but it still demonstrates the idea. Once I have a chance to look into which statistical methods would be most appropriate I might turn this process into a function to make it easier.

grunwaldlab / metacoder