Open jpcasasruiz opened 2 years ago
Thanks!
Yea, it should be, but there is no function to do that for you yet. You would have to fit a model for each taxon's abundance vs that external variable and add the correlation and p-value as columns in a table with per-taxon data in the taxmap
object and then color by the correlation column, while setting all correlations with a p value > 0.05 to 0. Make sense?
oops, did not mean to close.
Apparently Ctrl
+ Enter
closes issues now, haha
Thanks for your fast response! That is what I had in mind, but I am not familiar enough with the taxmap object as to do it. Let me know if you ever give it a try. I think this would be a cool addition to future versions of metacoder.
Here is a prototype of the technique that could be used to make a function in the future.
library(metacoder)
## This is metacoder version 0.3.6 (stable)
library(tidyr)
library(readr)
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(purrr)
I will use the TARA expedition dataset since that is the first that comes to mind that had sample data with a continuous variable (latitude).
The data set at the below URL was downloaded and uncompressed:
http://taraoceans.sb-roscoff.fr/EukDiv/data/Database_W5_OTU_occurences.tsv.zip
raw_data <- readr::read_tsv("data/Database_W5_OTU_occurences.tsv")
obj <- parse_tax_data(raw_data, class_cols = "lineage", class_sep = "\\|", sep_is_regex = TRUE)
The sample data was downloaded from the URL below:
http://taraoceans.sb-roscoff.fr/EukDiv/data/Database_W1_Sample_parameters.xls
sample_data <- read_excel("data/Database_W1_Sample_parameters.xls")
The input data included read abundance for each sample-OTU combination, but we need the abundances associated with each taxon for graphing and regression. There will usually be multiple OTUs assigned to the same taxon, especially for coarse taxonomic ranks (e.g. the root will have all OTU indexes), so the abundances at those indexes are are summed to provide the total abundance for each taxon.
obj$data$otu_prop <- calc_obs_props(obj, data = "tax_data", cols = sample_data[["PANGAEA ACCESSION NUMBER"]])
## Calculating proportions from counts for 334 columns for 140707 observations.
obj$data$tax_abund <- calc_taxon_abund(obj, data = "otu_prop",
cols = sample_data[["PANGAEA ACCESSION NUMBER"]])
## Summing per-taxon counts from 334 columns for 9512 taxa
I will be using simple linear regression to demonstrate how this might
work for plotting with metacoder
, but it is likely the linear
regression is not the correct method for this kind of proportional data.
This code does the bare minimum as an example and should not be used as
is for research. Once I learn what an appropriate method is I will try
to remember to update this post.
The first step is to get a table for each taxon in a format that typical
regression functions like lm
expect. This would be a table with one
row per sample and columns for abundance and another for the continuous
variable of interest from the sample metadata. Since the taxonomic
abundance matrix in the taxmap
object has the abundance data in rows
and the samples in columns, a single row corresponding to each taxon
must be transposed and combined with the sample IDs from the column
names. This can then be joined with the sample metadata based on sample
IDs to generate the input for lm
or similar functions. The output of
lm
can then be returned for each taxon and reformatted into a table
with one row per taxon that includes columns for taxon ID, p-value, and
correlation strength. Here is a function to do the test for each taxon
(row):
run_one_test <- function(tax_prop_row) {
sample_ids <- sample_data[["PANGAEA ACCESSION NUMBER"]]
props <- unlist(tax_prop_row[1, sample_ids])
test_data <- tibble(sample_id = names(props), prop = props) %>%
left_join(sample_data, c("sample_id" = "PANGAEA ACCESSION NUMBER"))
lm_result = summary(lm(prop ~ `LATITUDE (Decimal Degrees)`, data = test_data))
output <- tibble(
taxon_id = tax_prop_row$taxon_id,
coeff = lm_result$coefficients[2, 1],
pvalue = lm_result$coefficients[2, 4]
)
return(output)
}
And here is how to run that function for each row and format the results as a table:
obj$data$tax_lm <- obj$data$tax_abund %>%
group_by(taxon_id) %>%
group_split() %>%
map_dfr(run_one_test)
obj
## <Taxmap>
## 9512 taxa: aab. Eukaryota ... obw. Metridia+gerlachei
## 9512 edges: NA->aab, NA->aac, NA->aad ... nxm->obv, nom->obw
## 4 data sets:
## tax_data:
## # A tibble: 140,707 × 349
## taxon_id md5sum cid totab TARA_…¹ TARA_…² TARA_…³ TARA_…⁴
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 isw 43fc0f… 127368 1.44e8 258322 73563 262269 331687
## 2 cpo 7e5e35… 86387 6.12e7 6476 141987 122993 379254
## 3 deo fe8c24… 122743 3.56e7 643876 30 877579 355
## # … with 140,704 more rows, 341 more variables:
## # TARA_N000000839 <dbl>, TARA_N000000845 <dbl>,
## # TARA_N000000837 <dbl>, TARA_N000000829 <dbl>,
## # TARA_N000000833 <dbl>, TARA_N000000841 <dbl>,
## # TARA_A100000402 <dbl>, TARA_A100001646 <dbl>,
## # TARA_A100000394 <dbl>, TARA_A100000393 <dbl>, …, and
## # abbreviated variable names ¹TARA_N000000077, …
## otu_prop:
## # A tibble: 140,707 × 335
## taxon_id TARA_X00000…¹ TARA_…² TARA_…³ TARA_…⁴ TARA_…⁵ TARA_…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 isw 0.123 3.22e-1 0.0209 0.400 5.89e-1 0.0327
## 2 cpo 0.00735 1.33e-1 0.210 0.00913 9.90e-2 0.216
## 3 deo 0.000325 3.07e-4 0.00137 0.175 5.33e-4 0.00230
## # … with 140,704 more rows, 328 more variables:
## # TARA_X000000741 <dbl>, TARA_A200000148 <dbl>,
## # TARA_A200000158 <dbl>, TARA_A200000102 <dbl>,
## # TARA_A200000170 <dbl>, TARA_A200000110 <dbl>,
## # TARA_A200000123 <dbl>, TARA_A200000139 <dbl>,
## # TARA_X000001056 <dbl>, TARA_X000001010 <dbl>, …, and
## # abbreviated variable names ¹TARA_X000000361, …
## tax_abund:
## # A tibble: 9,512 × 335
## taxon_id TARA_X00000…¹ TARA_…² TARA_…³ TARA_…⁴ TARA_…⁵ TARA_…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aab 0.996 9.68e-1 0.733 0.984 9.76e-1 0.811
## 2 aac 0.00360 3.21e-2 0.222 0.0140 1.13e-2 0.176
## 3 aad 0.0000433 6.11e-5 0.0298 0.00112 5.63e-4 0.00811
## # … with 9,509 more rows, 328 more variables:
## # TARA_X000000741 <dbl>, TARA_A200000148 <dbl>,
## # TARA_A200000158 <dbl>, TARA_A200000102 <dbl>,
## # TARA_A200000170 <dbl>, TARA_A200000110 <dbl>,
## # TARA_A200000123 <dbl>, TARA_A200000139 <dbl>,
## # TARA_X000001056 <dbl>, TARA_X000001010 <dbl>, …, and
## # abbreviated variable names ¹TARA_X000000361, …
## tax_lm:
## # A tibble: 9,512 × 3
## taxon_id coeff pvalue
## <chr> <dbl> <dbl>
## 1 aab -0.00000805 0.971
## 2 aac 0.0000573 0.782
## 3 aad -0.0000372 0.260
## # … with 9,509 more rows
## 0 functions:
It would also be useful to have the per-taxon mean proportion for plotting and filtering, so we can add that as another table:
obj$data$tax_mean_prop <- calc_group_mean(obj, data = "tax_abund",
cols = sample_data[["PANGAEA ACCESSION NUMBER"]],
groups = "mean_prop")
Now we have the results of the per-taxon regression in a format that can be plotted.
obj %>%
filter_taxa(taxon_names == "Bacteria", subtaxa = TRUE, reassign_obs = FALSE) %>%
filter_taxa(n_supertaxa < 3, taxon_names != "NA", reassign_obs = FALSE) %>%
# filter_taxa(mean_prop > 0.00001, reassign_obs = FALSE) %>%
filter_taxa(pvalue < 0.05, supertaxa = TRUE, reassign_obs = FALSE) %>%
heat_tree(node_label = taxon_names,
node_size = mean_prop,
node_color = ifelse(pvalue < 0.05, coeff, 0),
node_color_interval = c(-0.00001, 0.00001),
node_color_range = c("cyan", "gray", "tan"),
node_size_range = c(0.01, 0.04),
node_size_axis_label = "Mean taxon read proportion",
node_color_axis_label = "Regression coeffecient",
layout = "davidson-harel",
initial_layout = "reingold-tilford")
Its clearly not the best dataset/variable as far as interesting results go, but it still demonstrates the idea. Once I have a chance to look into which statistical methods would be most appropriate I might turn this process into a function to make it easier.
Hi there, first of all, thanks for taxa and metacoder. These are extremely useful packages.
I was wondering if it would be possible to color the nodes and edges of a taxonomic tree according the correlation between the relative abundance of each taxa and an external variable (e.g. any environmental variable such as pH or Salinity). Thanks!