Accessing sample data within a taxmap object to create a heat tree.

foreignsand commented 3 years ago

I would like to access sample data within a taxmap object to create a heat tree.

My data are produced using ecoPCR to test primer sets against a few thousand genetic sequences—all fishes—from GenBank. For each sequence, the output includes the number of mismatches associated with each primer and sequence information is then linked with taxonomic information in Metacoder.

I have brought the data into Metacoder and created a taxmap object and plotted heat trees, but I would like to subset and/or analyze the data using the fwd_mismatch and rev_mismatch information from the ecoPCR output. My R script follows and I've attached my data files and am hoping someone can help me figure out how to do this. There are additional comments below to clarify what I am trying to do.

## Environment Preparation
# Load libraries
library(metacoder)
library(dplyr)
library(taxa)
library(readr)
library(ape)

# Set working directory
setwd("/Users/lemurbear/Documents/_mine/OSU/Master's/tutorials")

## Data Preparation
# Load data
test_data = read_csv("andersen.csv.txt")
tax_data = read_tsv("taxonomy.txt")
head(test_data)
head(tax_data)

# Combine data
my_test_data = cbind(test_data, tax_data)
head(my_test_data$taxonomy)

# Create taxmap object
test_taxa = parse_tax_data(my_test_data, class_cols = c("taxonomy"), 
                           class_sep = ";", 
                           class_regex = "^([a-z])_(.*)$", 
                           class_key = c("tax_rank" = "taxon_rank", "name" = "taxon_name"))

tail(taxon_names(test_taxa), 40)
head(taxon_ranks(test_taxa), 40)

test_taxa$data$class_data = NULL

## Plotting
# Plot test tree
heat_tree(test_taxa)

# Filter to family
test_taxa %>% 
  filter_taxa(taxon_ranks == "f", supertaxa = TRUE) %>% 
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = n_obs, 
            layout = "fr")

# Filter taxa to select only samples with mismatches:
#   I'm not sure how to get this to work
test_filter = test_taxa$data$tax_data$fwd_mismatch > 0
head(test_filter)

test_taxa %>% 
  filter_taxa(test_taxa$data$tax_data$fwd_mismatch > 0) %>% 
  filter_taxa(taxon_ranks == "s", supertaxa = TRUE) %>%
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = n_obs, 
            layout = "fr")

# Would it be possible to plot the heat tree and set node_size and node_color 
# to be associated with the percentage of taxa within a taxonomic group that 
# have a mismatch? i.e. The percentage of Perciformes with mismatches and then, within
# Perciformes, the percentage of each family with mismatches, and so on.

Many thanks for any and all help!

andersen.csv.txt taxonomy.txt

zachary-foster commented 3 years ago

Hello, I am just about to leave for camping until Monday. I will get back to this then. Sorry for the delay!

zachary-foster commented 3 years ago

Check out the attached example:

## Environment Preparation
# Load libraries
library(metacoder)
library(dplyr)
library(taxa)
library(readr)
library(ape)

# Set working directory
#setwd("/Users/lemurbear/Documents/_mine/OSU/Master's/tutorials")
# See https://www.tidyverse.org/blog/2017/12/workflow-vs-script/

## Data Preparation
# Load data
# test_data = read_csv("andersen.csv.txt") # .txt is not needed
test_data = read_csv("andersen.csv") 
tax_data = read_tsv("taxonomy.txt")
head(test_data)
head(tax_data)

# Combine data
my_test_data = cbind(test_data, tax_data)
head(my_test_data$taxonomy)

# Create taxmap object
test_taxa = parse_tax_data(my_test_data, class_cols = c("taxonomy"), 
                           class_sep = ";", 
                           class_regex = "^([a-z])_(.*)$", 
                           class_key = c("tax_rank" = "taxon_rank", "name" = "taxon_name"))

tail(taxon_names(test_taxa), 40)
head(taxon_ranks(test_taxa), 40)

test_taxa$data$class_data = NULL

## Plotting
# Plot test tree
heat_tree(test_taxa)

# Filter to family
test_taxa %>% 
  filter_taxa(taxon_ranks == "f", supertaxa = TRUE) %>% 
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = n_obs, 
            layout = "fr")

# Get per-taxon data for plotting/filtering
test_taxa$data$per_taxon <- tibble(taxon_id = taxon_ids(test_taxa))
test_taxa$data$mean_fwd_mismatch <- unlist(lapply(obs(test_taxa, data = 'tax_data'), function(index) {
  mean(test_taxa$data$tax_data$fwd_mismatch[index])
}))
test_taxa$data$mean_rev_mismatch <- unlist(lapply(obs(test_taxa, data = 'tax_data'), function(index) {
  mean(test_taxa$data$tax_data$rev_mismatch[index])
}))

# Filter taxa to select only samples with mismatches:
test_taxa %>% 
  filter_taxa(mean_fwd_mismatch >= 3) %>% 
  filter_taxa(taxon_ranks == "f", supertaxa = TRUE) %>%
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = mean_fwd_mismatch, 
            initial_layout = "re",
            layout = "da")

# Would it be possible to plot the heat tree and set node_size and node_color 
# to be associated with the percentage of taxa within a taxonomic group that 
# have a mismatch? i.e. The percentage of Perciformes with mismatches and then, within
# Perciformes, the percentage of each family with mismatches, and so on.
test_taxa$data$fwd_mismatch_perc <- unlist(lapply(obs(test_taxa, data = 'mean_fwd_mismatch'), function(index) {
  100 * sum(test_taxa$data$mean_fwd_mismatch[index] > 0) / length(index)
}))

test_taxa %>% 
  filter_taxa(taxon_ranks == "f", supertaxa = TRUE) %>%
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = fwd_mismatch_perc, 
            initial_layout = "re",
            layout = "da")

A few points:

It helps organization to not rely on setwd, since if you move the folder or give it to another person the code will not work. I suggest using R projects if using RStudio, or the here package. See https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
You dont need the '.txt' on a '.csv' file. '.csv' implies a plain-text format
The reason you were having problems is that you needed per-taxon data. obs is used to get each value for in a dataset for each taxon. You can use that to generate per-taxon values to use to filter taxa or plot values.

Hope this helps!

R project

foreignsand commented 3 years ago

Thanks so much! I do generally use projects to share R code but because I had just thrown this together I didn't resave it as one. Apologies if that caused any problems. I also only added the .txt extension here as a workaround because github doesn't allow files with the .csv extension to be uploaded. It's so ridiculous because, as you pointed out, the extension is essentially meaningless in this case, but it worked.

Many many thanks for your help! I really appreciate it. I'll try this out and will let you know how it goes. :)

foreignsand commented 3 years ago

Thanks so much again for your help! This is working fantastically well. I have one additional question about whether it's possible to force the range of the plots of the proportion of specimens with mismatches.

To clarify, for some of the barcodes I am analyzing, 100% of the specimens have primer mismatches and the plot then looks like this:

plot_pctfwd.pdf

where the range of proportions has a single value of 1. I would like for the range to be from 0 to 1 on all plots. Is there a way to specify this range?

Many thanks!

zachary-foster commented 3 years ago

No problem!

Give node_color_interval = c(0, 1) a try

foreignsand commented 3 years ago

Thanks so much again! That worked like a charm and my plots are gorge!

plot_pctrev.pdf

grunwaldlab / metacoder

Accessing sample data within a taxmap object to create a heat tree. #298