grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation
Other
135 stars 28 forks source link

Accessing sample data within a taxmap object to create a heat tree. #298

Closed foreignsand closed 3 years ago

foreignsand commented 3 years ago

I would like to access sample data within a taxmap object to create a heat tree.

My data are produced using ecoPCR to test primer sets against a few thousand genetic sequences—all fishes—from GenBank. For each sequence, the output includes the number of mismatches associated with each primer and sequence information is then linked with taxonomic information in Metacoder.

I have brought the data into Metacoder and created a taxmap object and plotted heat trees, but I would like to subset and/or analyze the data using the fwd_mismatch and rev_mismatch information from the ecoPCR output. My R script follows and I've attached my data files and am hoping someone can help me figure out how to do this. There are additional comments below to clarify what I am trying to do.

## Environment Preparation
# Load libraries
library(metacoder)
library(dplyr)
library(taxa)
library(readr)
library(ape)

# Set working directory
setwd("/Users/lemurbear/Documents/_mine/OSU/Master's/tutorials")

## Data Preparation
# Load data
test_data = read_csv("andersen.csv.txt")
tax_data = read_tsv("taxonomy.txt")
head(test_data)
head(tax_data)

# Combine data
my_test_data = cbind(test_data, tax_data)
head(my_test_data$taxonomy)

# Create taxmap object
test_taxa = parse_tax_data(my_test_data, class_cols = c("taxonomy"), 
                           class_sep = ";", 
                           class_regex = "^([a-z])_(.*)$", 
                           class_key = c("tax_rank" = "taxon_rank", "name" = "taxon_name"))

tail(taxon_names(test_taxa), 40)
head(taxon_ranks(test_taxa), 40)

test_taxa$data$class_data = NULL

## Plotting
# Plot test tree
heat_tree(test_taxa)

# Filter to family
test_taxa %>% 
  filter_taxa(taxon_ranks == "f", supertaxa = TRUE) %>% 
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = n_obs, 
            layout = "fr")

# Filter taxa to select only samples with mismatches:
#   I'm not sure how to get this to work
test_filter = test_taxa$data$tax_data$fwd_mismatch > 0
head(test_filter)

test_taxa %>% 
  filter_taxa(test_taxa$data$tax_data$fwd_mismatch > 0) %>% 
  filter_taxa(taxon_ranks == "s", supertaxa = TRUE) %>%
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = n_obs, 
            layout = "fr")

# Would it be possible to plot the heat tree and set node_size and node_color 
# to be associated with the percentage of taxa within a taxonomic group that 
# have a mismatch? i.e. The percentage of Perciformes with mismatches and then, within
# Perciformes, the percentage of each family with mismatches, and so on.

Many thanks for any and all help!

andersen.csv.txt taxonomy.txt

zachary-foster commented 3 years ago

Hello, I am just about to leave for camping until Monday. I will get back to this then. Sorry for the delay!

zachary-foster commented 3 years ago

Check out the attached example:

## Environment Preparation
# Load libraries
library(metacoder)
library(dplyr)
library(taxa)
library(readr)
library(ape)

# Set working directory
#setwd("/Users/lemurbear/Documents/_mine/OSU/Master's/tutorials")
# See https://www.tidyverse.org/blog/2017/12/workflow-vs-script/

## Data Preparation
# Load data
# test_data = read_csv("andersen.csv.txt") # .txt is not needed
test_data = read_csv("andersen.csv") 
tax_data = read_tsv("taxonomy.txt")
head(test_data)
head(tax_data)

# Combine data
my_test_data = cbind(test_data, tax_data)
head(my_test_data$taxonomy)

# Create taxmap object
test_taxa = parse_tax_data(my_test_data, class_cols = c("taxonomy"), 
                           class_sep = ";", 
                           class_regex = "^([a-z])_(.*)$", 
                           class_key = c("tax_rank" = "taxon_rank", "name" = "taxon_name"))

tail(taxon_names(test_taxa), 40)
head(taxon_ranks(test_taxa), 40)

test_taxa$data$class_data = NULL

## Plotting
# Plot test tree
heat_tree(test_taxa)

# Filter to family
test_taxa %>% 
  filter_taxa(taxon_ranks == "f", supertaxa = TRUE) %>% 
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = n_obs, 
            layout = "fr")

# Get per-taxon data for plotting/filtering
test_taxa$data$per_taxon <- tibble(taxon_id = taxon_ids(test_taxa))
test_taxa$data$mean_fwd_mismatch <- unlist(lapply(obs(test_taxa, data = 'tax_data'), function(index) {
  mean(test_taxa$data$tax_data$fwd_mismatch[index])
}))
test_taxa$data$mean_rev_mismatch <- unlist(lapply(obs(test_taxa, data = 'tax_data'), function(index) {
  mean(test_taxa$data$tax_data$rev_mismatch[index])
}))

# Filter taxa to select only samples with mismatches:
test_taxa %>% 
  filter_taxa(mean_fwd_mismatch >= 3) %>% 
  filter_taxa(taxon_ranks == "f", supertaxa = TRUE) %>%
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = mean_fwd_mismatch, 
            initial_layout = "re",
            layout = "da")

# Would it be possible to plot the heat tree and set node_size and node_color 
# to be associated with the percentage of taxa within a taxonomic group that 
# have a mismatch? i.e. The percentage of Perciformes with mismatches and then, within
# Perciformes, the percentage of each family with mismatches, and so on.
test_taxa$data$fwd_mismatch_perc <- unlist(lapply(obs(test_taxa, data = 'mean_fwd_mismatch'), function(index) {
  100 * sum(test_taxa$data$mean_fwd_mismatch[index] > 0) / length(index)
}))

test_taxa %>% 
  filter_taxa(taxon_ranks == "f", supertaxa = TRUE) %>%
  heat_tree(node_label = taxon_names, 
            node_size = n_obs, 
            node_color = fwd_mismatch_perc, 
            initial_layout = "re",
            layout = "da")

A few points:

Hope this helps!

R project

foreignsand commented 3 years ago

Thanks so much! I do generally use projects to share R code but because I had just thrown this together I didn't resave it as one. Apologies if that caused any problems. I also only added the .txt extension here as a workaround because github doesn't allow files with the .csv extension to be uploaded. It's so ridiculous because, as you pointed out, the extension is essentially meaningless in this case, but it worked.

Many many thanks for your help! I really appreciate it. I'll try this out and will let you know how it goes. :)

foreignsand commented 3 years ago

Thanks so much again for your help! This is working fantastically well. I have one additional question about whether it's possible to force the range of the plots of the proportion of specimens with mismatches.

To clarify, for some of the barcodes I am analyzing, 100% of the specimens have primer mismatches and the plot then looks like this:

plot_pctfwd.pdf

where the range of proportions has a single value of 1. I would like for the range to be from 0 to 1 on all plots. Is there a way to specify this range?

Many thanks!

zachary-foster commented 3 years ago

No problem!

Give node_color_interval = c(0, 1) a try

foreignsand commented 3 years ago

Thanks so much again! That worked like a charm and my plots are gorge!

plot_pctrev.pdf