Open sghignone opened 3 years ago
Hi Stefano,
Thanks!
Since heat tree matrix supports multiple comparisons you have to consider how to handle cases where a taxon has higher than a 2.5 fold change in one comparison but not another. The example below includes taxa if they are > 2.5 in any comparison.
Try this:
library(metacoder)
# Parse data for plotting
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
class_regex = "^(.+)__(.+)$")
# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = hmp_samples$sample_id)
# Calculate difference between groups
x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table",
cols = hmp_samples$sample_id,
groups = hmp_samples$body_site)
# Remove taxa with only small differences
per_taxon_fold_changes <- obs(x, data = 'diff_table', value = 'log2FoldChange')
per_taxon_max_change <- unlist(lapply(per_taxon_fold_changes, function(tax_changes) max(abs(tax_changes))))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
I saw you posted this to the google groups and in that question you asked about filtering out NAs as well. You can use the same technique, but again you have to consider if a NA in one comparison should cause the taxon to be removed from all comparisons.
Thanks for the explanation Zachary! In the case of only two comparisons, I can identify my differentially abundant taxa as follows, i.e. padj > 0.05 and Log2FoldChange > 2.5 (upregulated) and < -2.5 (downregulated):
signALL<-obj.physeq.TvS$data$tax_T_v_S %>% filter((padj > 0.05 & log2FoldChange < -2.5) | (padj > 0.05 & log2FoldChange > 2.5))
I was wondering how to implement this filter in a filter_taxa function, so to pass to heat_tree only those taxa really significant (and just not plotting them out of the total)
To take into account the p-values, you could modify the code like so:
library(metacoder)
# Parse data for plotting
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
class_regex = "^(.+)__(.+)$")
# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = hmp_samples$sample_id)
# Calculate difference between groups
x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table",
cols = hmp_samples$sample_id,
groups = hmp_samples$body_site)
# Remove taxa with only small differences
per_taxon_fold_changes <- obs(x, data = 'diff_table', value = 'log2FoldChange')
per_taxon_p_values <- obs(x, data = 'diff_table', value = 'pvalue')
per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
if (all(per_taxon_p_values[[i]] > 0.05)) {
return(0)
} else {
return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05])))
}
}))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
# Plot results (might take a few minutes)
heat_tree_matrix(x,
data = "diff_table",
node_size = n_obs,
node_label = taxon_names,
node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange),
node_color_range = diverging_palette(),
node_color_trans = "linear",
node_color_interval = c(-3, 3),
edge_color_interval = c(-3, 3),
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Log2 fold change")
This would preserve only taxa that have a change (positive or negative) of greater than 2.5 associated with a p value less than .05 in at least one comparison. Does this do what you want?
Hello!
I found this discussion very helpful and I am trying to mimic these steps to create a similar heat tree. I think the problem that I am running into is that my taxa have zero values for at least one one of my samples (but I could be wrong - very amateur at R), resulting in the following message when I run the "calc_diff_abund_deseq2" command:
converting counts to integer mode estimating size factors Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means
I read on another discussion board that this can be avoided by adding pseudo-counts, or adding +1, to each value in the otu table. Is there a quick way that you can think to do this? I imagine this will work out fine, as this will not change which taxa are significantly different.
Thanks in advance! Brett
I have not had that problem personally that I remember, but I imagine you have 3 options: 1) filter the input data in some way to remove low-abundance samples. Do you have a negative control with all 0s for example? 2) Add some small amount like 1 to all counts like you said. This is pretty easy to do. It depends a bit on your data structure, but might be as simple as my_table[my_cols] <- my_table[my_cols] + 1
, where my_cols
contains the subset of columns with counts (could be a column in your sample metadata table, if you have one). 3) Use a different method for differential taxon abundance.
Thanks for the quick response!
I had initially followed the Metacoder workshop tutorial (only using my own data) so the names of my files and the structure are similar. I already removed all of the taxa that had no values (e.g. 0 for all samples) and then tried this:
obj$data$otu_counts <- obj$data$otu_counts + 1 Error in FUN(left, right) : non-numeric argument to binary operator
Sorry - really no clue what I'm doing here... but I really appreciate the help!
No problem!
The way that you would do it assuming you have data structured like the workshop would be something like:
obj$data$otu_counts[sample_data$SampleID] <- obj$data$otu_counts[sample_data$SampleID] + 1
Basically, if you add 1 to a table (or matrix), R guesses that you want to add 1 to all the values. However, that might not make sense if some of the columns are not numeric. So what you need to do is just replace the data for the numeric columns with the incremented version, and leave the non-numeric columns alone. Here sample_data$SampleID
is a column in the sample_data
table that corresponds to the column names (sample names) in the abundance matrix.
Referring to the columns for index would also work, but is less readable (e.g. obj$data$otu_counts[1:12345] <- obj$data$otu_counts[1:12345] + 1
)
Amazing - it worked! In the home stretch now :)
Unfortunately when I try to generate the heat tree, now this happens:
> ##Removes taxa with small/not significant differences##
> per_taxon_fold_changes <- obs(obj, data = 'diff_table', value = 'log2FoldChange')
> per_taxon_p_values <- obs(obj, data = 'diff_table', value = 'pvalue')
> per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
+ if (all(per_taxon_p_values[[i]] > 0.05)) {
+ return(0)
+ } else {
+ return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05])))
+ }
+ }))
> obj <- filter_taxa(obj, per_taxon_max_change > 1.2, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
> ##Builds heat tree##
> heat_tree_matrix(obj,
+ data = "diff_table",
+ node_size = n_obs,
+ node_label = taxon_names,
+ node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange),
+ node_color_range = diverging_palette(),
+ node_color_trans = "linear",
+ node_color_interval = c(-3, 3),
+ edge_color_interval = c(-3, 3),)
Error: All pairs being compared should have one value per taxon (142). The following do not:
Vehicle vs. Chemo (815)
Did I use the wrong part of the data as input...? Maybe I am entering multiple values on accident...
Thanks again!
That error happens when the data being plotted is not one value per taxon for at least one comparison. Its saying that there are 142 taxa to be plotted, but "diff_table" has 815 rows for one comparison. At some point between calculating taxon differences and plotting, you filtered out taxa without filtering out filtering the differences in "diff_table" in the same way. I cant tell how that might have happened. I suggest check how the number of taxa and number of rows per comparison in "diff_table" change as the code progresses to find which step is causing the problem.
Okay, I took out any filtering steps that I added (like the steps from the tutorial that filter out non-Bacteria and OTUs that had all zero counts) and edited my commands to resemble the above script (posted on Jan 7th)
as closely as possible. This did remove the need for adding the pseduo-counts and the error regarding too many rows for comparison, but now I have a new problem. Everything runs without error until the "heat_tree_matrix()", which returns the error "Error in apply(layout_matrix, MARGIN = 2, function(x) all(is.na(x))) : dim(X) must have a positive length"
I'm copying my code here (and happy to share the files if need-be) - any ideas on what's wrong now? I ultimately just want to show the taxonomic differences (in both positive and negative directions above a determined fold-difference) between my control and treatment group on (perhaps) a single heat tree. I'm just trying to do some hypothesis generation, so I'm also okay with using the un-corrected pvalues (for now). Am I going about this all wrong?
Thanks a TON!!!
library(readr)
otu_data <- read_tsv("/Volumes/Secure KeyZ/Chemo_BA/16S/Chemo_For_HeatTrees/ChemoDose_Bioms/DC_Biom.txt")
tax_data <- read_tsv("/Volumes/Secure KeyZ/Chemo_BA/16S/Chemo_For_HeatTrees/ChemoDose_Taxonomy/taxonomy_edited.txt")
library(dplyr)
otu_data <- left_join(otu_data, tax_data, by = c("OTU_ID" = "OTU_ID"))
library(taxa) library(metacoder)
sample_data <- read_tsv("/Volumes/Secure KeyZ/Chemo_BA/16S/DC_sample_data.txt")
x = parse_tax_data(otu_data, class_cols = "Taxonomy", class_sep = ";", class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"), classregex = "^([a-z]{0,1}){0,2}(.*)$")
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = sample_data$Sample_ID)
x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table", cols = sample_data$Sample_ID, groups = sample_data$Treatment)
per_taxon_fold_changes <- obs(x, data = 'diff_table', value = 'log2FoldChange') per_taxon_p_values <- obs(x, data = 'diff_table', value = 'pvalue') per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) { if (all(per_taxon_p_values[[i]] > 0.05)) { return(0) } else { return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05]))) } })) x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
heat_tree_matrix(x, data = "diff_table", node_size = n_obs, node_label = taxon_names, node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange), node_color_range = diverging_palette(), node_color_trans = "linear", node_color_interval = c(-3, 3), edge_color_interval = c(-3, 3), node_size_axis_label = "Number of OTUs", node_color_axis_label = "Log2 fold change")
Its hard to tell without the data to test. Can you send your code in an .R file and enough data to reproduce the problem so I can see what is going on? Thanks!
Hi Zachary, sorry for extremely slow reply,
I saw you posted this to the google groups and in that question you asked about filtering out NAs as well. You can use the same technique, but again you have to consider if a NA in one comparison should cause the taxon to be removed from all comparisons.
This was resolved when I realized to work with a rarefied otu_table; reverting back to raw count solved everything. Cheers.
per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
if (all(per_taxon_p_values[[i]] > 0.05)) {
return(0)
} else {
return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05])))
}
}))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
This would preserve only taxa that have a change (positive or negative) of greater than 2.5 associated with a p value less than .05 in at least one comparison. Does this do what you want?
In theory, YES. In practice, I'm getting an error when running that step:
Error in if (all(per_taxon_p_values[[i]] > 0.05)) { :
missing value where TRUE/FALSE needed
Called from: FUN(X[[i]], ...)
what's going on?
S.-
...EDIT... PS. OOK, I have NAs in per_taxon_p_values ;how can i deal with it?
Many R functions have the na.rm
option to ignore NAs, including all
and max
, so this would probably work:
per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
if (all(per_taxon_p_values[[i]] > 0.05, na.rm = TRUE)) {
return(0)
} else {
return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05]), na.rm = TRUE))
}
}))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
You could also do something like:
per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
pvals <- per_taxon_p_values[[i]]
pvals <- pvals[! is.na(pvals)]
if (length(pvals) == 0 || all(pvals > 0.05)) {
return(0)
} else {
return(max(abs(per_taxon_fold_changes[[i]][pvals <= 0.05])))
}
}))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
Thanks Zachary for the follow-op. Both functions run without warnings or errors, but I get them downstream. 1) first function version
> per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
+ if (all(per_taxon_p_values[[i]] > 0.05, na.rm = TRUE)) {
+ return(0)
+ } else {
+ return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05]), na.rm = TRUE))
+ }
+ }))
> x <- filter_taxa(x, per_taxon_max_change > 2.5,
+ supertaxa = TRUE,
+ reassign_obs = c(diff_table = FALSE))
**Warning message:
There is no "taxon_id" column in the data set "3", so there are no taxon IDs.**
> heat_tree_matrix(x,
+ data = "diff_table",
+ node_size = n_obs,
+ node_label = taxon_names,
+ node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange),
+ node_color_range = diverging_palette(),
+ node_color_trans = "linear",
+ node_color_interval = c(-3, 3),
+ edge_color_interval = c(-3, 3),
+ node_size_axis_label = "Number of OTUs",
+ node_color_axis_label = "Log2 fold change")
**Error in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
VECTOR_ELT() can only be applied to a 'list', not a 'NULL'**
2) second function version
> per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
+ pvals <- per_taxon_p_values[[i]]
+ pvals <- pvals[! is.na(pvals)]
+ if (length(pvals) == 0 || all(pvals > 0.05)) {
+ return(0)
+ } else {
+ return(max(abs(per_taxon_fold_changes[[i]][pvals <= 0.05])))
+ }
+ }))
> x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
Error: TRUE/FALSE vector (length = 1352) must be the same length as the number of taxa (760)
thanks for your attention s.-
I just got the VECTOR_ELT() can only be applied to a 'list', not a 'NULL'**
error too yesterday. I reinstalled metacoder
from and it went away. Might be a issue with a dependency. For the 2nd error, did you by any chance try to filter the object x
that was already filtered by the first chunk of code?
Hi @zachary-foster , i am using metacoder to visualise groups that are differentially abundant and significant between nursery and all other forest compartments, I have 21 forest compartments in total. I would like to use the nursery as the baseline for comparison against all the other forest compartments such that the nursery is one group and the other group is all the forest compartments (21 in total). My end goal is to end up with a plot showing nursery against all other forest compartments comparisons...Would the code here work for these comparisons? I have been able to group them into these two categories by splitting the phyloseq object and generating a figure but i'd like to check if this would be the best way to do it or would you recommend doing individual forest compartments vs nursery. My code below for nursery vs all other forest compartments that i've grouped under two separate categories
OTU_nursery = as(otu_table(ps.nursery), "matrix")
OTU_nurserydf = as.data.frame(t(OTU_nursery))
OTU_nurserydf <- OTU_nurserydf %>% rownames_to_column(var="ASVs")
Nursery_taxa_root_only_decontam_noNTC_rar=data.frame(phyloseq::tax_table(ps.nursery))
write.csv(Nursery_taxa_root_only_decontam_noNTC_rar,file="/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_nursery_Taxa_root_only_decontam_noNTC_rar_forMETACODER.csv")
Taxa_nursery=read.csv("/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_nursery_Taxa_root_only_decontam_noNTC_rar_forMETACODER_edited.csv")
ps.all.except.nursery <- subset_samples(ps_aftercontam_removal_noNTC_root_rar_p, Forest.comp.sort != "0_1") ps.all.except.nursery<- prune_taxa(taxa_sums(ps.all.except.nursery) > 0, ps.all.except.nursery) # prune OTUs that are not present in at least one sample
OTU_all_except_nursery = as(otu_table(ps.all.except.nursery), "matrix")
OTU_all_except_nurserydf = as.data.frame(t(OTU_all_except_nursery))
OTU_all_except_nurserydf <- OTU_all_except_nurserydf %>% rownames_to_column(var="ASVs")
write.csv(OTU_all_except_nurserydf,file="/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_allexcept_nursery_ASVdf_root_only_decontam_noNTC_rar_forMETACODER.csv")
All_except_nursery_taxa_root_only_decontam_noNTC_rar=data.frame(phyloseq::tax_table(ps.all.except.nursery))
write.csv(All_except_nursery_taxa_root_only_decontam_noNTC_rar,file="/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_allexcept_nursery_Taxa_root_only_decontam_noNTC_rar_forMETACODER.csv")
Taxa_allexcept_nursery=read.csv("/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_allexcept_nursery_Taxa_root_only_decontam_noNTC_rar_forMETACODER_edited.csv")
Taxa_allexcept_nursery$Category="All Forest compartments" Taxa_nursery$Category="Nursery"
Taxa_nursery_allFC=rbind(Taxa_nursery,Taxa_allexcept_nursery)
Taxa_nursery_allFC$ASVs
<- as.character(Taxa_nursery_allFC$ASVs
) # Must be same type for join to work
OTU2_root_only_decontam_noNTC_rar$ASVs <- as.character(OTU2_root_only_decontam_noNTC_rar$ASVs) # Must be same type for join to work
OTU_root_df <- left_join(OTU2_root_only_decontam_noNTC_rar, Taxa_nursery_allFC,
by = c("ASVs")) # identifies cols with shared IDs
print(OTU_root_df)
tail(colnames(OTU_root_df), n = 10)
Root_sample_data=subset.data.frame(Root_sample_data,Type=="Root") Root_sample_data=subset.data.frame(Root_sample_data,!SampleID%in% c("NIR16S_T_9_LCHV7","SIR16S_MC2_C1_LCHV7"))
Root_sample_data_ed=subset.data.frame(Root_sample_data,Forest.compartment.ID=="Tokoroa nursery") #nursery metadata Root_sample_data_ed2=subset.data.frame(Root_sample_data,Forest.compartment.ID!="Tokoroa nursery") #all other forest comp metadata
Root_sample_data_ed$Category="Nursery" Root_sample_data_ed2$Category="All Forest compartments"
Root_sample_data_ed3=rbind(Root_sample_data_ed,Root_sample_data_ed2)
x = parse_tax_data(OTU_root_df, class_cols = "Taxonomy", class_sep = ";", class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"), classregex = "^([a-z]{0,1}){0,2}(.*)$")
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = Root_sample_data_ed3$SampleID)
x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table", cols = Root_sample_data_ed3$SampleID, groups = Root_sample_data_ed3$Category)
per_taxon_fold_changes <- obs(x, data = 'diff_table', value = 'log2FoldChange') per_taxon_p_values <- obs(x, data = 'diff_table', value = 'pvalue') per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) { if (all(per_taxon_p_values[[i]] > 0.05, na.rm = TRUE)) { return(0) } else { return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05]), na.rm = TRUE)) } })) x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
heat_tree_matrix(x, data = "diff_table", node_size = n_obs, node_label = taxon_names, node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange), node_color_range = diverging_palette(), node_color_trans = "linear", node_color_interval = c(-3, 3), edge_color_interval = c(-3, 3), node_size_axis_label = "Number of OTUs", node_color_axis_label = "Log2 fold change")
I also noticed that the phyloseq before the split contains 7573 ASV but after splitting and recombining them its showing up as 7910 in the plot attached, not sure why ? and is there a way to adjust the wording in the plots so its more visible? Can you help please? Thanks !
Hello, I will try to take a look at this next week.
Hi again, first of all, this package is amazing! I'm dealing with heat_tree_matrix on data obtained with calc_diff_abund_deseq2; what should I do on the diff_table in order to
keep in consideration only those taxa with Log2FoldChange > 2.5 (upregulated) and < -2.5 (downregulated)?
Many thanks in advance! Cheers Stefano