grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation
Other
135 stars 28 forks source link

differential heat tree interpretation #222

Closed dougwyu closed 6 years ago

dougwyu commented 6 years ago

Hi,

could you please help me interpret the attached image? The sample codes are different habitats: CL = cropland (i.e. farms), and the other codes are different forest types. we have metabarcoded insect samples in these habitats and assigned taxonomies using RDP Classifier (0.8 cutoff).

Not surprisingly, cropland is compositionally very different from the forests, and in the differential heat trees comparing CL to the other habitats, the compositional differences are traced to down to genus level (essentially individual OTUs but they don't have species dets).

in contrast, compositional differences between the forest types are mostly only traced down to order level. E.g. NF and BB differ in the number of Araneae and Hemiptera OTUs and in the number of Mycetophila OTUs.

I'm not sure how to interpret this. Is it because all the cropland samples consistently have more or fewer of specific OTUs compared to all the forest samples, whereas the forest samples within a forest type differ in which (say, Araneae) OTUs they have, but the different forest types differ in the overall number of Araneae OTUs?

screen shot 2018-03-14 at 12 20 37
zachary-foster commented 6 years ago

Hello @dougwyu, sure!

However, the interpretation is dependent on the code you used to make the figure. Did you follow the example on the GitHub readme?

Did you set any differences to 0 that were not significant? If so, this could explain why there are sometimes genera that are not different although their orders are since the variation of read depths could be different between the two ranks.

I'm not sure how to interpret this. Is it because all the cropland samples consistently have more or fewer of specific OTUs compared to all the forest samples, whereas the forest samples within a forest type differ in which (say, Araneae) OTUs they have, but the different forest types differ in the overall number of Araneae OTUs?

If you followed the example, then the colors represent differences in read depth, not differences between the number of OTUs. You could do that if you want though.

I cant say much more without more specifics about how you did the analysis.

By the way, the development version will force the matrix to be square, so it will look nicer.

dougwyu commented 6 years ago

Hi Zachary,

sorry for the delay. My student ran this code, using the code in your tutorial. I wanted to look at the code again and see if i could figure out myself, but i haven't been able to.

we convert our OTU tables to presence/absence (1/0) data before all community analyses, so if the code below is for read count, it will automatically be for OTU count.

doug

library(metacoder) library(tibble)

mbc_otus <- read.table("data/for_metacoder/2014mbc_otus_655_metacoder.txt", header = T, sep = "\t") mbc_samples <- read.table("data/for_metacoder/2014mbc_samples_655_metacoder.txt", header = T, sep = "\t")

str(mbc_otus) str(mbc_samples)

change data type

mbc_otus$OTU_id <- as.character(mbc_otus$OTU_id) mbc_otus$lineage <- as.character(mbc_otus$lineage) mbc_samples$Site <- as.character(mbc_samples$Site) mbc_samples$Habitat <- as.character(mbc_samples$Habitat)

mbc_otus <- as_tibble(mbc_otus) mbc_samples <- as_tibble(mbc_samples) print(mbc_otus) print(mbc_samples)

obj <- parse_tax_data(mbc_otus, class_cols = "lineage", class_sep = ";", class_key = c(tax_rank = "info", tax_name = "taxon_name"), class_regex = "^(.+)__(.+)$")

This returns a taxmap object. The taxmap class is designed to store any number of tables, lists, or vectors associated with taxonomic information and facilitate manipulating the data in a cohesive way. Here is what that object looks like:

print(obj) # or click on the object in the Environment pane

accounting for un-even sampling

obj$data$tax_data <- calc_obs_props(obj, "tax_data")

print(obj)

Getting per-taxon information

Currently, we have values for the abundance of each OTU, not each taxon. To get information on the taxa, we can sum the abundance per-taxon like so:

obj$data$tax_abund <- calc_taxon_abund(obj, "tax_data", cols = mbc_samples$Site) print(obj)

Note that there is now an additional table with one row per taxon.

We can also easily calculate the number of samples have reads for each taxon:

obj$data$tax_occ <- calc_n_samples(obj, "tax_abund", groups = mbc_samples$Habitat) print(obj)

Plotting taxonomic data

Now that we have per-taxon information, we can plot the information using heat trees. The code below plots the number of “Nose” samples that have reads for each taxon. It also plots the number of OTUs assigned to each taxon in the overall dataset.

heat_tree(obj, node_label = obj$taxon_names(), node_size = obj$n_obs(), node_color = obj$data$tax_occ$NF, node_size_axis_label = "OTU count", node_color_axis_label = "Samples with reads")

Comparing two treaents/groups

obj$data$diff_table <- compare_groups(obj, dataset = "tax_abund", cols = mbc_samples$Site, groups = mbc_samples$Habitat) print(obj$data$diff_table)

note column called log2_median_ratio

heat_tree(obj, node_label = taxon_names, node_size = n_obs, node_color = log2_median_ratio, node_color_interval = c(-2, 2), node_color_range = c("cyan", "gray", "tan"), node_size_axis_label = "OTU count", node_color_axis_label = "Log 2 ratio of median proportions")

obj$data$diff_table$wilcox_p_value <- p.adjust(obj$data$diff_table$wilcox_p_value, method = "fdr") hist(obj$data$diff_table$wilcox_p_value)

Comparing any number of treatments/groups

obj$data$diff_table <- compare_groups(obj, dataset = "tax_abund", cols = mbc_samples$Site, groups = mbc_samples$Habitat) print(obj$data$diff_table) heat_tree_matrix(obj, dataset = "diff_table", node_size = n_obs, node_label = taxon_names, node_color = log2_median_ratio, node_color_range = diverging_palette(), node_color_trans = "linear", node_color_interval = c(-3, 3), edge_color_interval = c(-3, 3), node_size_axis_label = "Number of OTUs", node_color_axis_label = "Log2 ratio median proportions")

zachary-foster commented 6 years ago

so if the code below is for read count, it will automatically be for OTU count.

Yea, thats right. That means the color is the log 2 ratio of median proportions of OTUs rather than reads.

I think it has something to do with you converting counts to presence/absence and calculating proportion of OTUs for each taxon. Since everything is 0 or 1, there are only a few values possible for the proportion of OTUs for each taxon, so the median does not always capture the difference. For example:

> x1 = c(.1, .1, .1, .1, 0, 0, 0)
> x2 = c(.5, .5, .5, .1, 0, 0, 0)
> median(x1) == median(x2)
[1] TRUE
> mean(x1)
[1] 0.05714286
> mean(x2)
[1] 0.2285714

Perhaps you can use mean instead of median? You can see how to change what compare_groups returns in the docs ?compare_groups. Look at the doc for the func option. Let me know if you have questions.

Make sense?

zachary-foster commented 6 years ago

closing due to inactivity. Let me know if you want to discuss this more