Add correlation plots? Or any other helpful plots.

By correlation plots (#229) I just mean a scatter plot comparing two variables. Below I've given a use case example from one of our labs projects vallenderlab also datasnakes.

Vallender Lab Microbiome Project

Stressed samples and Control samples
Used log2 mean ratio of abundance before and after stress
Used a log based axis
Fitted a 1:1 ratio line
Added color to the data points to indicate significant values
- Insignificant changes in abundance are left black
- Data points above the 1:1 line are blue and above are red

Interactive Plotly Example

Static plot:

Cool, I like it! It seems like the output of compare_groups would work well for this. A few thoughts:

How to determine which taxa are displayed? There will not always be rank information and perhaps people might what to have taxa of different ranks. Perhaps the simplest and most versatile option is just to plot all taxa and expect the user the use filter_taxa before plotting and %>% the result in. There would need to be examples of this if that is the strategy.
If the output of compare_groups is used and there are more than two things being compared, how to select which comparison to plot? Should this be extended to multiple comparisons by faceting? If so, should the facets be wrapped with the titles saying what is compared, or in a matrix like heat_tree_matrix. Perhaps there can be an as_matrix = FALSE option, if multiple comparisons are supported.
I would like a blank white background, but thats just my preference. I am open to other opinions.
It would be nice to include an option for the minimum significant p value.
Perhaps an option for what to include for the labels, but default to taxon_names.
Should there be an option to scale the size of the dots by some other value (e.g. n_obs)? If the function returns a ggplot2 obejct, can this be done after by the user?

It seems like the output of _comparegroups would work well for this.

@zachary-foster That's exactly what I did with the correlation plots lol. I'll type up a gist tomorrow and show you how I did it. Pretty simple stuff.

Also, I really need to go back through all of the functions again with the test data. I feel like I haven't used some of the functions or don't understand a few of them. I really explored your functions pretty hard though, so I think I have a handle on it. I just don't want to leave anything undone.

What do you think about a table containing some key statistical output from the _comparegroups function? Maybe unnecessary, but it was another thing I did for the same project:

Like a table ready to used in Rmarkdown? How would this differ from the current output of compare_groups?

Like a table ready to used in Rmarkdown?

Yes, this would be a table that's markdown ready.

How would this differ from the current output of compare_groups?

The current tibble returned by compare_groups looks a bit different. Below I've added the diff_table I created using a custom function for generating other statistics:

> silva_smb_data$metacoder_data$data$diff_table
# A tibble: 420 x 14
   taxon_id treatment_1 treatment_2 log2_median_ratio log2_mean_ratio median_diff mean_diff mean_treat1
   <chr>    <chr>       <chr>                   <dbl>           <dbl>       <dbl>     <dbl>       <dbl>
 1 ac       Stressed    Control               0.00272         0.00155   0.00186    0.00106     0.986   
 2 af       Stressed    Control               0.322           0.377     0.0850     0.102       0.442   
 3 ag       Stressed    Control              -0.542          -0.228    -0.00454   -0.00455     0.0265  
 4 ah       Stressed    Control               0.804           0.622     0.00733    0.0109      0.0310  
 5 ai       Stressed    Control              -2.07            0.493    -0.000742   0.000546    0.00189 
 6 aj       Stressed    Control              -0.260          -0.296    -0.0955    -0.109       0.477   
 7 ak       Stressed    Control               1.07            1.64      0.000365   0.00143     0.00210 
 8 al       Stressed    Control              -2.65           -2.25     -0.000880  -0.000953    0.000253
 9 an       Stressed    Control               1.21            1.16      0.000441   0.000764    0.00138 
10 ao       Stressed    Control               0.653          -1.63      0.0000817 -0.00190     0.000903
# ... with 410 more rows, and 6 more variables: mean_treat2 <dbl>, wilcox_p_value <dbl>,
#   hartigan_dip_treat1 <dbl>, hartigan_dip_treat2 <dbl>, bimodality_coeff_treat1 <dbl>,
#   bimodality_coeff_treat2 <dbl>

The difference is that there are only select variables (columns) in the publication ready table, and the data is agglomerated by a specific rank (Phylum in the example). Other subtaxa columns could be added if desired. While this could be useful, the user could also do this themselves, and I might have missed an important concept in one of the other metacoder/taxmap functions.

It might be cooler to add an agglomeration function like in phyloseq::tax_glom() as well.

Phyloseq workflow

In my project I've done this with phyloseq. I'm not always a super efficient R programmer so some of it might be redundant or coded better with functions like lapply, cbind, etc.:

phylum_phy <- phyloseq::tax_glom(phyloseq_object, taxrank=rank)
# Get data frames
  df_tax <- data.frame(tax_table(phylum_phy))
  df_otu <- data.frame(otu_table(phylum_phy))
  df_sam <- data.frame(sample_data(phylum_phy))

# Get various variables for filtering
  otu_names <- rownames(df_tax)
  stressed_samples <- rownames(df_sam[df_sam$TreatmentGroup == "Stressed",])
  control_samples <- rownames(df_sam[df_sam$TreatmentGroup == "Control",])
# Add a OTU column
  df_tax <- mutate(df_tax, OTU=rownames(df_tax))
  df_otu <- mutate(df_otu, OTU=rownames(df_otu))

# Join the tax table and otu table (phyloseq)
  df_tax_otu <- left_join(x=df_otu, y=df_tax, by="OTU")
  rownames(df_tax_otu) <- df_tax_otu$OTU
  temp <- df_tax_otu[,!(colnames(df_tax_otu) %in% c("OTU"))][c(stressed_samples, control_samples)]

# Get statistical vectors for the main table
  wilcox_p_value <- apply(as.matrix(temp), 1, function(x) wilcox.test(x[stressed_samples], x[control_samples])$p.value)
  mean_stressed <- apply(as.matrix(temp), 1, function(x) mean(x[stressed_samples]))
  mean_control <- apply(as.matrix(temp), 1, function(x) mean(x[control_samples]))
  log2_mean_ratio <- apply(as.matrix(temp), 1, function(x) { log_ratio <- log2(mean(x[stressed_samples]) / mean(x[control_samples]))
                             if (is.nan(log_ratio)) {
                               log_ratio <- 0
                             }
                             if (is.infinite(log_ratio)) {
                               log_ratio <- 0
                             }
                              log_ratio
                              })
  stats <- data.frame(wilcox_p_value=wilcox_p_value, mean_stressed = mean_stressed, mean_control = mean_control, log2_mean_ratio = log2_mean_ratio)
  stats$OTU <- names(wilcox_p_value)
  df_tax_otu <- left_join(df_tax_otu, stats, by="OTU")

# Curate the main data table
  df_tax_otu <- mutate(df_tax_otu, Significance=wilcox_p_value < 0.05)
  rownames(df_tax_otu) <- df_tax_otu$OTU
  df_tax_otu$OTU <- NULL
  main_df <- df_tax_otu[49:62][,c(rank, "mean_stressed", "mean_control", "wilcox_p_value", "log2_mean_ratio", "Significance", unlist(ranks[ranks != rank]))]
  main_df <- main_df %>% select_if(~sum(!is.na(.)) > 0)

Metacoder workflow

The metacoder workflow would conceptually look something like this (don't focus on the data types or anything. this is just a rough draft):

...
metacoder_obj <- metacoder::agglomerate(metacoder_obj, rank="Phylum")
# Calc functions before or after agglomeration??
# Comp group??
# Merge the diff table and tax table, filter by p-value, and keep specific columns
metacoder_obj$data$phylum_table <- metacoder::merge_and_filter(tables=c(diff_table, tax_table), filter = wilcox_p_value <= 0.05, 
columns=c(Phylum, Class, Order, wilcox_p_value, log2_mean_ratio, mean_stressed, mean_control))

Ok, I think I get it now. Yea, I think this could be useful! It could be abstracted to a function that combines per-taxon info from multiple tables (anything in obj$all_names() or any external variable named by taxon ids) into a single table. It would error if told to use data with multiple values per taxon. There could be a TRUE/FALSE option to return a markdown-ready table using knitr::kable. This was kindof the goal of taxa::get_data_frame, but it is not very developed/useful yet.

It might be cooler to add an agglomeration function like in phyloseq::tax_glom() as well.

Im not sure this is needed since this behavior is handled well by other functions. For filtering data, filter_taxa can remove ranks and reassign everything to the remaining ranks automatically. Also, calc_taxon_abund sums the counts for every taxon at every rank, so you can just subset the result of that to get the counts for a specific rank. However, since the info for every rank is there, subsetting the whole object to a specific rank is not really needed. I guess the equivalent of phyloseq::tax_glom() is something like:

library(metacoder)

# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Get per-taxon counts
x$data$tax_counts <- calc_taxon_abund(x, dataset = "tax_data",
                                      cols = hmp_samples$sample_id)

# Subset 
filter_taxa(x, taxon_ranks == "p", supertaxa = TRUE, reassign_obs = c(tax_counts = FALSE))

<Taxmap>
  8 taxa: ab. Root, ac. Proteobacteria, ad. Bacteroidetes ... ag. Fusobacteria, ah. Tenericutes, ai. Spirochaetes
  8 edges: NA->ab, ab->ac, ab->ad, ab->ae, ab->af, ab->ag, ab->ah, ab->ai
  3 data sets:
    tax_data:
      # A tibble: 1,000 x 53
        taxon_id otu_id  lineage    `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044`
        <chr>    <chr>   <chr>            <int>       <int>       <int>       <int>       <int>       <int>       <int>
      1 ac       OTU_97… r__Root;p…           0           2           1           0           0           0           0
      2 ad       OTU_97… r__Root;p…           0           0           0           0           0           0           0
      3 ad       OTU_97… r__Root;p…           0           1           0           0           0           0           0
      # ... with 997 more rows, and 43 more variables: `700101365` <int>, `700100431` <int>, `700016050` <int>,
      #   `700032425` <int>, `700024855` <int>, `700103488` <int>, `700096869` <int>, `700107379` <int>,
      #   `700096422` <int>, `700102417` <int>, …
    class_data:
      # A tibble: 5,922 x 5
        taxon_id input_index tax_rank tax_name            regex_match           
        <chr>          <int> <chr>    <chr>               <chr>                 
      1 ab                 1 r        Root                r__Root               
      2 ac                 1 p        Proteobacteria      p__Proteobacteria     
      3 ac                 1 c        Gammaproteobacteria c__Gammaproteobacteria
      # ... with 5,919 more rows
    tax_counts:
      # A tibble: 8 x 51
        taxon_id `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044` `700101365`
        <chr>          <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
      1 ab              955.       4330.        887.        598.       6303.       3379.       1219.        829.
      2 ac              193.        144.         17.        132.       1388.         30.          0.         21.
      3 ad                5.         43.          6.         21.         20.         15.          7.          5.
      # ... with 5 more rows, and 42 more variables: `700100431` <dbl>, `700016050` <dbl>, `700032425` <dbl>,
      #   `700024855` <dbl>, `700103488` <dbl>, `700096869` <dbl>, `700107379` <dbl>, `700096422` <dbl>,
      #   `700102417` <dbl>, `700114168` <dbl>, …
  0 functions:

Phyloseq workflow

Thanks for the phyoseq workflow! Its nice to be able to compare the methods.

Metacoder workflow

It can be done something like this:

library(phyloseq)
library(metacoder)
library(dplyr)

# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Convert counts to proportions
x$data$otu_table <- calc_obs_props(x, dataset = "tax_data",
                                   cols = hmp_samples$sample_id)

# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, dataset = "otu_table",
                                     cols = hmp_samples$sample_id)

# Get per-taxon body site mean proportions 
x$data$site_means <- calc_group_mean(x, dataset = "tax_table",
                                     cols = hmp_samples$sample_id,
                                     groups = hmp_samples$body_site)

# Calculate difference between groups
x$data$diff_table <- compare_groups(x, dataset = "tax_table",
                                    cols = hmp_samples$sample_id,
                                    groups = hmp_samples$body_site)

# Make table of most different classes between nose an saliva
filter_taxa(x, taxon_ranks == "c", reassign_obs = FALSE) %>%
  mutate_obs("diff_table", taxon_name = taxon_names, mean_nose = Nose, mean_saliva = Saliva) %>%
  get_dataset("diff_table") %>% # New function I just added to `taxa`
  filter(wilcox_p_value <= 0.05, treatment_1 == "Nose", treatment_2 == "Saliva") %>%
  arrange(mean_diff) %>%
  transmute(Class = taxon_name, "Mean nose proportion" = mean_nose,  "Mean saliva proportion" = mean_saliva,  "Mean difference" = mean_diff, "P value" = wilcox_p_value) %>%
  knitr::kable()

Class	Mean nose proportion	Mean saliva proportion	Mean difference	P value
Bacteroidia	0.0062143	0.2488121	-0.2425979	0.0000108
Clostridia	0.0058709	0.2016887	-0.1958178	0.0000108
Gammaproteobacteria	0.0365834	0.1138908	-0.0773074	0.0051960
Betaproteobacteria	0.0312206	0.0900281	-0.0588075	0.0112070
Fusobacteria	0.0004220	0.0452540	-0.0448320	0.0001494
Flavobacteria	0.0046150	0.0242023	-0.0195873	0.0012604
Epsilonproteobacteria	0.0175916	0.0089710	0.0086206	0.0116113
Actinobacteria	0.6039006	0.0229178	0.5809828	0.0000108

In this example you gave me:

I guess the equivalent of phyloseq::tax_glom() is something like:

library(metacoder)

# Parse dataset
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
                   class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
                   class_regex = "^(.+)__(.+)$")

# Get per-taxon counts
x$data$tax_counts <- calc_taxon_abund(x, dataset = "tax_data",
                                      cols = hmp_samples$sample_id)

# Subset 
filter_taxa(x, taxon_ranks == "p", supertaxa = TRUE, reassign_obs = c(tax_counts = FALSE))

Does reassign_obs need to be false? I would think that it would need to be true so that all of the indexes would change with the data.

This issue is also old so things may have changed between now and then.

This issue is also old so things may have changed between now and then.

I think filter_taxa behaves mostly the same, at least in this regard.

Does reassign_obs need to be false? I would think that it would need to be true so that all of the indexes would change with the data.

Was the goal similar to #255 ?

If it was false, "tax_counts" would not be subset, just reassigned to the remaining 8 taxa. For example the row with info for Clostridia would be reassinged to Firmicutes, even though a row for Firmicutes already exists, so now you have two rows assigned to Firmicutes, one for Firmicutes and one for Clostridia, which is confusing.

One source of confusion for a lot of people is that this is using filter_taxa for its side effect (filtering a table), rather than its main goal of filtering the taxonomy.

Perhaps I need to think of a way to make filter_obs use per-taxon info to filter, so something like this is possible:

filter_obs(x, "tax_counts", taxon_ranks == "p")

That way just the table is filtered, not the taxonomy. It actually already works in this case probably, but only because the rows of "tax_counts" happen to correspond to taxa 1 to 1. It would have to work in other cases too, which gets more tricky. Also, if I do:

filter_obs(x, "tax_counts", taxon_names == "Actinobacteria")

I probably want the subtaxa as well, but the "subtaxa" option in filter_obs relates to how taxa are removed from the taxonomy if drop_taxa is TRUE, so the following would not work currently:

filter_obs(x, "tax_counts", taxon_names == "Actinobacteria", subtaxa = TRUE)

Perhaps I could rename that option "drop_subtaxa" and make the above work.

Just brainstorming, not sure if that is relevant to your question.

Did I answer your question?

If it was false, "tax_counts" would not be subset, just reassigned to the remaining 8 taxa.

Yes, this answers my question and more!

Just brainstorming, not sure if that is relevant to your question.

I have some more ideas for you to brew over, but I'll make another issue (https://github.com/grunwaldlab/metacoder/issues/254) as I did recently :rofl:

grunwaldlab / metacoder