grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation
Other
136 stars 28 forks source link

compare_groups function gives multiple comparison per taxon per comparison #272

Open LChWu opened 5 years ago

LChWu commented 5 years ago

Hello, I have an issue with the compare_groups function. I want to create a heat tree matrix for which I use compare_groups on relative abundance counts created with calc_obs_props. But it will return multiple different values for the same taxon for the comparison between two treatments. Here is some of my code:

## create taxmap file, 
# structure of taxonomy of used tibble:
head(otu$taxonomy)

## [1] "k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Stramenopiles; f__; g__; s__"                        
## [2] "k__Bacteria; p__Planctomycetes; c__Phycisphaerae; o__MSBL9; f__; g__; s__"                             
## [3] "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Chromatiales; f__; g__; s__"                
## [4] "k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__SB-34; f__; g__; s__"                                 
## [5] "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Hyphomicrobiaceae; g__; s__"
## [6] "k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Stramenopiles; f__; g__; s__"

obj <- parse_tax_data(otu,
                      class_cols = "taxonomy",
                      class_sep = ";",
                      class_regex = "^\\s{0,1}D{0,1}_{0,1}([0-9]{0,1})_{0,2}(.*)$",
                      class_key = c("rank" = "taxon_rank", "name" = "taxon_name"))
print(obj)

## <Taxmap>
##   2996 taxa: aab. Bacteria ... elg. uncultured bacterium
##   2996 edges: NA->aab, NA->aac ... ddb->elf, ddc->elg
##   2 data sets:
##     tax_data:
##       # A tibble: 7,429 x 83
##         taxon_id OTU_ID   B48   B78   B79   B34   B82   B73
##         <chr>    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##       1 ddd      OTU29~    21     6     4     2    19    24
##       2 aac      OTU32~     0   404     2     0     3     0
##       3 dde      OTU63~     2     0     2     0     5     3
##       # ... with 7,426 more rows, and 75 more variables:
##       #   B80 <dbl>, B63 <dbl>, B40 <dbl>, B42 <dbl>,
##       #   B16 <dbl>, B46 <dbl>, B36 <dbl>, B45 <dbl>,
##       #   B60 <dbl>, B72 <dbl>, ...
##     class_data:
##       # A tibble: 41,163 x 5
##         taxon_id input_index rank  name        regex_match     
##         <chr>          <int> <chr> <chr>       <chr>           
##       1 aab                1 0     Bacteria    D_0__Bacteria   
##       2 aae                1 1     Cyanobacte~ D_1__Cyanobacte~
##       3 acr                1 2     Oxyphotoba~ D_2__Oxyphotoba~
##       # ... with 4.116e+04 more rows
##   0 functions:

## after some renaming and deleting certain taxa calculate relative abundance
obj$data$rel_abd <- calc_obs_props(obj, "otu_counts", other_cols = T)

## calculate difference between groups
obj$data$diff_site <- compare_groups(obj, data = "rel_abd",
                                     cols = mdata$SampleID,
                                     groups = mdata$SiteName)
print(obj$data$diff_site, n=20)

## # A tibble: 30,942 x 7
##    taxon_id treatment_1    treatment_2  log2_median_ratio median_diff  mean_diff wilcox_p_value
##    <chr>    <chr>          <chr>                    <dbl>       <dbl>      <dbl>          <dbl>
##  1 ajd      DrygalskiFjord ChurchTrough             0.568   0.000193   0.000210       0.0232   
##  2 ayo      DrygalskiFjord ChurchTrough            -1.57   -0.0000391 -0.0000447      0.274   
##  3 ajd      DrygalskiFjord ChurchTrough             0.760   0.186      0.196          0.00209  
##  4 bwj      DrygalskiFjord ChurchTrough            -0.582  -0.000980  -0.000640       0.247    
##  5 bwk      DrygalskiFjord ChurchTrough            -1.25   -0.00313   -0.00433        0.0115   
##  6 ayr      DrygalskiFjord ChurchTrough            -3.16   -0.00277   -0.00459        0.0000433
##  7 acx      DrygalskiFjord ChurchTrough          -Inf      -0.000405  -0.000333       0.00793  
##  8 bwo      DrygalskiFjord ChurchTrough             1.09    0.000482   0.000185       0.393    
##  9 bwj      DrygalskiFjord ChurchTrough            -0.539  -0.00198   -0.00181        0.218    
## 10 ayw      DrygalskiFjord ChurchTrough            -4.54   -0.00502   -0.00487        0.000130 
## 11 bws      DrygalskiFjord ChurchTrough            -0.481  -0.00220   -0.00476        0.0115   
## 12 acx      DrygalskiFjord ChurchTrough             0.476   0.000539   0.00127        0.353    
## 13 ajo      DrygalskiFjord ChurchTrough          -Inf      -0.0000370 -0.0000466      0.0549   
## 14 ajp      DrygalskiFjord ChurchTrough            -1.22   -0.00318   -0.00269        0.0753   
## 15 ayy      DrygalskiFjord ChurchTrough             2.62    0.000587   0.000706       0.000431 
## 16 ayy      DrygalskiFjord ChurchTrough             1.65    0.0136     0.0117         0.0115   
## 17 ajq      DrygalskiFjord ChurchTrough          -Inf      -0.0000857 -0.000174       0.000751 
## 18 ajr      DrygalskiFjord ChurchTrough          -Inf      -0.000382  -0.000579       0.00221  
## 19 bwy      DrygalskiFjord ChurchTrough            -6.52   -0.00301   -0.00283        0.000179 
## 20 bwz      DrygalskiFjord ChurchTrough            -2.27   -0.0133    -0.0117         0.0000758
## # ... with 3.092e+04 more rows

You can see in line 15 and 16 that they have the same taxon ID but comparing the same two sites. There are four sites in total which are compared. When I try to calculate the heat_tree_matrix it will give me the error Error: All pairs being compared should have one value per taxon. The following do not: even though I used reassign_obs = c(diff_site = FALSE) in filter_taxa.

I looked a lot around and tried to find similar issues from other people, but I was not succesful so far. I don't think the problem is that it recognizes more than four different sites, because I can e.g. calculate the relative abundance for the four sites seperatly and it will only give me these four.

I hope you can help me. Cheers!

zachary-foster commented 5 years ago

Hello @LChWu,

Can you send me a data set and code to reproduce this? I am not sure what the issue is without being able to look at the data. You can email me if you would rather at zacharyfoster1989@gmail.com. Thanks

LChWu commented 5 years ago

Hello @zachary-foster , I send you my data and code per mail. Cheers

zachary-foster commented 5 years ago

Hi @LChWu, I found the problem.

You gave compare_groups the OTU abundance rather than the taxon abundance, so you had multiple OTUs matching a single taxon. So you had info on the OTU differences rather than taxon differences and heat trees can only plot taxon data.

I changed this:

## Calculate relative abundance
obj$data$rel_abd <- calc_obs_props(obj, "otu_counts", other_cols = T)
print(obj)

############### heat tree #####################

## difference of taxon abundances per site
# compare relative abundance per site
obj$data$diff_site <- compare_groups(obj, data = "rel_abd",
                                     cols = mdata$SampleID,
                                     groups = mdata$SiteName)

to this:

## Calculate relative abundance
obj$data$rel_abd <- calc_obs_props(obj, "otu_counts", other_cols = T)
print(obj)

## Calculate per-taxon abundance
obj$data$tax_rel_abd <- calc_taxon_abund(obj, "rel_abd")
print(obj)

############### heat tree #####################

## difference of taxon abundances per site
# compare relative abundance per site
obj$data$diff_site <- compare_groups(obj, data = "tax_rel_abd",
                                     cols = mdata$SampleID,
                                     groups = mdata$SiteName)

and it seemed to work. Let me know if you have questions about his or anything else.

Maria-Vivas commented 4 years ago

Hello, I have the same error with the compare_groups function as @LChWu. I am trying to create a heat tree matrix using the compare_groups function. However, when I plot the heat tree it returns an error indicating that I have different values for the same taxon.

Here is some of my code:

obj <- parse_tax_data(otu_data,
                      class_cols = "taxonomy", 
                      class_sep = ";", 
                      class_regex = "^([a-z]{0,1})_{0,2}(.*)$", 
                      class_key = c("tax_rank" = "taxon_rank", "name" = "taxon_name"))

obj$data$class_data <- NULL

names(obj$data) <- "otu_counts"

obj <- filter_taxa(obj, taxon_names != "")

obj <- filter_taxa(obj, taxon_names == "Fungi", subtaxa = TRUE)
## calculate the abundance per taxon from the OTU counts.
obj$data$tax_abund <- calc_taxon_abund(obj, "otu_counts",
                                       cols = sample_data$sample_ID)
##Summing per-taxon counts from 108 columns for 744 taxa
print(obj)
##<Taxmap>
##  744 taxa: aac. Fungi ... cdn. Mortierella_tsukubaensis
##  744 edges: NA->aac, aac->aaf ... bjn->cdn
##  2 data sets:
##    otu_counts:
##      # A tibble: 607 x 111
##        taxon_id OTU_ID   AA1   AA2   AA3   AA4
##        <chr>    <chr>  <dbl> <dbl> <dbl> <dbl>
##      1 aac      OTU2     173     0   545    58
##      2 aaf      OTU3     997     0  2577   808
##      3 art      OTU4       0     0     0     0
##      # ... with 604 more rows, and 105 more
##      #   variables: AA5 <dbl>, AA6 <dbl>,
##      #   AA7 <dbl>, AA8 <dbl>, AA9 <dbl>,
##      #   AA10 <dbl>, AA11 <dbl>, AA12 <dbl>,
##      #   AB1 <dbl>, AB2 <dbl>, ...
##    tax_abund:
##      # A tibble: 744 x 109
##        taxon_id   AA1   AA2    AA3   AA4   AA5
##        <chr>    <dbl> <dbl>  <dbl> <dbl> <dbl>
##      1 aac      50919   501 100902 60211 54678
##      2 aaf       6430     0  26293 25793 23953
##      3 aag      39191   482  59539 27125 24979
##      # ... with 741 more rows, and 103 more
##      #   variables: AA6 <dbl>, AA7 <dbl>,
##      #   AA8 <dbl>, AA9 <dbl>, AA10 <dbl>,
##      #   AA11 <dbl>, AA12 <dbl>, AB1 <dbl>,
##      #   AB2 <dbl>, AB3 <dbl>, ...
##  0 functions:
## For each taxon, at every rank, compare groups of counts.
obj$data$diff_table <- compare_groups(obj, data = "tax_abund",
                                      cols = sample_data$sample_ID,
                                      groups = sample_data$site)

print(obj$data$diff_table)
##      # A tibble: 2,232 x 7
##         taxon_id treatment_1 treatment_2 log2_median_rat~
##         <chr>    <chr>       <chr>                  <dbl>
##       1 aac      Clan        WindyHill              0.612
##       2 aaf      Clan        WindyHill             -0.149
##       3 aag      Clan        WindyHill              0.873
##       4 aah      Clan        WindyHill             -1.83 
##       5 aai      Clan        WindyHill              0    
##       6 aaj      Clan        WindyHill              0    
##       7 aak      Clan        WindyHill              0    
##       8 aal      Clan        WindyHill              0    
##       9 aam      Clan        WindyHill              2.33 
##      10 aan      Clan        WindyHill              0.730
##      # ... with 2,222 more rows, and 3 more variables:
##      #   median_diff <dbl>, mean_diff <dbl>, wilcox_p_value <dbl>
## To correct for multiple comparisions (false discovery rate, FDR). 
obj <- mutate_obs(obj, "diff_table",
                  wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
## Set any differences that are not significant to 0.
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0
## Differential heat trees - Plot: Fungi-order

obj %>%
  filter_taxa(taxon_ranks == "o", supertaxa = TRUE) %>% 
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
  heat_tree_matrix(data = "diff_table",
                   node_label = cleaned_names,
                   node_size = n_obs, 
                   node_color = log2_median_ratio, 
                   node_color_trans = "linear",
                   node_color_interval = c(-3, 3), 
                   edge_color_interval = c(-3, 3), 
                   node_color_range = diverging_palette(), 
                   node_size_axis_label = "OTUs number",
                   node_color_axis_label = "Median counts Log2 ratio",
                   layout = "da", initial_layout = "re",
                   key_size = 0.67,
                   seed = 2,
                   output_file = "Fig_comparation_sites.jpg")

## Adding a new "character" vector of length 144.
## Error: All pairs being compared should have one value per taxon (130). The following do not:
##    Clan vs. WindyHill (744), Clan vs. Dukuduku (744), WindyHill vs. Dukuduku (744)

I think the problem could be that I have a different number of rows in obj$data$diff_table than there are taxa in obj$data$tax_abund. However, I ran the same script with the same data a year ago and everything ran fine, not sure what is the problem right now...

Any help would be great! Thank you very much! Maria

zachary-foster commented 4 years ago

Hello Maria,

Can you email me that data used in that code so I can tests it? My email is zacharyfoster1989@gmail.com. Thanks

zachary-foster commented 4 years ago

Thanks for the data! I think I found the problem. You need to add reassign_obs = FALSE to the filter_taxa calls here:

obj %>%
  filter_taxa(taxon_ranks == "o", supertaxa = TRUE) %>% 
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
  heat_tree_matrix( ...

It should be

obj %>%
  filter_taxa(taxon_ranks == "o", supertaxa = TRUE, reassign_obs = FALSE) %>% 
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$"), reassign_obs = FALSE) %>%
  heat_tree_matrix(data = "diff_table",

Without reassign_obs = FALSE, no rows will be removed from any of the tables in obj, including diff_table, when taxa are removed. If you run:

obj %>%
  filter_taxa(taxon_ranks == "o", supertaxa = TRUE) %>% #plot the taxa to the rank "order"
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$"))

you can see that there 2232 rows in diff_table but only 130 taxa. There should be 3 * 130 = 390 rows, since there are three comparisons. That is because instead of removing rows in diff_table when taxa were filtered out, filter_taxa reassigned the rows to supertaxa that were not filtered out. For some things, like abundance matrices, this is desirable, but not in this case. Adding reassign_obs = FALSE means any rows assigned to removed taxa are also removed from all the data in obj.

Does that fix the problem?

Maria-Vivas commented 4 years ago

Hello @zachary-foster,

I added reassign_obs = FALSE but it was not working. Then, I realised it also is necessary to include drop_obs = TRUE to remove the rows from any table.

It should be:

obj %>%
  filter_taxa(taxon_ranks == "o", supertaxa = TRUE, reassign_obs = FALSE, drop_obs =TRUE) %>% 

Thank you so much for your help!

antoine4ucsd commented 2 years ago

Hello I have the same error but none of the solutions mentioned above seems to work. any chance some can help me? I can share the data and elaborate via mp. my goal is to compare microbiome composition between 2 groups (stool vs tissues) and ideally between each sample site (5 tissue types + stool)

I am starting with phyloseq object

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 326 taxa and 71 samples ]
sample_data() Sample Data:       [ 71 samples by 25 sample variables ]
tax_table()   Taxonomy Table:    [ 326 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 326 tips and 325 internal nodes ]

that I convert :

psmat<- parse_phyloseq(ps)

then

psmat$data$rel_abd <- calc_obs_props(psmat, "otu_table", other_cols = T)
psmat$data$tax_rel_abd <- calc_taxon_abund(psmat, "rel_abd")
psmat$data$diff_site <- compare_groups(psmat, data = "tax_rel_abd",
                                       cols = meta2$SampleID,
                                       groups = meta2$bin)
psmat <- mutate_obs(psmat, "diff_site",
                    wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
psmat$data$diff_table$log2_median_ratio[psmat$data$diff_table$wilcox_p_value > 0.05] <- 0

but it failed here:

 psmat %>%
        metacoder::filter_taxa(taxon_ranks == "Family", supertaxa = TRUE) %>%
        mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
        metacoder::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
        heat_tree_matrix(dataset = "diff_site",
                         node_label = cleaned_names,
                         node_size = n_obs, # number of OTUs
                         node_color = log2_median_ratio, # difference between groups
                         node_color_trans = "linear",
                         node_color_interval = c(-3, 3), # symmetric interval
                         edge_color_interval = c(-3, 3), # symmetric interval
                         node_color_range = diverging_palette(), # diverging colors
                         node_size_axis_label = "OTU count",
                         node_color_axis_label = "Log 2 ratio of median counts",
                         layout = "da", initial_layout = "re",
                         key_size = 0.67,
                         seed = 2)

ERROR: Error: All pairs being compared should have one value per taxon (187). The following do not: Tissues vs. Stool (213)

I tried reassign_obs = FALSE, drop_obs =TRUE

Thank you so much for your help!

zachary-foster commented 2 years ago

Try adding reassign_obs = FALSE to your metacoder::filter_taxa calls. Since you have one value per taxon, you don't want values for the removed genus being reassigned to their families, which is the default behavior.

antoine4ucsd commented 2 years ago

thank you! I tihnk this is working. let me do some checks.

antoine4ucsd commented 2 years ago

it is working but I would need some more advices... I want to compare groups (ideally more than 2 but starting with 2). I prepared my data:

psmat$data$diff_table <- compare_groups(psmat, data = "tax_abund",
                                        cols = meta2$SampleID,
                                        groups = meta2$bin)
psmat$data$diff_site <- compare_groups(psmat, data = "tax_rel_abd",
                                       cols = meta2$SampleID,
                                       groups = meta2$bin)

diff_table looks like this:

_diff_table 2022-03-09

but nothing is showing up in the heat tree when running this (I tried at the family level or genus)

plot2=psmat %>%
        mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
        metacoder::filter_taxa(taxon_ranks == "Family", supertaxa = TRUE,grepl(cleaned_names, pattern = "^[a-zA-Z]+$"),reassign_obs = FALSE) %>%
       # OR metacoder::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$"),reassign_obs = FALSE) %>%
        heat_tree_matrix(dataset = "diff_site", # also tried with "diff_table"
                         node_label = cleaned_names,
                         node_size = n_obs, # number of OTUs
                         node_color = log2_median_ratio, # difference between groups
                         node_color_trans = "linear",
                         node_color_interval = c(-3, 3), # symmetric interval
                        edge_color_interval = c(-3, 3), # symmetric interval
                         node_color_range = diverging_palette(), # diverging colors
                         node_size_axis_label = "OTU count",
                         node_color_axis_label = "Log 2 ratio of median counts",
                         layout = "da", initial_layout = "re",
                         key_size = 0.67,
                         seed = 2)

Happy to share data via mp if it can help thank you! _heat tree tissue2stool 2022-03-09

the head of the diff table looks like this

antoine4ucsd commented 2 years ago

it works. thank you!

zachary-foster commented 2 years ago

No problem! Did you solve your problem?

antoine4ucsd commented 2 years ago

Yes. I was trying to apply a heat_matrix to a comparison between 2 groups only. Works nicely now. Thank you!