ropensci / taxa

taxonomic classes for R
https://docs.ropensci.org/taxa
Other
48 stars 12 forks source link

Filter_obs error. #188

Closed grabear closed 5 years ago

grabear commented 5 years ago

Hey @zachary-foster !

I'm currently trying to resolve a weird issue...

I am trying to filter a taxmap object, and I keep getting this error at the bottom.

> metacoder_object
<Taxmap>
  400 taxa: ab. Bacteria, ad. Firmicutes, ae. Proteobacteria ... rr. uncultured bacterium, rs. uncultured bacterium, rt. uncultured organism
  400 edges: NA->ab, ab->ad, ab->ae, ab->af, ab->ag, ab->ah, ab->ai, ab->aj, ab->ak ... jp->rj, io->rk, jq->rl, jr->ro, jr->rp, js->rr, jt->rs, hf->rt
  6 data sets:
    otu_table:
      # A tibble: 1,662 x 50
        taxon_id otu_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21 Sample_22 Sample_24 Sample_25
    tax_data:
      # A tibble: 1,662 x 8
        taxon_id Kingdom  Phylum     Class         Order           Family             Genus                 Species             
    tax_table:
      # A tibble: 400 x 49
        taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21 Sample_22 Sample_24 Sample_25 Sample_26
    diff_table:
      # A tibble: 400 x 14
        taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2 wilcox_p_value hartigan_dip_tr~
  0 functions:
#
#
# Error
> metacoder_object %>% filter_taxa(n_supertaxa < 2) %>% filter_obs("diff_table", n_supertaxa < 2) 
Error: All logical filtering criteria must be the same length as the data sets filtered.
In addition: Warning messages:
1: There is no "taxon_id" column in the data set "3", so there are no taxon IDs. 
2: The data set "4" is named, but not named by taxon ids. 

However, if I switch the order that I filter (filter_obs %>% filter_taxa) it works out fine. filter_taxa on it's own also works fine. So there is some hangup with filter_obs when it comes after filter_taxa for this use case.

Do you know what's happening here??

@sdhutchins Keep up with this issue :+1:

grabear commented 5 years ago

@sdhutchins Anything to add here?

sdhutchins commented 5 years ago

@grabear I think you're covering it perfectly.

Should be noted this is using the GitHub version of metacoder. I can recreate the error as well with the same object.

Session Info

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2       rstudioapi_0.8       ggtree_1.14.0        ggthemes_4.0.1       ggpubr_0.1.8         magrittr_1.5        
 [7] knitr_1.20           vegan_2.5-3          lattice_0.20-35      permute_0.9-4        microbiome_1.4.0     extrafont_0.17      
[13] RColorBrewer_1.1-2   forcats_0.3.0        stringr_1.3.1        dplyr_0.7.7          purrr_0.2.5          readr_1.1.1         
[19] tidyr_0.8.2          tibble_1.4.2         tidyverse_1.2.1      plyr_1.8.4           plotly_4.8.0         ampvis2_2.4.1       
[25] ggplot2_3.1.0        modes_0.7.0          diptest_0.75-7       metacoder_0.3.0.9001 taxa_0.3.1.9001      phyloseq_1.26.0     
[31] workflowr_1.1.1     
grabear commented 5 years ago

I'm noticing several things, after a bunch of testing:

  1. The drop_obs/reassing_obs parameters for taxa::filter_taxa do not properly filter any of the tables in taxmap_obj$data a. Doing filter_taxa first removes taxa from the classification, but keeps it in the observation data. This is what's causing the error after trying to use filter_taxa %>% filter_obs. Essentially, the length(observation_data) > length(taxa_data) :

    Error: All logical filtering criteria must be the same length as the data sets filtered.'

    
    metacoder_obj %>% filter_taxa(n_supertaxa < 2)
    <Taxmap>
    14 taxa: ab. Bacteria, ad. Firmicutes, ae. Proteobacteria ... ap. Planctomycetes, aq. Tenericutes
    14 edges: NA->ab, ab->ad, ab->ae, ab->af, ab->ag, ab->ah, ab->ai ... ab->ak, ab->al, ab->am, ab->an, ab->ap, ab->aq
    6 data sets:
    diff_table:
    # A tibble: 400 x 14
    taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2
    ...
    # skip most datasets for saving space.  The key here is that we went from 400 taxa to 14 taxa in our object
    # The diff_table was not filtered properly (at 400 taxa), but the taxonomy was (at 14 taxa)
```r
metacoder_obj %>% filter_obs(c("diff_table", "tax_table"), n_supertaxa < 2)
<Taxmap>
  400 taxa: ab. Bacteria, ad. Firmicutes, ae. Proteobacteria ... rs. uncultured bacterium, rt. uncultured organism
  400 edges: NA->ab, ab->ad, ab->ae, ab->af, ab->ag, ab->ah, ab->ai ... jq->rl, jr->ro, jr->rp, js->rr, jt->rs, hf->rt
  6 data sets:
    tax_table:
      # A tibble: 14 x 49
        taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21 Sample_22
...
    diff_table:
      # A tibble: 14 x 14
        taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2
...
# Observation data was filtered properly.  Taxa data cannot be manipulated appropriately this way
...
# Piping in %>% filter_taxa(n_supertaxa < 2), to the previous code block gives:
<Taxmap>
  14 taxa: ab. Bacteria, ad. Firmicutes, ae. Proteobacteria ... ap. Planctomycetes, aq. Tenericutes
  14 edges: NA->ab, ab->ad, ab->ae, ab->af, ab->ag, ab->ah, ab->ai ... ab->ak, ab->al, ab->am, ab->an, ab->ap, ab->aq
  6 data sets:
    tax_table:
      # A tibble: 14 x 49
        taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21 Sample_22
    diff_table:
      # A tibble: 14 x 14
        taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2
...
# Using filter_obs with the original data (which cosists of >1600 taxa vs 400 taxa) will throw the same error we've been getting
> metacoder_object %>% filter_obs(c("otu_table"), n_supertaxa < 2)
Error: All logical filtering criteria must be the same length as the data sets filtered.
...
# This is why I think we need to swith how we are piping our functions.
  1. Maybe we should do filter_obs %>% filter_taxa instead. That way we can filter the observation data using the full taxonomy heirarchy, and then filter the taxa down afterward to the same level of the observation data.

I feel like this is some kind of bug. It may be of note that we used @zachary-foster's metacoder package along with metacoder::parse_phyloseq' to get our data into a taxmap object. Not sure if this is relevant or not.. Either way I'm going to continue our analysis, usingfilter_obs %>% filter_taxa` until further help is given here @sdhutchins.

grabear commented 5 years ago

This is also true in other filtering situations (235 taxa vs 408 taxa):

> filter_ambiguous_taxa(xx)
<Taxmap>
  235 taxa: ab. Bacteria, ac. Archaea ... qq. Campylobacter fetus subsp. fetus, qs. Helicobacter macacae
  235 edges: NA->ab, NA->ac, ab->ad, ab->ae, ab->af, ab->ag, ab->ah ... ib->os, id->oy, id->oz, il->pk, jd->qq, je->qs
  6 data sets:
    otu_table:
      # A tibble: 1,665 x 50
        taxon_id otu_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21
    tax_data:
      # A tibble: 1,665 x 8
        taxon_id Kingdom  Phylum     Class         Order          Family           Genus               Species           
    tax_table:
      # A tibble: 408 x 49
        taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21 Sample_22.
    diff_table:
      # A tibble: 408 x 14
        taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2
  0 functions:
Warning messages:
1: There is no "taxon_id" column in the data set "3", so there are no taxon IDs. 
2: The data set "4" is named, but not named by taxon ids. 

This means I have to do something with a for loop to remove ambiguous taxonomies, which is nowhere near as thorough as taxa::filter_amb_taxa.

    for (amb_term in ambiguous_filter) {
      filtered_meta <- filtered_meta %>%
        taxa::filter_obs(c("diff_table", "tax_table"), !stringr::str_detect(taxon_names, amb_term)) %>%
        taxa::filter_taxa(!stringr::str_detect(taxon_names, amb_term))
    }
grabear commented 5 years ago
...
# Using filter_obs with the original data (which cosists of 1665 taxa vs 400 taxa) will throw the same error we've been getting
> metacoder_object %>% filter_obs(c("otu_table"), n_supertaxa < 2)
Error: All logical filtering criteria must be the same length as the data sets filtered.
...
# This is why I think we need to swith how we are piping our functions.

Below I've included a situation where the above :point_up: statement isn't true. When using a different filtering condition, all of the tables are filtered ("otu_table", "tax_data" are normally at 1665 taxa after filtering, but here they are filtered to 1662 taxa).

Original Data

> xx
<Taxmap>
  408 taxa: ab. Bacteria, ac. Archaea, ad. Firmicutes ... rs. uncultured bacterium, rt. uncultured organism
  408 edges: NA->ab, NA->ac, ab->ad, ab->ae, ab->af, ab->ag, ab->ah ... jq->rl, jr->ro, jr->rp, js->rr, jt->rs, hf->rt
  6 data sets:
    otu_table:
      # A tibble: 1,665 x 50
        taxon_id otu_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21
    tax_data:
      # A tibble: 1,665 x 8
        taxon_id Kingdom  Phylum     Class         Order          Family           Genus               Species           
    tax_table:
      # A tibble: 408 x 49
        taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21 Sample_22
    diff_table:
      # A tibble: 408 x 14
        taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2

Filtered Data with New Condition

> filter_taxa(xx, taxon_names == "Archaea", subtaxa = TRUE, invert = TRUE)
<Taxmap>
  400 taxa: ab. Bacteria, ad. Firmicutes, ae. Proteobacteria ... rs. uncultured bacterium, rt. uncultured organism
  400 edges: NA->ab, ab->ad, ab->ae, ab->af, ab->ag, ab->ah, ab->ai ... jq->rl, jr->ro, jr->rp, js->rr, jt->rs, hf->rt
  6 data sets:
    otu_table:
      # A tibble: 1,662 x 50
        taxon_id otu_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21
    tax_data:
      # A tibble: 1,662 x 8
        taxon_id Kingdom  Phylum     Class         Order          Family           Genus               Species           
    tax_table:
      # A tibble: 400 x 49
        taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13 Sample_14 Sample_17 Sample_21 Sample_22
    diff_table:
      # A tibble: 400 x 14
        taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2
Warning messages:
1: There is no "taxon_id" column in the data set "3", so there are no taxon IDs. 
2: The data set "4" is named, but not named by taxon ids. 

So there's something going on that makes filtering inconsistent, when using _taxonid vs ??regex??. I know it must have something to do with the taxon_id, but I can't put my finger on it.

grabear commented 5 years ago

Maybe this is an otu_id vs taxon_id thing?

grabear commented 5 years ago

Here's another weird case:

Original Data

> metac
<Taxmap>
  237 taxa: ab. Bacteria, ad. Firmicutes ... qs. Helicobacter macacae
  237 edges: NA->ab, ab->ad, ab->ae, ab->af ... id->oz, il->pk, jd->qq, je->qs
  6 data sets:
    otu_table:
      # A tibble: 1,662 x 50
        taxon_id otu_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13
    tax_data:
      # A tibble: 1,662 x 8
        taxon_id Kingdom  Phylum   Class    Order     Family     Genus      Species    
    tax_table:
      # A tibble: 237 x 49
        taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13
    diff_table:
      # A tibble: 237 x 14
        taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff
  0 functions:

Filtered Data with New Condition

> metac %>% filter_taxa(taxon_ranks == "Class", reassign_obs = FALSE)
<Taxmap>
  21 taxa: ar. Negativicutes, as. Clostridia ... bm. Mollicutes
  21 edges: NA->ar, NA->as, NA->at, NA->au, NA->av ... NA->bi, NA->bj, NA->bl, NA->bm
  6 data sets:
    otu_table:
      # A tibble: 3 x 50
        taxon_id otu_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13
    tax_data:
      # A tibble: 3 x 8
        taxon_id Kingdom  Phylum    Class  Order      Family     Genus      Species    
    tax_table:
      # A tibble: 21 x 49
        taxon_id Sample_1 Sample_2 Sample_3 Sample_6 Sample_9 Sample_10 Sample_13
    diff_table:
      # A tibble: 21 x 14
        taxon_id treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff
  0 functions:
grabear commented 5 years ago

For my analysis, I used obj$data$tax_table <- metacoder::calc_taxon_abund(obj, "otu_table", ...) and obj$data$diff_table <- metacoder::compare_groups(obj, "tax_table", ...). @zachary-foster I know that the tables mentioned use intermediate taxa, and obj$data$otu_table and obj$data$tax_data both use leaf taxa. But shouldn't all obj$data reflect any change made with the taxa::filter functions?

I know this comment contains a lot of the metacoder stuff, and I think it might belong there. But for now i'm going to keep it going in this thread.

This may be related to https://github.com/grunwaldlab/metacoder/issues/188

zachary-foster commented 5 years ago

Hello @grabear and @sdhutchins. Thanks for all the testing! I am looking at this now.

zachary-foster commented 5 years ago

filter_obs("diff_table", n_supertaxa < 2)

This will only make sense if "diff_table" has a one row per taxon in the same order as the taxa. You would get this if you used compare_groups with only one comparison (two groups), but otherwise, I would expect the length difference error. The two warnings you get can probably be ignored unless you thought those table should have been filtered. Tables without a "taxon_ids" column will not be filtered.

. Doing filter_taxa first removes taxa from the classification, but keeps it in the observation data.

filter_taxa only removes observation data (ie user-defined) if it cant find a way to preserve it and usually there is a way to preserve it. filter_taxa can be used to filter observation data by taxon, but it takes some non-default parameters and it gets harder to understand what is going on when multiple tables are in the object. For example, when trying to do filter_taxa( n_supertaxa < 2), I would not expect the "diff_table" to loose any rows, because all of the rows would be able to be reassigned to a supertaxon. I expect the taxon ids changed, but not row count.

If the table has a one-to-one relationship with the taxa, then you can make filter_taxa filter that table the same way it filters taxa like so:

> obj
<Taxmap>
  174 taxa: ab. Root, ac. Proteobacteria, ad. Bacteroidetes ... gq. Collinsella, gr. Blautia, gs. Clostridium
  174 edges: NA->ab, ab->ac, ab->ad, ab->ae, ab->af, ab->ag, ab->ah ... di->gm, dj->gn, bu->go, dk->gp, cm->gq, cf->gr, cw->gs
  4 data sets:
    tax_data:
      # A tibble: 1,000 x 53
        taxon_id otu_id lineage `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044` `700101365`
        <chr>    <chr>  <chr>         <int>       <int>       <int>       <int>       <int>       <int>       <int>       <int>
      1 dm       OTU_9… r__Roo…           0           2           1           0           0           0           0           0
      2 dn       OTU_9… r__Roo…           0           0           0           0           0           0           0           0
      3 do       OTU_9… r__Roo…           0           1           0           0           0           0           0           0
      # ... with 997 more rows, and 42 more variables: `700100431` <int>, `700016050` <int>, `700032425` <int>,
      #   `700024855` <int>, `700103488` <int>, `700096869` <int>, `700107379` <int>, `700096422` <int>, `700102417` <int>,
      #   `700114168` <int>, …
    class_data:
      # A tibble: 5,922 x 5
        taxon_id input_index tax_rank tax_name            regex_match           
        <chr>          <int> <chr>    <chr>               <chr>                 
      1 ab                 1 r        Root                r__Root               
      2 ac                 1 p        Proteobacteria      p__Proteobacteria     
      3 aj                 1 c        Gammaproteobacteria c__Gammaproteobacteria
      # ... with 5,919 more rows
    tax_abund:
      # A tibble: 174 x 51
        taxon_id `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044` `700101365` `700100431`
      * <chr>          <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
      1 ab               955        4330         887         598        6303        3379        1219         829        1880
      2 ac               193         144          17         132        1388          30           0          21         168
      3 ad                 5          43           6          21          20          15           7           5          43
      # ... with 171 more rows, and 41 more variables: `700016050` <dbl>, `700032425` <dbl>, `700024855` <dbl>,
      #   `700103488` <dbl>, `700096869` <dbl>, `700107379` <dbl>, `700096422` <dbl>, `700102417` <dbl>, `700114168` <dbl>,
      #   `700037540` <dbl>, …
    diff_table:
      # A tibble: 174 x 7
        taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
        <chr>    <chr>       <chr>                   <dbl>       <dbl>     <dbl>          <dbl>
      1 ab       female      male                    0.138         183     -11            0.648
      2 ac       female      male                    0.482          48      96.5          0.461
      3 ad       female      male                    0.211          31     180.           0.634
      # ... with 171 more rows
  0 functions:
> obj %>% filter_taxa(n_supertaxa < 2, drop_obs = c(diff_table = TRUE), reassign_obs = c(diff_table = FALSE))
<Taxmap>
  8 taxa: ab. Root, ac. Proteobacteria, ad. Bacteroidetes ... ag. Fusobacteria, ah. Tenericutes, ai. Spirochaetes
  8 edges: NA->ab, ab->ac, ab->ad, ab->ae, ab->af, ab->ag, ab->ah, ab->ai
  4 data sets:
    tax_data:
      # A tibble: 1,000 x 53
        taxon_id otu_id lineage `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044` `700101365`
        <chr>    <chr>  <chr>         <int>       <int>       <int>       <int>       <int>       <int>       <int>       <int>
      1 ac       OTU_9… r__Roo…           0           2           1           0           0           0           0           0
      2 ad       OTU_9… r__Roo…           0           0           0           0           0           0           0           0
      3 ad       OTU_9… r__Roo…           0           1           0           0           0           0           0           0
      # ... with 997 more rows, and 42 more variables: `700100431` <int>, `700016050` <int>, `700032425` <int>,
      #   `700024855` <int>, `700103488` <int>, `700096869` <int>, `700107379` <int>, `700096422` <int>, `700102417` <int>,
      #   `700114168` <int>, …
    class_data:
      # A tibble: 5,922 x 5
        taxon_id input_index tax_rank tax_name            regex_match           
        <chr>          <int> <chr>    <chr>               <chr>                 
      1 ab                 1 r        Root                r__Root               
      2 ac                 1 p        Proteobacteria      p__Proteobacteria     
      3 ac                 1 c        Gammaproteobacteria c__Gammaproteobacteria
      # ... with 5,919 more rows
    tax_abund:
      # A tibble: 174 x 51
        taxon_id `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044` `700101365` `700100431`
        <chr>          <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
      1 ab               955        4330         887         598        6303        3379        1219         829        1880
      2 ac               193         144          17         132        1388          30           0          21         168
      3 ad                 5          43           6          21          20          15           7           5          43
      # ... with 171 more rows, and 41 more variables: `700016050` <dbl>, `700032425` <dbl>, `700024855` <dbl>,
      #   `700103488` <dbl>, `700096869` <dbl>, `700107379` <dbl>, `700096422` <dbl>, `700102417` <dbl>, `700114168` <dbl>,
      #   `700037540` <dbl>, …
    diff_table:
      # A tibble: 8 x 7
        taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
        <chr>    <chr>       <chr>                   <dbl>       <dbl>     <dbl>          <dbl>
      1 ab       female      male                    0.138         183     -11            0.648
      2 ac       female      male                    0.482          48      96.5          0.461
      3 ad       female      male                    0.211          31     180.           0.634
      # ... with 5 more rows
  0 functions:

To do the same thing (mostly) with filter_obs:

> obj %>% filter_obs("diff_table", n_supertaxa[obj$data$diff_table$taxon_id] < 2)
<Taxmap>
  174 taxa: ab. Root, ac. Proteobacteria, ad. Bacteroidetes ... gq. Collinsella, gr. Blautia, gs. Clostridium
  174 edges: NA->ab, ab->ac, ab->ad, ab->ae, ab->af, ab->ag, ab->ah ... di->gm, dj->gn, bu->go, dk->gp, cm->gq, cf->gr, cw->gs
  4 data sets:
    tax_data:
      # A tibble: 1,000 x 53
        taxon_id otu_id lineage `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044` `700101365`
        <chr>    <chr>  <chr>         <int>       <int>       <int>       <int>       <int>       <int>       <int>       <int>
      1 dm       OTU_9… r__Roo…           0           2           1           0           0           0           0           0
      2 dn       OTU_9… r__Roo…           0           0           0           0           0           0           0           0
      3 do       OTU_9… r__Roo…           0           1           0           0           0           0           0           0
      # ... with 997 more rows, and 42 more variables: `700100431` <int>, `700016050` <int>, `700032425` <int>,
      #   `700024855` <int>, `700103488` <int>, `700096869` <int>, `700107379` <int>, `700096422` <int>, `700102417` <int>,
      #   `700114168` <int>, …
    class_data:
      # A tibble: 5,922 x 5
        taxon_id input_index tax_rank tax_name            regex_match           
        <chr>          <int> <chr>    <chr>               <chr>                 
      1 ab                 1 r        Root                r__Root               
      2 ac                 1 p        Proteobacteria      p__Proteobacteria     
      3 aj                 1 c        Gammaproteobacteria c__Gammaproteobacteria
      # ... with 5,919 more rows
    tax_abund:
      # A tibble: 174 x 51
        taxon_id `700035949` `700097855` `700100489` `700111314` `700033744` `700109581` `700111044` `700101365` `700100431`
      * <chr>          <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
      1 ab               955        4330         887         598        6303        3379        1219         829        1880
      2 ac               193         144          17         132        1388          30           0          21         168
      3 ad                 5          43           6          21          20          15           7           5          43
      # ... with 171 more rows, and 41 more variables: `700016050` <dbl>, `700032425` <dbl>, `700024855` <dbl>,
      #   `700103488` <dbl>, `700096869` <dbl>, `700107379` <dbl>, `700096422` <dbl>, `700102417` <dbl>, `700114168` <dbl>,
      #   `700037540` <dbl>, …
    diff_table:
      # A tibble: 8 x 7
        taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
        <chr>    <chr>       <chr>                   <dbl>       <dbl>     <dbl>          <dbl>
      1 ab       female      male                    0.138         183     -11            0.648
      2 ac       female      male                    0.482          48      96.5          0.461
      3 ad       female      male                    0.211          31     180.           0.634
      # ... with 5 more rows
  0 functions:

This will work even if there is not a 1-to-1 relationship with taxa.

But shouldn't all obj$data reflect any change made with the taxa::filter functions?

The short answer is that the taxa::filter functions main goal is to keep the two in sync. filter_taxa does all is can to not filter observations by default and filter_obs does all it can to not filter taxa by default. filter_taxa, by default, will only remove a row if it cant reassign it to a supertaxon that did not get filtered out (n_supertaxa > 2 will always preserve a supertaxon). filter_obs will only remove a taxon if it or a subtaxon no longer exist in any table.

I am not sure I understand every question brought up in this issue, but I have a feeling that there is not a bug. Things just get tricky when there are multiple tables and each table has different characteristics. If you still think there is a bug or have questions after going through this answer, then repost an example and I will try to address that example directly.

grabear commented 5 years ago

@zachary-foster Thanks for your response! I know it's a lot, but I kept running into things I didn't understand. Your comments have been very helpful. As you know, I've been having these issues all along :rofl:

I think that we mainly would like to produce observation data tables in obj$data$diff and obj$data$tax_abund and use the taxon_ids to taxa::map_data. That way we can produce other plots like the correlation plots here that accurately reflect the heat_tree visualizations.

If the table has a one-to-one relationship with the taxa, then you can make filter_taxa filter that table the same way it filters taxa like so:

Based on your comments I believe we can achieve this, but only with a a 1:1 relationship with the taxa/obs data. If we have more than two groups, then what should we do? Split the diff table? Or split our data into multiple taxmap objects?

zachary-foster commented 5 years ago

No problem! Filtering with multiple different tables gets confusing. I have often thought I found bugs myself until realizing that I forgot some detail of how my own code works. I try to make it the default usages of the functions as intuitive as possible, but understanding how all the options interact can be difficult.

I think I see the confusion. The output of calc_taxon_abund and compare_groups return per-taxon information, so you can filter them with taxon attributes, but once you filter them or the taxa, they are no longer per-taxon, so it gets more complicated. Also, if you dont use drop_* and reassing_obs = FALSE they will not be filtered out at all, just reassigned taxon IDs.

If we have more than two groups, then what should we do?

If you want to filter data sets that are not 1-to-1 with taxa, using taxon characteristics (n_supertaxa, ranks, taxon names, etc), then you need to map the taxon characteristic to the data before filtering. Typically this involves using the helper functions like n_supertaxa(obj) directly and subsetting their output. Since the result of these functions is always named by taxon ID, you can use the taxon_id column of the data set of interested to "subset" them like so:

obj %>% filter_obs("diff_table", n_supertaxa[obj$data$diff_table$taxon_id] < 2)

Another way:

obj$data$diff_table <- obj$data$diff_table[n_supertaxa(obj)[obj$data$diff_table$taxon_id] < 2, ]

Another way:

obj %>% 
  mutate_obs(diff_n_supertaxa = n_supertaxa[obj$data$diff_table$taxon_id]) %>%
  filter_obs("diff_table", diff_n_supertaxa < 2) %>% ... more filtering using diff_n_supertaxa.

Think of this like adding a column to "diff_table" that has the number of supertaxa for each entry, based on the taxon id, and then filtering on that. You could actually do that if it makes it easier to understand.

All that being said, there could be bugs in the code, so if things dont make sense, feel free to let me know.

grabear commented 5 years ago

Here is one of my solutions to get a diff table with the data that I need.

...
# Custom functions
...
## Import data as phyloseq
> phy <- get_phyloseq_obj(".RDATA")
## Preprocess data using phyloseq package 
> proc_phy <- preprocess_phyloseq(phy)
## Convert phyloseq object and analyze data using metacoder
> obj <- get_metacoder_obj(proc_phy)
...
# Diff Table Solution
...
> x1 <- obj %>% filter_taxa(n_supertaxa < 2, reassign_obs = FALSE)
> dplyr::right_join(x1$taxonomy_table(subset = taxon_ids(x1), add_id_col = TRUE), x1$data$diff_table, by = c("taxon_ids" = "taxon_id"))
# A tibble: 16 x 16
   taxon_ids Kingdom Phylum treatment_1 treatment_2 log2_median_rat~ log2_mean_ratio median_diff mean_diff mean_treat1 mean_treat2 wilcox_p_value hartigan_dip_tr~ hartigan_dip_tr~ bimodality_coef~
   <chr>     <fct>   <fct>  <chr>       <chr>                  <dbl>           <dbl>       <dbl>     <dbl>       <dbl>       <dbl>          <dbl>            <dbl>            <dbl>            <dbl>
 1 ab        Bacter~ NA     Stressed    Control            -0.000165        -0.00122 -0.000114    -8.44e-4    0.999      1.000          0.0170               0.996           0.554             0.879
 2 ac        Archaea NA     Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
 3 ad        Bacter~ Firmi~ Stressed    Control             0.266            0.357    0.0737       1.01e-1    0.460      0.359          0.000170             0.501           0.169             0.473
 4 ae        Bacter~ Prote~ Stressed    Control            -0.901           -0.298   -0.00748     -6.14e-3    0.0267     0.0329         0.141                0.847           0.656             0.912
 5 af        Bacter~ Spiro~ Stressed    Control             0.830            0.607    0.00885      1.21e-2    0.0352     0.0231         0.135                0.222           0.961             0.646
 6 ag        Bacter~ Verru~ Stressed    Control            -2.05             0.416   -0.000824     5.13e-4    0.00205    0.00154        0.00852              0.814           0.797             0.946
 7 ah        Bacter~ Bacte~ Stressed    Control            -0.180           -0.303   -0.0660      -1.09e-1    0.468      0.577          0.00355              0.377           0.699             0.598
 8 ai        Bacter~ Fibro~ Stressed    Control             1.15             1.65     0.000466     1.65e-3    0.00242    0.000773       0.0728               0.871           0.929             0.796
 9 aj        Bacter~ Actin~ Stressed    Control            -2.67            -2.26    -0.000591    -6.42e-4    0.000169   0.000811       0.0000153            0.918           0.954             0.832
10 ak        Bacter~ Cyano~ Stressed    Control             2.10             1.53     0.000473     7.57e-4    0.00116    0.000402       0.0279               0.577           0.906             0.783
11 al        Bacter~ Elusi~ Stressed    Control             0.706           -1.71     0.000105    -2.33e-3    0.00103    0.00336        0.584                0.822           0.756             0.853
12 am        Bacter~ Lenti~ Stressed    Control            -0.420            0.450   -0.0000118    3.67e-5    0.000137   0.000100       0.868                0.890           0.604             0.891
13 an        Bacter~ Sacch~ Stressed    Control             3.60             2.74     0.000736     1.27e-3    0.00149    0.000224       0.000227             0.564           0.662             0.830
14 ao        Archaea Eurya~ Stressed    Control             3.48             2.53     0.000114     8.44e-4    0.00102    0.000176       0.0200               0.996           0.554             0.879
15 ap        Bacter~ Planc~ Stressed    Control             0                2.82     0.000178     5.50e-4    0.000641   0.0000911      0.00206              0.991           0.917             0.780
16 aq        Bacter~ Tener~ Stressed    Control            -0.0916          -0.169   -0.00000927  -5.17e-5    0.000416   0.000468       0.638                0.855           0.0624            0.804
# ... with 1 more variable: bimodality_coeff_treat2 <dbl>
zachary-foster commented 5 years ago

Closing due to inactivity. Feel free to reopen if needed