grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation
Other
135 stars 28 forks source link

Confusion about filter_taxa and filter_obs #332

Open levlitichev opened 2 years ago

levlitichev commented 2 years ago

Hello,

Thank you for developing MetaCoder. I am enjoying the package so far. I am trying to filter my dataset before plotting heat trees, but instead of throwing counts away for low-abundant taxa, I'd like to reassign those counts to the lowest available supertaxa.

Here is my dataset:

library(metacoder)
#> This is metacoder verison 0.3.5 (stable)

(toy.df <- data.frame(
  sample1=c(1,2,4,10,0,1,18,3),
  sample2=c(0,1,0,9,1,0,4,2),
  sample3=c(0,2,3,13,0,1,17,2),
  taxonomy=c("K","K;P1","K;P2","K;P1;C1","K;P1;C2","K;P2;C3","K;P2;C4","K;P3;C5")))
#>   sample1 sample2 sample3 taxonomy
#> 1       1       0       0        K
#> 2       2       1       2     K;P1
#> 3       4       0       3     K;P2
#> 4      10       9      13  K;P1;C1
#> 5       0       1       0  K;P1;C2
#> 6       1       0       1  K;P2;C3
#> 7      18       4      17  K;P2;C4
#> 8       3       2       2  K;P3;C5

So, classes 1 and 2 (C1, C2) belong to phylum 1 (P1); classes 3 and 4 (C3, C4) belong to phylum 2; and class 5 (C5) belongs to phylum 3 (P3). I only want to display classes 1 and 4 (and their supertaxa) in my heat tree. But instead of throwing out counts for the other classes (C2, C3, C5), I want their counts to be assigned to the lowest available supertaxon. This is my desired output:

(desired.df <- data.frame(
  sample1=c(4,2,5,10,18),
  sample2=c(2,2,0,9,4),
  sample3=c(2,2,4,13,17),
  taxonomy=c("K","K;P1","K;P2","K;P1;C1","K;P2;C4")))
#>   sample1 sample2 sample3 taxonomy
#> 1       4       2       2        K
#> 2       2       2       2     K;P1
#> 3       5       0       4     K;P2
#> 4      10       9      13  K;P1;C1
#> 5      18       4      17  K;P2;C4

I want the counts for C2 to be assigned to P1, the counts for C3 to be assigned to P2, and the counts for C5 to be assigned to K. (I feel like this is the best way to simplify my heat tree without throwing away information, but please let me know if I'm thinking about that incorrectly.)

I've tried different things unsuccessfully:

toy.taxmap <- parse_tax_data(toy.df, class_cols = "taxonomy", class_sep = ";")

# wrong because I lose supertaxa of C1 and C4
toy.taxmap %>%
  filter_taxa(taxon_names %in% c("C1", "C4")) %>%
  get_dataset("tax_data")
#> # A tibble: 2 × 5
#>   taxon_id sample1 sample2 sample3 taxonomy
#>   <chr>      <dbl>   <dbl>   <dbl> <chr>   
#> 1 f             10       9      13 K;P1;C1 
#> 2 i             18       4      17 K;P2;C4

# wrong because no filtering seems to occur
toy.taxmap %>%
  filter_taxa(taxon_names %in% c("C1", "C4"), supertaxa=T) %>%
  get_dataset("tax_data")
#> # A tibble: 8 × 5
#>   taxon_id sample1 sample2 sample3 taxonomy
#>   <chr>      <dbl>   <dbl>   <dbl> <chr>   
#> 1 b              1       0       0 K       
#> 2 c              2       1       2 K;P1    
#> 3 d              4       0       3 K;P2    
#> 4 f             10       9      13 K;P1;C1 
#> 5 c              0       1       0 K;P1;C2 
#> 6 d              1       0       1 K;P2;C3 
#> 7 i             18       4      17 K;P2;C4 
#> 8 b              3       2       2 K;P3;C5

# wrong because I lose supertaxa of C1 and C4
toy.taxmap %>%
  filter_obs(data="tax_data", taxonomy %in% c("K;P1;C1","K;P2;C4"),
             drop_taxa=T, reassign_obs=T) %>%
  get_dataset("tax_data")
#> # A tibble: 2 × 5
#>   taxon_id sample1 sample2 sample3 taxonomy
#>   <chr>      <dbl>   <dbl>   <dbl> <chr>   
#> 1 f             10       9      13 K;P1;C1 
#> 2 i             18       4      17 K;P2;C4

I think the FAQs describe a very similar situation to mine, but I wasn't able to use those examples to do what I wanted.

Thanks in advance for your help.

Created on 2022-02-10 by the reprex package (v2.0.1)

zachary-foster commented 2 years ago

Thanks for the example code! If you use calc_taxon_abund, this will happen automatically. Right now you are filtering ASV/OTUs (I assume) by taxon, but it does not really make sense to add the counts from one ASV to another, even if that was identified as a supertaxon. Rather, you should add up the counts by taxon and then filter the per-taxon counts. That way, removing the counts for a genus, for example, would not change the counts for the family the genus is a part of, since those counts were already used to make the family-level counts. Anyway, you need per-taxon data to plot with heat_tree and right now it appears that you have per-ASV/organism data. Does that make sense? I can provide an example if needed.

levlitichev commented 2 years ago

Thanks for your reply. Aggregating to taxons first and then filtering makes sense. My only reservation is that eventually I want to show relative abundance, but it doesn't make sense to compute relative abundance after aggregating to taxons. Right?

zachary-foster commented 2 years ago

You could calculate the relative abundances before aggregating to the taxon level and it will work. You can use calc_obs_props for that. You are right that trying to calculate relative abundance on the taxon counts would not work.