Bug? filter_taxon with taxon_ranks removes all taxa without information with selected rank even with supertaxa = TRUE

LChWu commented 4 years ago

Hello,

I think I discovered something when using filter_taxon together with taxon_ranks. If a taxon does not have information on the rank you filter for, also with supertaxa = TRUE, the taxon will be discarded or maybe reassigned to the root, I am not sure where it disappears to, because the total OTU number does not change.

I am not sure if this phenomenum is documented anywhere. I wasn't able to find it so far. Maybe it can be added to describtion of the function or an option can be added to the function to stop this? Is there already one?

Here I reproduced what I found with some data from the metacoder workshop:

Data

# load packages
library(taxa)
library(metacoder)

## This is metacoder verison 0.3.3 (stable)

# load data from metacoder workshop "Plotting taxonomic data"
load("filtered_data.Rdata")
# reduce taxa for better visualization
obj <- filter_taxa(obj, grepl(pattern = "[a-zA-Z]+$", taxon_names), supertaxa = T)
print(obj)

## <Taxmap>
##   420 taxa: aad. Bacteria ... chs. Alcanivoracaceae
##   420 edges: NA->aad, aad->aaf, aad->aag ... amj->cfp, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 x 294
##         taxon_id OTU_ID M1981P563 M1977P1709 M1980P502 M1981P606
##         <chr>    <chr>      <dbl>      <dbl>     <dbl>     <dbl>
##       1 ben      2              7          0         2         7
##       2 beo      3          10251         59     15857       309
##       3 ben      4           1447       2496      2421      3331
##       # … with 2.168e+04 more rows, and 288 more variables:
##       #   M1981P599 <dbl>, M1981P611 <dbl>, M1981P587 <dbl>,
##       #   M1982P729 <dbl>, M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>,
##       #   M1555P221 <dbl>, …
##   0 functions:

# filter taxa by order, 'traditional way'
obj <- filter_taxa(obj, taxon_ranks == 'o', supertaxa = T)
print(obj)

## <Taxmap>
##   186 taxa: aad. Bacteria ... azv. Dehalococcoidales
##   186 edges: NA->aad, aad->aaf, aad->aag ... adg->ayq, ajz->azv
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 x 294
##         taxon_id OTU_ID M1981P563 M1977P1709 M1980P502 M1981P606
##         <chr>    <chr>      <dbl>      <dbl>     <dbl>     <dbl>
##       1 amg      2              7          0         2         7
##       2 amh      3          10251         59     15857       309
##       3 amg      4           1447       2496      2421      3331
##       # … with 2.168e+04 more rows, and 288 more variables:
##       #   M1981P599 <dbl>, M1981P611 <dbl>, M1981P587 <dbl>,
##       #   M1982P729 <dbl>, M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>,
##       #   M1555P221 <dbl>, …
##   0 functions:

# calculate relative abundance
obj$data$otu_counts <- calc_obs_props(obj, "otu_counts", other_cols = T)

## No `cols` specified, so using all numeric columns:
##    M1981P563, M1977P1709, M1980P502 ... M1978P342, M1955P778
## Calculating proportions from counts for 292 columns for 21684 observations.
## Warning: The following columns will be replaced in the output:
##    M1981P563, M1977P1709, M1980P502 ... M1978P342, M1955P778

obj$data$otu_counts <- calc_taxon_abund(obj, "otu_counts")

## No `cols` specified, so using all numeric columns:
##    M1981P563, M1977P1709, M1980P502 ... M1978P342, M1955P778
## Summing per-taxon counts from 292 columns for 186 taxa

# heat tree
set.seed(5)
heat_tree(obj,
    node_color = M1981P563,
    node_label = taxon_names,
    node_color_axis_label = "Relative Abundance",
    output_file = "filtered_traditional.pdf"
  )

filtered_traditional

Here you can see that all taxa have subtaxa up to order level

# filter taxa by order with supertaxa manually (need to reload data)
load("filtered_data.Rdata")
# reduce taxa for better visualization
obj <- filter_taxa(obj, grepl(pattern = "[a-zA-Z]+$", taxon_names), supertaxa = T)
print(obj)

## <Taxmap>
##   420 taxa: aad. Bacteria ... chs. Alcanivoracaceae
##   420 edges: NA->aad, aad->aaf, aad->aag ... amj->cfp, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 x 294
##         taxon_id OTU_ID M1981P563 M1977P1709 M1980P502 M1981P606
##         <chr>    <chr>      <dbl>      <dbl>     <dbl>     <dbl>
##       1 ben      2              7          0         2         7
##       2 beo      3          10251         59     15857       309
##       3 ben      4           1447       2496      2421      3331
##       # … with 2.168e+04 more rows, and 288 more variables:
##       #   M1981P599 <dbl>, M1981P611 <dbl>, M1981P587 <dbl>,
##       #   M1982P729 <dbl>, M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>,
##       #   M1555P221 <dbl>, …
##   0 functions:

obj <- filter_taxa(obj, taxon_ranks %in% c('k', 'p', 'c', 'o'))
print(obj)

## <Taxmap>
##   194 taxa: aad. Bacteria ... azv. Dehalococcoidales
##   194 edges: NA->aad, aad->aaf, aad->aag ... adg->ayq, ajz->azv
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 x 294
##         taxon_id OTU_ID M1981P563 M1977P1709 M1980P502 M1981P606
##         <chr>    <chr>      <dbl>      <dbl>     <dbl>     <dbl>
##       1 amg      2              7          0         2         7
##       2 amh      3          10251         59     15857       309
##       3 amg      4           1447       2496      2421      3331
##       # … with 2.168e+04 more rows, and 288 more variables:
##       #   M1981P599 <dbl>, M1981P611 <dbl>, M1981P587 <dbl>,
##       #   M1982P729 <dbl>, M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>,
##       #   M1555P221 <dbl>, …
##   0 functions:

notice that here are 194 taxa now, with the filtering above there were only 186 taxa

# calculate relative abundance
obj$data$otu_counts <- calc_obs_props(obj, "otu_counts", other_cols = T)

## No `cols` specified, so using all numeric columns:
##    M1981P563, M1977P1709, M1980P502 ... M1978P342, M1955P778
## Calculating proportions from counts for 292 columns for 21684 observations.
## Warning: The following columns will be replaced in the output:
##    M1981P563, M1977P1709, M1980P502 ... M1978P342, M1955P778

obj$data$otu_counts <- calc_taxon_abund(obj, "otu_counts")

## No `cols` specified, so using all numeric columns:
##    M1981P563, M1977P1709, M1980P502 ... M1978P342, M1955P778
## Summing per-taxon counts from 292 columns for 194 taxa

# heat tree
set.seed(5)
heat_tree(obj,
          node_color = M1981P563,
          node_label = taxon_names,
          node_color_axis_label = "Relative Abundance",
          output_file = "filtered_manual.pdf"
)

filtered_manual

Now, here are also taxa displayed where the lowest subtaxa is class level, for example 3BR-5I, which were missing in the tree above.

So, is that kind of filtering supposed to work like this? In my data I had a lot of taxa which were not assigned to the rank level I filtered to, that was how I noticed something was off. Would be great if you could add an entry about this phenomenum somewhere.

Thank you!

zachary-foster commented 4 years ago

Thanks for the feedback!

Yes, this is expected behavior, since when you do filter_taxa(obj, taxon_ranks == 'o', supertaxa = T), only taxa that are orders are returned, as well as their supertaxa , so classes with no orders, for example, are not returned. The other classes returned are only included because they were supertaxa of an order.

When people think "I want all taxa the rank of order or above", what they want is something like filter_taxa(obj, taxon_ranks %in% c('k', 'p', 'c', 'o')), as you mentioned.

In the future, I expect this will be possible: filter_taxa(obj, taxon_ranks <= 'o')

Its part of a big rewrite of the taxa package that is slowly progressing. Once that is done, I will emphasize using the > and < with taxon ranks for more intuitive subsetting of this kind and things like filter_taxa(obj, taxon_ranks == 'o', supertaxa = T) wont be used for this purpose, so hopefully that will solve the confusion.

LChWu commented 4 years ago

Ok, that makes sense than. Thank you for solving my confusion and also for the great package!

grunwaldlab / metacoder

Bug? filter_taxon with taxon_ranks removes all taxa without information with selected rank even with supertaxa = TRUE #280