joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
567 stars 187 forks source link

tax_glom bad_empty agglomerating taxa based on NAs #1667

Open jsogin574 opened 1 year ago

jsogin574 commented 1 year ago

Hi, I was doing something and realized that tax_glom is not working how I thought it is supposed to. How I thought it was supposed to work is that with bad_empty, taxa would not be agglomerated based on the presence of NAs. This makes sense, as two sequences classified as 'unresolved' by a classifier may very well come from separate organisms; unfortunately this becomes tricky with bacteria with intragenomic heterogeneity of 16s rRNA genes.

Here is a reproducible example: `

using phyloseq version 1.42.0

data("GlobalPatterns") GP <- GlobalPatterns

Let's agglomerate the data by 'Species'

I am not removing taxa classified as NA, i.e., NArm=F

GP_glomspecies_nobadempty <- tax_glom(GP, taxrank = "Species", NArm=F, bad_empty=c()) ##no bad_empty setting GP_glomspecies <- tax_glom(GP, taxrank ="Species", NArm=F, bad_empty=c(NA, "", " ", "\t")) ##default bad_empty setting

Let's compare the species agglomeration for the genus 'Methylobacterium'

This is a good choice for this example because there aren't too many total OTUs for this genus, there are several NAs, there are two resolved species observed, and for each of those species there are two OTUs assigned.

this is the non-agglomerated data

tax_table(subset_taxa(GP, Genus=="Methylobacterium"))

image

this is the agglomerated data with no bad_empty setting, i.e. bad_empty=c()

tax_table(subset_taxa(GP_glomspecies_nobadempty, Genus=="Methylobacterium"))

image

this is the default bad_empty setting, i.e. bad_empty=c(NA, "", " ", "\t")

tax_table(subset_taxa(GP_glomspecies, Genus=="Methylobacterium"))

image

`

ycl6 commented 1 year ago

Hi @jsogin574

It does seem the function is not working as intended. The way that rank info for each ASV was concatenated as a string in Line 408 of the original code makes the subset in Line 413 to be unfeasible. Ignoring ASV that contains bad_empty character(s) in their taxonomy info has to be achieved in a different way.

As the original author has not been active in this repository, your best bet will be to write your own code to do the merging in the way you wanted.

library(phyloseq)
data(GlobalPatterns)

# Make a smaller demo
subset = subset_taxa(GlobalPatterns, Genus == "Methylobacterium")

#bad_empty: (Optional). Character vector. Default: ‘c(NA, "", " ",
#          "\t")’. Defines the bad/empty values that should be ignored
#          and/or considered unknown. They will be removed from the
#          internal agglomeration vector derived from the argument to
#          ‘tax’, and therefore agglomeration will not combine taxa
#          according to the presence of these values in ‘tax’.
#          Furthermore, the corresponding taxa can be optionally pruned
#          from the output if ‘NArm’ is set to ‘TRUE’.

taxrank = "Species"
bad_empty = c(NA, "", " ", "\t")

CN  = which( rank_names(subset) %in% taxrank )

# Line 407-408
tax = as(access(subset, "tax_table"), "matrix")[, 1:CN, drop=FALSE]
tax = apply(tax, 1, function(i){paste(i, sep=";_;", collapse=";_;")})

head(tax)
#>                                                                                                                               335585 
#>                        "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA" 
#>                                                                                                                               579710 
#>                        "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA" 
#>                                                                                                                                81189 
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae" 
#>                                                                                                                               548603 
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae" 
#>                                                                                                                               107137 
#>                        "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA" 
#>                                                                                                                                 4797 
#>                        "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"

head(tax[ !(tax %in% bad_empty) ])
#>                                                                                                                               335585 
#>                        "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA" 
#>                                                                                                                               579710 
#>                        "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA" 
#>                                                                                                                                81189 
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae" 
#>                                                                                                                               548603 
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae" 
#>                                                                                                                               107137 
#>                        "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA" 
#>                                                                                                                                 4797 
#>                        "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"

head(tax[grep("NA", tax, invert = T)])
#>                                                                                                                                   81189 
#>    "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae" 
#>                                                                                                                                  548603 
#>    "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae" 
#>                                                                                                                                  166094 
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumorganophilum" 
#>                                                                                                                                  208196 
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumorganophilum"

Created on 2023-04-05 by the reprex package (v2.0.1)

https://github.com/joey711/phyloseq/blob/c2605619682acb7167487f703d135d275fead748/R/transform_filter-methods.R#L396-L413

slambrechts commented 4 months ago

We experience the same problem, bad_empty is not working