Open jsogin574 opened 1 year ago
Hi @jsogin574
It does seem the function is not working as intended. The way that rank info for each ASV was concatenated as a string in Line 408 of the original code makes the subset in Line 413 to be unfeasible. Ignoring ASV that contains bad_empty
character(s) in their taxonomy info has to be achieved in a different way.
As the original author has not been active in this repository, your best bet will be to write your own code to do the merging in the way you wanted.
library(phyloseq)
data(GlobalPatterns)
# Make a smaller demo
subset = subset_taxa(GlobalPatterns, Genus == "Methylobacterium")
#bad_empty: (Optional). Character vector. Default: ‘c(NA, "", " ",
# "\t")’. Defines the bad/empty values that should be ignored
# and/or considered unknown. They will be removed from the
# internal agglomeration vector derived from the argument to
# ‘tax’, and therefore agglomeration will not combine taxa
# according to the presence of these values in ‘tax’.
# Furthermore, the corresponding taxa can be optionally pruned
# from the output if ‘NArm’ is set to ‘TRUE’.
taxrank = "Species"
bad_empty = c(NA, "", " ", "\t")
CN = which( rank_names(subset) %in% taxrank )
# Line 407-408
tax = as(access(subset, "tax_table"), "matrix")[, 1:CN, drop=FALSE]
tax = apply(tax, 1, function(i){paste(i, sep=";_;", collapse=";_;")})
head(tax)
#> 335585
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"
#> 579710
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"
#> 81189
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae"
#> 548603
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae"
#> 107137
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"
#> 4797
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"
head(tax[ !(tax %in% bad_empty) ])
#> 335585
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"
#> 579710
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"
#> 81189
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae"
#> 548603
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae"
#> 107137
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"
#> 4797
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;NA"
head(tax[grep("NA", tax, invert = T)])
#> 81189
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae"
#> 548603
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumkomagatae"
#> 166094
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumorganophilum"
#> 208196
#> "Bacteria;_;Proteobacteria;_;Alphaproteobacteria;_;Rhizobiales;_;Methylobacteriaceae;_;Methylobacterium;_;Methylobacteriumorganophilum"
Created on 2023-04-05 by the reprex package (v2.0.1)
We experience the same problem, bad_empty
is not working
Hi, I was doing something and realized that
tax_glom
is not working how I thought it is supposed to. How I thought it was supposed to work is that withbad_empty
, taxa would not be agglomerated based on the presence of NAs. This makes sense, as two sequences classified as 'unresolved' by a classifier may very well come from separate organisms; unfortunately this becomes tricky with bacteria with intragenomic heterogeneity of 16s rRNA genes.Here is a reproducible example: `
using phyloseq version 1.42.0
data("GlobalPatterns") GP <- GlobalPatterns
Let's agglomerate the data by 'Species'
I am not removing taxa classified as NA, i.e., NArm=F
GP_glomspecies_nobadempty <- tax_glom(GP, taxrank = "Species", NArm=F, bad_empty=c()) ##no bad_empty setting GP_glomspecies <- tax_glom(GP, taxrank ="Species", NArm=F, bad_empty=c(NA, "", " ", "\t")) ##default bad_empty setting
Let's compare the species agglomeration for the genus 'Methylobacterium'
This is a good choice for this example because there aren't too many total OTUs for this genus, there are several NAs, there are two resolved species observed, and for each of those species there are two OTUs assigned.
this is the non-agglomerated data
tax_table(subset_taxa(GP, Genus=="Methylobacterium"))
this is the agglomerated data with no bad_empty setting, i.e. bad_empty=c()
tax_table(subset_taxa(GP_glomspecies_nobadempty, Genus=="Methylobacterium"))
this is the default bad_empty setting, i.e. bad_empty=c(NA, "", " ", "\t")
tax_table(subset_taxa(GP_glomspecies, Genus=="Methylobacterium"))
`