tax_glom will not agglomerate "kingdom"

joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:

http://joey711.github.io/phyloseq/


tax_glom will not agglomerate "kingdom" #223

Closed aaronsaunders closed 11 years ago

aaronsaunders commented 11 years ago

I am trying to count the assignments at each level and use tax_glom to summarise the tax_table. But tax_glom() with rank_names[1] throws an error.

myData

phyloseq-class experiment-level object
OTU Table:          [6217 taxa and 33 samples]
                     taxa are rows
Sample Data:         [33 samples by 28 sample variables]:
Taxonomy Table:     [6217 taxa by 6 taxonomic ranks]:

tax_glom(physeq=myData, taxrank=rank_names(myData)[1])

Error in apply(tax, 1, function(i) { : dim(X) must have a positive length

# The data is there
table(x=as.vector(tax_table(myData)[,1]))
x

k__Archaea  k__Bacteria Unclassified 
          27         6186            4

# tax_glom for phylum level works fine.
tax_glom(physeq=myData, taxrank=rank_names(myData)[2])

phyloseq-class experiment-level object
OTU Table:          [47 taxa and 33 samples]
                     taxa are rows
Sample Data:         [33 samples by 28 sample variables]:
Taxonomy Table:     [47 taxa by 6 taxonomic ranks]:

joey711 commented 11 years ago

Sorry for the delay. This sounds at first like a difficult problem to diagnose without a reproducible example with data. I originally started writing a response asking if you could make this, but in drafting an example I was able to reproduce the error.

library("phyloseq")
data("GlobalPatterns")
tax_glom(physeq=GlobalPatterns, taxrank=rank_names(GlobalPatterns)[1])

Error in apply(tax, 1, function(i) { : dim(X) must have a positive length

tax_glom(physeq=GlobalPatterns, taxrank="Kingdom")

Error in apply(tax, 1, function(i) { : dim(X) must have a positive length

On the other hand, if you're just hoping to create a table of sums based on taxonomic elements, you don't actually need to use tax_glom. For example (continuing the code called above):

tapply(taxa_sums(GlobalPatterns), factor(tax_table(GlobalPatterns)[, "Kingdom"]), sum)

 Archaea Bacteria 
  195598 28021080

First, does this solve your problem? Second, have you encountered this other than in the left-most rank? It looks as though the problem is R automatically converting the 1-column taxonomy table character matrix into a character vector; dim() returns NULL for a plain vector, so the apply call inside tax_glom fails. I will try to sniff this out. The tapply solution above for getting sums is much faster than tax_glom, though, because it skips the careful tree pruning and other data management steps that you don't need if you just want the sums.
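To illustrate the suspected cause, here is a minimal base-R sketch that does not use phyloseq at all. The small `tax` matrix and its OTU names are made up for illustration, and whether the eventual fix in phyloseq uses `drop = FALSE` is an assumption, not something stated in this thread.

```r
# A small character matrix standing in for a taxonomy table (made-up values).
tax <- matrix(
  c("Bacteria", "Firmicutes",
    "Bacteria", "Proteobacteria",
    "Archaea",  "Euryarchaeota"),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("otu1", "otu2", "otu3"), c("Kingdom", "Phylum"))
)

# Selecting a single column silently drops the matrix to a character vector,
# so it no longer has a dim attribute.
dropped <- tax[, 1]
is.null(dim(dropped))

# apply() requires dim(X), so calling it on the vector raises the error
# reported above; tryCatch() captures the message instead of stopping.
err <- tryCatch(
  apply(dropped, 1, function(i) paste(i, collapse = ";")),
  error = function(e) conditionMessage(e)
)
err

# With drop = FALSE the result stays a 3 x 1 matrix and apply() works.
kept <- tax[, 1, drop = FALSE]
keys <- apply(kept, 1, function(i) paste(i, collapse = ";"))
keys
```

The same pattern applies to any code that subsets a matrix by a computed range of columns and then calls apply on the result: the left-most rank is the one case where the range has length one.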

joey711 commented 11 years ago

I will leave this issue open until I have sniffed out and squashed this bug. It does look like it only applies to tax_glom for the left-most rank. If anyone finds examples outside of this scope, please let me know.

joey711 commented 11 years ago

I am working out a fix. I'll announce here when it is posted to the github-devel branch. The phyloseq version number will be 1.5.20 or greater.

joey711 commented 11 years ago

Yep, this was fixed in the aforementioned commit:

04dc4336843b5e172ae0fe7cecd24b35437eefde

annidjurhuus commented 6 years ago

Hi Joey,

I have had a similar issue to the one mentioned above. I created my own taxonomic string with these functions:

split_species = function(string, n = 2) {
  splits = str_split(string, "/", n + 1)
  res = map_if(splits, ~length(.x) > 2, ~.x[1:n]) %>%
    map_chr(str_c, collapse = "/")
  return(res)
}

add_taxonomy_column = function(physeq, num_species = 2) {
  tax_df = as.data.frame(tax_table(physeq)) %>%
    rownames_to_column("OTU") %>%
    mutate(Species = split_species(Species, n = num_species)) %>%
    mutate(Taxonomy = case_when(
      is.na(Class) ~ str_c("p:", Phylum),
      is.na(Order) ~ str_c("c:", Class),
      is.na(Family) ~ str_c("o:", Order),
      is.na(Genus) ~ str_c("f:", Family),
      is.na(Species) ~ str_c("g:", Genus),
      TRUE ~ str_c(Genus, " ", Species)
    )
    )

  tax = as.matrix(tax_df[, -1])
  rownames(tax) = tax_df$OTU
  tax_table(physeq) = tax_table(tax)

  return(physeq)
}

As seen here: https://rdrr.io/github/mworkentine/mattsUtils/src/R/microbiome_helpers.R#sym-add_taxonomy_column

If applied to the GlobalPatterns dataset, this creates an output with 19216 taxa. If I use the tax_glom function on the new "Taxonomy" column, the GlobalPatterns dataset gets agglomerated to 2306 taxa; however, there are actually only 2217 unique Taxonomy strings in this dataset, i.e. there are 89 taxa not agglomerated (one example being f:Oceanospirillaceae).

Here is the example code:

data("GlobalPatterns")
GP <- add_taxonomy_column(GlobalPatterns)
ntaxa(GP)
GP_taxonomy <- tax_glom(GP, "Taxonomy")
ntaxa(GP_taxonomy)
unique <- unique(GP@tax_table[,8])

It seems that the function does not agglomerate taxa that share a Taxonomy string if they are in fact different species or if one of the higher classifications is not the same for all of them, e.g. g:Clostridium.

In my own dataset the annotation is considerably worse than that of the GlobalPatterns dataset, and in some cases the most specific classification assigned is domain Bacteria. However, if I have two taxa that are both assigned d:bacteria, these do not agglomerate when using tax_glom.

My intention here is to get as much information from the taxonomic assignments as possible, but not to have any duplicate assignments in my dataset.

I would very much appreciate your help.

Thank you, Anni
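One possible reading of the numbers above, offered as a sketch rather than a confirmed diagnosis: if agglomeration groups OTUs on every rank up to and including the chosen one, and not on the chosen column alone, then two OTUs with the same Taxonomy string but different higher ranks stay separate. The base-R example below uses a made-up three-OTU table (the rank assignments and counts are invented for illustration and are not taken from GlobalPatterns) to compare the two ways of grouping, and shows rowsum as one way to sum counts by label only.

```r
# Made-up taxonomy: otu1 and otu2 share a Taxonomy label but differ at Order.
tax <- matrix(
  c("Proteobacteria", "Oceanospirillales", "f:Oceanospirillaceae",
    "Proteobacteria", "Alteromonadales",   "f:Oceanospirillaceae",
    "Firmicutes",     "Clostridiales",     "g:Clostridium"),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("otu1", "otu2", "otu3"),
                  c("Phylum", "Order", "Taxonomy"))
)

# Made-up counts for two samples, one row per OTU.
counts <- matrix(
  c(5, 0,
    2, 3,
    1, 4),
  ncol = 2, byrow = TRUE,
  dimnames = list(rownames(tax), c("s1", "s2"))
)

# Grouping on the label alone gives one group per unique string.
label_groups <- length(unique(tax[, "Taxonomy"]))

# Grouping on all ranks pasted together keeps otu1 and otu2 apart.
full_keys <- apply(tax, 1, paste, collapse = ";")
key_groups <- length(unique(full_keys))

# Summing counts by label only, ignoring the higher ranks.
summed <- rowsum(counts, group = tax[, "Taxonomy"])
summed
```

Summing by label like this discards the tree and the higher-rank columns, so it only suits cases where a plain table of counts per label is enough.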

nick-youngblut commented 6 years ago

I'm getting a similar error with tip_glom. Here's a reproducible example:

library(phyloseq)
data(enterotype)
# create random tree
symbiont_tree = ape::rtree(phyloseq::ntaxa(enterotype))
symbiont_tree$tip.label = phyloseq::taxa_names(enterotype)
# phyloseq object with tree
physeq = phyloseq::phyloseq(
  phyloseq::otu_table(enterotype),
  phyloseq::tax_table(enterotype),
  phyloseq::sample_data(enterotype),
  phyloseq::phy_tree(symbiont_tree)
)
# tip glom
phyloseq::tip_glom(physeq, h=2)

The error is:

Error in apply(taxmerge, 2, function(i) { : 
  dim(X) must have a positive length

My sessionInfo:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /opt/microsoft/ropen/3.4.3/lib64/R/lib/libRblas.so
LAPACK: /opt/microsoft/ropen/3.4.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] phyloseq_1.22.3      RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17        compiler_3.4.3      pillar_1.2.3        plyr_1.8.4          XVector_0.18.0      iterators_1.0.9     tools_3.4.3         zlibbioc_1.24.0    
 [9] packrat_0.4.8-1     jsonlite_1.5        tibble_1.4.2        nlme_3.1-131        rhdf5_2.22.0        gtable_0.2.0        lattice_0.20-35     mgcv_1.8-22        
[17] pkgconfig_2.0.1     rlang_0.2.1         Matrix_1.2-12       foreach_1.4.4       igraph_1.2.1        parallel_3.4.3      stringr_1.3.1       cluster_2.0.6      
[25] Biostrings_2.46.0   RevoUtils_10.0.7    S4Vectors_0.16.0    IRanges_2.12.0      multtest_2.34.0     stats4_3.4.3        ade4_1.7-11         grid_3.4.3         
[33] Biobase_2.38.0      data.table_1.11.4   survival_2.41-3     reshape2_1.4.3      ggplot2_2.2.1       magrittr_1.5        splines_3.4.3       scales_0.5.0       
[41] codetools_0.2-15    MASS_7.3-47         BiocGenerics_0.24.0 biomformat_1.6.0    permute_0.9-4       ape_5.1             colorspace_1.3-2    stringi_1.2.2      
[49] lazyeval_0.2.1      munsell_0.4.3       vegan_2.5-2