Closed HenrikEckermann closed 3 years ago
This happens because, by default, the rowTree is not agglomerated. It still contains the OTU-level taxa, which do not match the agglomerated taxa. When the argument agglomerateTree = TRUE is set, it works.
Should agglomerateTree be TRUE by default? The problem is hard for a user to solve, but are there any drawbacks if it is TRUE?
> library("mia")
> data("GlobalPatterns")
> tse <- GlobalPatterns
> tse_phylum <- agglomerateByRank(GlobalPatterns, rank ="Phylum", agglomerateTree = TRUE)
Warning message:
In toTree(td) : The root is added with label 'ALL'
> tse_phylum_wont_work <- agglomerateByRank(GlobalPatterns, rank ="Phylum")
> makePhyloseqFromTreeSummarizedExperiment(tse)
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
> makePhyloseqFromTreeSummarizedExperiment(tse_phylum)
phyloseq-class experiment-level object
otu_table() OTU Table: [ 66 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 7 sample variables ]
tax_table() Taxonomy Table: [ 66 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 66 tips and 3 internal nodes ]
> makePhyloseqFromTreeSummarizedExperiment(tse_phylum_wont_work)
Error in validObject(.Object) : invalid class “phyloseq” object:
Component taxa/OTU names do not match.
Taxa indices are critical to analysis.
Try taxa_names()
@antagomir @FelixErnst
Great. I am not sure why we chose to have it FALSE by default; I do not immediately see reasons against changing it to TRUE. In a way, TRUE would seem better justified if the user is interested in agglomeration. What are the advantages of having this FALSE by default (if any)?
When agglomerateTree is TRUE, it causes an error when there is no rowTree. That can be easily handled, e.g. by checking whether rowTree exists.
Also, pruning causes a warning, but I think that is how it works: Warning in toTree(td) : The root is added with label 'ALL'
I think this is ok.
In case you want to agglomerate but keep the tree in its original state (for example a tree generated via FastTree), then the default value is much more user friendly.
I would suggest keeping the default value FALSE, but including a check in makePhyloseqFromTreeSummarizedExperiment. This is where the error occurs, since the input is not validated in the required detail.
In addition, you can also view it like this: makePhyloseqFromTreeSummarizedExperiment has to work with every TSE out there, not only the ones created/modified via mia functions. Since @HenrikEckermann used a valid TSE object as input, the fault lies within makePhyloseqFromTreeSummarizedExperiment and not somewhere else.
That's ok for me. I was just wondering whether the user assumes that the rowTree is also agglomerated when they do the agglomeration.
Also, making agglomerateTree = TRUE breaks a couple of functions, so this might be a better approach.
So:
agglomerateTree stays as FALSE
makePhyloseqFromTreeSummarizedExperiment checks whether rownames and tip labels match:
if( any(!rowTree(x)$tip.label %in% rownames(x)) ){
    # Signal the mismatch instead of evaluating a bare string, which does nothing
    stop("rowTree does not match with rownames. Check rowTree(x)$tip.label and rownames(x)...",
         call. = FALSE)
}
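The check proposed above can be tried out without any Bioconductor packages. The following is only a minimal sketch in base R: tree is a plain list standing in for a phylo object, row_names stands in for rownames(x), and all labels are made up for illustration. It reports mismatches in both directions so that it is clear which side is off.

```r
# Sketch only: `tree` is a plain list standing in for a phylo object and
# `row_names` stands in for rownames(x). All labels are invented.
tree <- list(tip.label = c("OTU1", "OTU2", "OTU3", "OTU4"))
row_names <- c("Phylum:Firmicutes", "Phylum:Proteobacteria")

# Tips that have no corresponding row, and rows that have no corresponding tip
tips_without_row <- setdiff(tree$tip.label, row_names)
rows_without_tip <- setdiff(row_names, tree$tip.label)

labels_match <- length(tips_without_row) == 0 && length(rows_without_tip) == 0

if (!labels_match) {
    message("rowTree does not match rownames: ",
            length(tips_without_row), " tip(s) without a row, ",
            length(rows_without_tip), " row(s) without a tip.")
}
```

After agglomeration without agglomerateTree = TRUE, the situation resembles this toy case: the tree still carries OTU-level tips while the rows carry rank-level names.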
Is that the only assumption a phyloseq object requires? I mean a match between tips and rownames? If yes, pruning the tree with a warning in makePhyloseqFromTreeSummarizedExperiment could also work.
I think that is the only assumption: tips of rowtree and rownames must match.
I think
if (rowTree is wrong){
rowTree(x) <- addTaxonomyTree(x)
}
would do that. However, I'm wondering whether there could be a situation where that does not work. Should it be wrapped, e.g., in a try-catch? Otherwise the error could be hard for the user to solve.
Actually, as far as I know, rownames(x) do not have to match rowTree(x)$tip. However, every rowTree(x)$tip must be found in rownames(x), but rownames(x) can also contain higher-level taxa. E.g. it can contain Kingdom:Bacteria when the data is agglomerated to, e.g., the Phylum level.
This would mean that agglomerateByRank could, in principle, just add the higher-level agglomerated taxa to the rowData? At the moment, agglomeration replaces rowData entirely.
We have at least three options for storing the agglomerated assay:
We currently use option 1. I wonder how explicit this decision was, and whether this choice is in line with how these are treated in, e.g., single-cell sequencing. Ideally, we could support best practices if these have already been developed in other application areas of the tse class. This seems like a key design aspect.
This is not a discussion about agglomerateByRank, but about makePhyloseqFromTreeSummarizedExperiment. Therefore everything in this issue should be about the latter function.
As Tuomas has said, the assumptions a phyloseq object requires for a tree are quite soft and similar to the ones TSE has. The difference is that TSE matches the relation with rowLinks and does not require name matches. From my point of view, the tree tips linked to rows need to be renamed, and then it would work. (Have a look at the TSE documentation.)
Otherwise, the solution
if (rowTree is wrong){
rowTree(x) <- addTaxonomyTree(x)
}
should also work.
@antagomir
The workflow you describe is covered by splitByRanks / unsplitByRanks (have a look at the man page or the pkgdown site). It uses agglomerateByRank internally, but we cannot add another layer of functionality on top of an already complex function. This would be hell.
Very good, I had missed splitByRanks / unsplitByRanks, and it is indeed great to have that as well, although irrelevant here. We just need a way to convert to phyloseq now, and the latest suggestion by @FelixErnst on renaming the tree tips linked to rows seems good to me.
I updated using remotes::install_github("microbiome/mia")
With the same dataset I can now convert the Phylum level TSE object to pseq, although that gives a warning:
Warning message:
In makePhyloseqFromTreeSummarizedExperiment(tse_phylum) :
rowTree is pruned to match rownames.
However, trying to do the same at the genus level gives another error now:
pseq_genus <- makePhyloseqFromTreeSummarizedExperiment(tse_genus)
Error in method(object) :
rowTree: Duplicated labels are not allowed for leaves.
That warning is expected; the error is not.
That error happens when pruning is done.
if( !is.null(rowTree(x)) && any(!( rowTree(x)$tip) %in% rownames(x)) ){
    # Gets node labels
    node_labs <- rowLinks(x)$nodeLab
    # Gets the corresponding rownames
    node_labs_rownames <- rownames(rowLinks(x))
    # Prunes the tree
    tree_pruned <- ape::keep.tip(rowTree(x), node_labs)
    # Replace tip labels with corresponding rownames
    # THERE ARE SOME ROWNAMES THAT ARE DUPLICATED
    tree_pruned$tip.label <- node_labs_rownames
    # Assigns the pruned tree back to TSE object
    # ERROR, THERE CANNOT BE DUPLICATED ROWNAMES!!!
    rowTree(x) <- tree_pruned
    warning("rowTree is pruned to match rownames.")
}
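To make the failure mode concrete, here is a small sketch of the prune-then-relabel step on a toy tree. It assumes the ape package is installed; the links data frame is a made-up stand-in for rowLinks(x), and its row and nodeLab values are invented. Because the tip order of a pruned tree is not guaranteed to follow the order of the labels passed to keep.tip, the sketch maps names through match() rather than assigning them positionally, and it checks for duplicates before anything is assigned back.

```r
library(ape)

set.seed(1)
# Toy tree with five OTU-level tips
tree <- rtree(5, tip.label = paste0("OTU", 1:5))

# Hypothetical stand-in for rowLinks(x): row name -> linked tip label
links <- data.frame(
    row     = c("Genus:Clostridium", "Genus:Clostridium", "Genus:Bacillus"),
    nodeLab = c("OTU1", "OTU3", "OTU5"),
    stringsAsFactors = FALSE
)

# Prune the tree to the tips that are linked to rows
tree_pruned <- keep.tip(tree, links$nodeLab)

# Map each remaining tip to its row name by label, not by position
new_labels <- links$row[match(tree_pruned$tip.label, links$nodeLab)]

# Detect duplicated labels before writing them into the tree
dup <- new_labels[duplicated(new_labels)]
if (length(dup) > 0) {
    message("Duplicated tip labels after renaming: ",
            paste(unique(dup), collapse = ", "))
}
```

In this toy case two rows share the name Genus:Clostridium, which mirrors the duplicated genus names listed for GlobalPatterns.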
For example, GlobalPatterns has the following duplicated tip labels, and because of these duplicated names the error occurs:
> pruned_tree$tip.label[duplicated(pruned_tree$tip.label)]
[1] "Genus:Streptomyces" "Family:Alteromonadaceae" "Family:Thiotrichaceae" "Family:Sinobacteraceae" "Family:Rhodobacteraceae"
[6] "Genus:Mycoplana" "Genus:Clostridium" "Genus:Eubacterium" "Genus:Clostridium" "Genus:Clostridium"
[11] "Genus:Bacteroides" "Genus:Eubacterium" "Genus:Clostridium" "Genus:Bacteroides" "Genus:Ruminococcus"
[16] "Genus:Eubacterium" "Genus:Bacillus"
Actually, I found that this also leads to an error, so I think this needs to be checked more thoroughly:
> tse_genus_ <- tse_genus
> rowTree(tse_genus_) <- NULL
> pseq_phylum <- makePhyloseqFromTreeSummarizedExperiment(tse_genus_)
[1] "1111111111111111111111111111111111111"
[1] "222222222222222222222222222222"
[1] "333333333333333333333333333"
Error in `taxa_names<-`(`*tmp*`, value = gsub("\"", "", taxa_names(x), :
taxa_names<-: You are attempting to assign duplicated taxa_names
So, this line also leads to an error:
# Creates a phyloseq object
phyloseq <- do.call(phyloseq::phyloseq, args)
So duplicated rownames do not seem to be allowed.
> tse_genus <- agglomerateByRank(GlobalPatterns, rank ="Genus", na.rm = TRUE)
> rowData(tse_genus)[duplicated(rownames(tse_genus)),]
DataFrame with 13 rows and 7 columns
Kingdom Phylum Class Order Family Genus Species
<character> <character> <character> <character> <character> <character> <character>
Streptomyces Bacteria Actinobacteria Actinobacteria Actinomycetales Streptomycetaceae Streptomyces NA
Mycoplana Bacteria Proteobacteria Alphaproteobacteria Caulobacterales Caulobacteraceae Mycoplana NA
Clostridium Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae Clostridium NA
Eubacterium Bacteria Firmicutes Clostridia Clostridiales ClostridialesFamilyX.. Eubacterium NA
Clostridium Bacteria Firmicutes Clostridia Clostridiales ClostridialesFamilyX.. Clostridium NA
... ... ... ... ... ... ... ...
Clostridium Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Clostridium NA
Bacteroides Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Bacteroides NA
Ruminococcus Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae Ruminococcus NA
Eubacterium Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae Eubacterium NA
Bacillus Bacteria Firmicutes Bacilli Bacillales Planococcaceae Bacillus NA
I think phyloseq and tree do not allow duplicated rownames / tip labels. This could work, but what are your thoughts?
I think this should be added to both sides (general and TSE) of the function:
if( any(duplicated(rownames(x))) ){
    rownames(x) <- make.unique(rownames(x))
}
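For reference, this is what make.unique does to a character vector with repeats. The sketch uses base R only, and the names are invented examples rather than the actual rownames of any object.

```r
# Invented example names with repeats
row_names <- c("Clostridium", "Eubacterium", "Clostridium",
               "Clostridium", "Bacteroides")

unique_names <- row_names
if (anyDuplicated(row_names) > 0) {
    # The first occurrence is kept; later ones get a numeric suffix
    unique_names <- make.unique(row_names)
}

print(unique_names)
# [1] "Clostridium"   "Eubacterium"   "Clostridium.1" "Clostridium.2" "Bacteroides"
```

The suffixes guarantee uniqueness but carry no taxonomic meaning, so it is no longer visible from the name alone which family each Clostridium row belongs to.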
How about using addTaxonomyLabel if duplicated rownames exist?
The function is quite powerful, but mostly used internally. It is basically a please-make-my-labels-work kind of function. Maybe give it a try?
Sure, but I couldn't find this addTaxonomyLabel. Which package provides it?
mia, but it is getTaxonomyLabels, my mistake.
https://github.com/microbiome/mia/blob/3c46cc4089b78ce9a587150ab38683eab0008a3a/R/taxonomy.R#L246
ok, thanks, I will check that
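As an illustration of why a taxonomy-aware label can be preferable to a numeric suffix, here is a rough base R sketch that disambiguates repeated genus names with their family. The helper label_with_family and the small tax table are hypothetical and written only for this example; this is not how the mia function is implemented.

```r
# Hypothetical taxonomy table; values are illustrative only
tax <- data.frame(
    Family = c("Clostridiaceae", "Lachnospiraceae",
               "Ruminococcaceae", "Planococcaceae"),
    Genus  = c("Clostridium", "Clostridium", "Ruminococcus", "Bacillus"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: prefix the family only where the genus name repeats
label_with_family <- function(tax) {
    labels <- tax$Genus
    is_dup <- labels %in% labels[duplicated(labels)]
    labels[is_dup] <- paste(tax$Family[is_dup], labels[is_dup], sep = "_")
    # Fall back to numeric suffixes if family + genus still collide
    make.unique(labels)
}

labels <- label_with_family(tax)
print(labels)
# [1] "Clostridiaceae_Clostridium"  "Lachnospiraceae_Clostridium"
# [3] "Ruminococcus"                "Bacillus"
```

Unambiguous names are left untouched, so only the rows that would otherwise collide change.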
Hi,
if we first agglomerate data in a tse object to e.g. Phylum level and then try to convert to pseq, we get the error: Error in validObject(.Object): invalid class "phyloseq" object: Component taxa/OTU names do not match. The function only works when the data has not been agglomerated before. Example where tse is any tse object with e.g. a count assay:
I had a look at the function. If I extract just the first part of the function:
and use it on the agglomerated tse object, then it works fine. Unfortunately, I could not go further than that, as I need to read up on the methods that we use in the mia package to understand what is going on...