tse to pseq works only when tse is not agglomerated

HenrikEckermann commented 3 years ago

Hi,

if we first agglomerate data in a tse object to e.g. Phylum level and then try to convert to pseq, we get an error Error in validObject(.Object): invalid class "phyloseq" object: Component taxa/OTU names do not match.. The function only works when data is not agglomerated before.

Example where tse is any tse object with e.g. a count assay:

tse_phylum <- agglomerateByRank(tse, rank = "Phylum")
pseq_phylum <- makePhyloseqFromTreeSummarizedExperiment(tse_phylum)

I had a look at the function: If I take out the function of just the first part:

setMethod("makePhyloseqFromTreeSummarizedExperiment",
     signature = c(x = "SummarizedExperiment"),
  function(x, abund_values = "counts")

and use this function on the agglomerated tse object, then it works fine. Unfortunately, I could not go further than that as I need to read on about the methods that we use in the mia package to understand what is going on...

TuomasBorman commented 3 years ago

It is caused, because by default rowtree is not agglomerated. It still has those OTU level taxa, which do not match with agglomerated taxa. When argument agglomerateTree = TRUE, then it works.

Should the value of agglomerateTree be TRUE by default. The problem is hard to solve by user, but is there any drawbacks if it is TRUE?


> library("mia")
> data("GlobalPatterns")
> tse <- GlobalPatterns
> tse_phylum <- agglomerateByRank(GlobalPatterns, rank ="Phylum", agglomerateTree = TRUE)
Warning message:
In toTree(td) : The root is added with label 'ALL'
> tse_phylum_wont_work <- agglomerateByRank(GlobalPatterns, rank ="Phylum")
> makePhyloseqFromTreeSummarizedExperiment(tse)
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
> makePhyloseqFromTreeSummarizedExperiment(tse_phylum)
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 66 taxa and 26 samples ]
sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
tax_table()   Taxonomy Table:    [ 66 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 66 tips and 3 internal nodes ]
> makePhyloseqFromTreeSummarizedExperiment(tse_phylum_wont_work)
 Error in validObject(.Object) : invalid class “phyloseq” object: 
 Component taxa/OTU names do not match.
 Taxa indices are critical to analysis.
 Try taxa_names()

@antagomir @FelixErnst

antagomir commented 3 years ago

Great. I am not sure why we have chosen to have it FALSE by default, I do not immediately see reasons against changing to TRUE. In a way, that would seem better justified if the user is interested in agglomeration? What are the advantages of having this FALSE by default (if any)?

TuomasBorman commented 3 years ago

When agglomerateTree is TRUE, it causes an error when there is no rowTree. That can be easily handled, e.g. by checking if rowTree exists

Also, pruning causes a warning but I think it's how it works Warning in toTree(td) : The root is added with label 'ALL'

antagomir commented 3 years ago

I think this is ok.

FelixErnst commented 3 years ago

In case you want to agglomerate but keep the tree in its original state (for example a tree generated via FastTree), then the default value is much more user friendly.

I would suggest keeping the default value FALSE, but include a check in makePhyloseqFromTreeSummarizedExperiment. This is where the error occurs, since the input is not validate in the required detail.

FelixErnst commented 3 years ago

In addition you can also view it like this:

makePhyloseqFromTreeSummarizedExperiment has to work with every TSE out there, not the ones only created/modified via mia functions. Since @HenrikEckermann used a valid TSE object as input, the fault lies within makePhyloseqFromTreeSummarizedExperiment and not somewhere else.

TuomasBorman commented 3 years ago

That's ok for me. I was just wondering if user makes an assumption that also the rowTreeis agglomerated when user does the agglomeration.

Also making agglomerateTree = TRUE breaks couple of functions so this might be a better approach

So:

agglomerateTree stays as FALSE

makePhyloseqFromTreeSummarizedExperiment check if rownames and tip labels match

if( any(!rowTree(x)$tip %in% rownames(x)) ) {
"rowTree does not match with rownames. Check rowTree(x)$tip and rownames(x)..."
}

FelixErnst commented 3 years ago

Is that the only assumption a phyloseq object requires? I mean a match between tips and rownames? If yes, pruning the tree with a warning in makePhyloseqFromTreeSummarizedExperiment could also work.

TuomasBorman commented 3 years ago

I think that is the only assumption: tips of rowtree and rownames must match.

I think

if (rowTree is wrong){
rowTree(x) <- addTaxonomyTree(x) 
}

would do that. However, I'm wondering if there could be a situation where that does not work. Should that be, e.g., inside try-catch? Otherwise the error could be hard to solve by user

TuomasBorman commented 3 years ago

Actually as far as I know, rownames(x) do not have to match with rowTree(x)$tip. However, every rowTree(x)$tip must be found from rownames(x), but rownames(x)can contain higher level taxa also. E.g. it can contain Kingdom:Bacteria, when data is agglomerated to, e.g., Phylum level.

antagomir commented 3 years ago

This would mean that agglomerateByRank could, in principle, just add the higher-level agglomerated taxa to the rowData? - at the moment agglomeration replaces rowData entirely.

We have at least three options for storing the agglomerated assay:

Replace rowData with the agglomerated version
Add the agglomerated data to existing rowData, as higher-level nodes
Add the agglomerated data to the tse as altExp (same samples, different features)

We use now option 1. I wonder how explicit decision this has been, and is this choice in line with how these are being treated e.g. single cell sequencing. Ideally, we could support the best practices if these have already been developed in other application areas of the tse class. This seems like a key design aspect.

FelixErnst commented 3 years ago

This is not a discussion about agglomerateByRank, but about makePhyloseqFromTreeSummarizedExperiment. Therefore everything in this issue should be about the latter function.

As Tuomas has said, the assumptions a phyloseq object requires for a tree are quite soft and similar to the one TSE has. The difference is, that TSE matches the relation with rowLinks and does not require name matches. From my point of view the tree tips linked to rows, need to be renamed and then it would work. (Have a look at the TSE documentation)

Otherwise, the solution

if (rowTree is wrong){
rowTree(x) <- addTaxonomyTree(x) 
}

should also work.

FelixErnst commented 3 years ago

@antagomir The workflow you describe is covered by splitByRanks/unsplitByRanks (Have a look at the man page or the pkgdown site). It uses agglomerateByRanks internally, but we cannot add another layer of functionality on top of an already compley function. This would be hell

antagomir commented 3 years ago

Very good, I had missed splitByRanks / unsplitByRanks and indeed great to have that as well, although irrelevant here. We just need a way to convert to phyloseq now, and the latest suggestion on renaming the tree tips linked to rows by @FelixErnst seems good to me.

HenrikEckermann commented 3 years ago

I updated using remotes::install_github("microbiome/mia") With the same dataset I can now convert the Phylum level TSE object to pseq, although that gives a warning:

Warning message:
In makePhyloseqFromTreeSummarizedExperiment(tse_phylum) :
  rowTree is pruned to match rownames.

However, trying to do the same at the genus level gives another error now:

pseq_genus <- makePhyloseqFromTreeSummarizedExperiment(tse_genus)
Error in method(object) : 
rowTree: Duplicated labels are not allowed for leaves.

TuomasBorman commented 3 years ago

That warning is expected, error not.

That error happens when pruning is done.


        if( !is.null(rowTree(x)) && any(!( rowTree(x)$tip) %in% rownames(x)) ){
            # Gets node labels
            node_labs <- rowLinks(x)$nodeLab
            # Gets the corresponding rownames
            node_labs_rownames <- rownames(rowLinks(x))
            # Prunes the tree
            tree_pruned <- ape::keep.tip(rowTree(x), node_labs)
            # Replace tip labels with corresponding rownames 
            # THERE ARE SOME ROWNAMES THAT ARE DUPLICATED
            tree_pruned$tip.label <- node_labs_rownames
            # Assigns the pruned tree back to TSE object
            # ERROR, THERE CANNOT BE DUPLICATED ROWNAMES!!!
            rowTree(x) <- tree_pruned
            warning("rowTree is pruned to match rownames.")
        }

For example, GlobalPatterns has these duplicated, and because there are duplicated names, error occurs

> pruned_tree$tip.label[duplicated(pruned_tree$tip.label)]
 [1] "Genus:Streptomyces"      "Family:Alteromonadaceae" "Family:Thiotrichaceae"   "Family:Sinobacteraceae"  "Family:Rhodobacteraceae"
 [6] "Genus:Mycoplana"         "Genus:Clostridium"       "Genus:Eubacterium"       "Genus:Clostridium"       "Genus:Clostridium"      
[11] "Genus:Bacteroides"       "Genus:Eubacterium"       "Genus:Clostridium"       "Genus:Bacteroides"       "Genus:Ruminococcus"     
[16] "Genus:Eubacterium"       "Genus:Bacillus"

Actually, what I found, this also lead to an error, so I think this needs to be checked more throughout

> tse_genus_ <- tse_genus
> rowTree(tse_genus_) <- NULL
> pseq_phylum <- makePhyloseqFromTreeSummarizedExperiment(tse_genus_)
[1] "1111111111111111111111111111111111111"
[1] "222222222222222222222222222222"
[1] "333333333333333333333333333"
 Error in `taxa_names<-`(`*tmp*`, value = gsub("\"", "", taxa_names(x),  : 
  taxa_names<-: You are attempting to assign duplicated taxa_names

So, this also leads to an error

        # Creates a phyloseq object
        phyloseq <- do.call(phyloseq::phyloseq, args)

So duplicated rownames seems to be not allowed

TuomasBorman commented 3 years ago

> tse_genus <- agglomerateByRank(GlobalPatterns, rank ="Genus", na.rm = TRUE)
> rowData(tse_genus)[duplicated(rownames(tse_genus)),]
DataFrame with 13 rows and 7 columns
                 Kingdom         Phylum               Class           Order                 Family        Genus     Species
             <character>    <character>         <character>     <character>            <character>  <character> <character>
Streptomyces    Bacteria Actinobacteria      Actinobacteria Actinomycetales      Streptomycetaceae Streptomyces          NA
Mycoplana       Bacteria Proteobacteria Alphaproteobacteria Caulobacterales       Caulobacteraceae    Mycoplana          NA
Clostridium     Bacteria     Firmicutes          Clostridia   Clostridiales         Clostridiaceae  Clostridium          NA
Eubacterium     Bacteria     Firmicutes          Clostridia   Clostridiales ClostridialesFamilyX..  Eubacterium          NA
Clostridium     Bacteria     Firmicutes          Clostridia   Clostridiales ClostridialesFamilyX..  Clostridium          NA
...                  ...            ...                 ...             ...                    ...          ...         ...
Clostridium     Bacteria     Firmicutes          Clostridia   Clostridiales        Lachnospiraceae  Clostridium          NA
Bacteroides     Bacteria     Firmicutes          Clostridia   Clostridiales        Lachnospiraceae  Bacteroides          NA
Ruminococcus    Bacteria     Firmicutes          Clostridia   Clostridiales        Ruminococcaceae Ruminococcus          NA
Eubacterium     Bacteria     Firmicutes          Clostridia   Clostridiales        Ruminococcaceae  Eubacterium          NA
Bacillus        Bacteria     Firmicutes             Bacilli      Bacillales         Planococcaceae     Bacillus          NA

TuomasBorman commented 3 years ago

I think phyloseq and tree do not allow to use duplicated rownames / tip labels. This could work, but what are your thoughts?

I think this should be added to both sides (general and TSE) of the function

if( any(duplicated(rownames(x))) ){
  rownames(x) <- make.unique(rownames(x))
}

FelixErnst commented 3 years ago

How about using addTaxonomyLabel, if duplicated rownames exist?

The function is quite powerful, but mostly used internally. It is basically a please-make-my-labels-work kind of function. Maybe give it a try?

TuomasBorman commented 3 years ago

Sure, but I couldn't find this addTaxonomyLabel, which package provides it?

FelixErnst commented 3 years ago

mia, but it is getTaxonomyLabel, my mistake.

https://github.com/microbiome/mia/blob/3c46cc4089b78ce9a587150ab38683eab0008a3a/R/taxonomy.R#L246

TuomasBorman commented 3 years ago

ok, thanks, I will check that

microbiome / mia

tse to pseq works only when tse is not agglomerated #108