microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0
45 stars 25 forks source link

Improve Tengeler2020 #545

Closed RiboRings closed 1 month ago

RiboRings commented 1 month ago

Hi! Tengeler2020 now has more informative feature names and unique labels for tree tips and nodes. Related to https://github.com/microbiome/miaViz/issues/123.

library(mia)
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

head(rownames(tse))
# [1] "Bacteroides"     "Bacteroides_1"   "Parabacteroides"
# [4] "Bacteroides_2"   "Akkermansia"     "Bacteroides_3"

head(rowTree(tse)$tip.label)
# [1] "[Eubacterium]_coprostanoligenes_group"
# [2] "[Clostridium]_innocuum_group"         
# [3] "[Clostridium]_innocuum_group_1"       
# [4] "Dielma"                               
# [5] "Unidentified_Lachnospiraceae_12"      
# [6] "Epulopiscium"

head(rowTree(tse)$node.label)
# [1] "node_1" "node_2" "node_3" "node_4" "node_5" "node_6"
antagomir commented 1 month ago

Just checking: would it make sense to have identical rownames and tip.labels? If there is a one-to-one match (?)

RiboRings commented 1 month ago

There is a one-to-one match, but the rownames and the tree tips are in a different order. Not sure why. In GlobalPatterns they have the same order, in Tengeler2020 they don't (see below).

data(GlobalPatterns)
head(rownames(GlobalPatterns))
[1] "549322" "522457" "951"    "244423" "586076" "246140"
head(rowTree(GlobalPatterns)$tip.label)
[1] "549322" "522457" "951"    "244423" "586076" "246140"

data(Tengeler2020)
head(rownames(Tengeler2020))
[1] "1726470"  "1726471"  "17264731" "17264726" "1726472"  "17264724"
head(rowTree(Tengeler2020)$tip.label)
[1] "172647198" "1726478"   "1726479"   "172647201" "172647222" "17264798"

But the situation with different orders seems right when looking at the tree, whereas the same order gives an incorrect tree:

library(miaViz)
plotRowTree(Tengeler2020, edge_colour_by = "Phylum")

Screenshot 2024-05-17 at 10 05 44

rowTree(Tengeler2020)$tip.label <- rownames(Tengeler2020)
plotRowTree(Tengeler2020, edge_colour_by = "Phylum")

Screenshot 2024-05-17 at 10 06 28

antagomir commented 1 month ago

Probably no one paid attention to this when the dataset was constructed.

It might be good for educational purposes to fix the data set and give them the same order?

Also check if TreeSummarizedExperiment constructor throws a warning if these are not in the same order; I think it should. If it doesn't we could at least propose an issue in that package?

TuomasBorman commented 1 month ago

Tips can be in any order / any name. rowLinks links rows and rowTree tips. rowLinks allows more flexible matching; it is not problem or bug. There could be a feature that is not in tips. For example, if dataset is agglomerated to Phylum level, and certain taxon has only kingdom level information. Then this row is not linked to tips but to node.

antagomir commented 1 month ago

A main consideration is that the methods should be sufficientyl intuitive so that users will at least notice if they are trying to do something unexpected. Does the plotRowTree example indicate any need for changes?

TuomasBorman commented 1 month ago

This is related to this issue https://github.com/microbiome/miaViz/issues/123

I checked the problem, and the problem was that tree nodes/tips were not named unique way. I believe we have not taken into account the fact the links between rows and tree. We have just used rows and tree directly

antagomir commented 1 month ago

We should take some measures to ensure that mistakes would not happen too easily with this.

To some extent it should be checked already when constructing TreeSE (ie in the TreeSummarizedExperiment package) but it might not be a bad idea to have some generic extra checks (hidden functions) in place in mia, to throw warning if there is something suspicious going on?

RiboRings commented 1 month ago

I'm not familiar specifically with the construction of phylo objects, but to me it doesn't seem that many things can go wrong when building a TreeSE. If something is out of place, the constructor usually raises an error, or some function later on (like in this case).

Also maybe it's not that bad to have a TreeSE that is working but is not perfect, and realise the problem only when a function (like plotRowTree) specifically needs the component that has a problem. So I would mainly focus on nice messages when errors or warnings arise.

antagomir commented 1 month ago

Yes I agree that in some sense it is nice to have problematic demo data sets, this helps us to constantly test the robustness of the methods and workflows.

Pedagogically it is sometimes useful, sometimes harmful. It depends also on the level of teaching and learners.

RiboRings commented 1 month ago

Is this PR then safe to merge or should we consider changes?

antagomir commented 1 month ago

For me OK to merge (if @TuomasBorman agrees) but could we check the mia & OMA examples that use the Tengeler tree data (if any?) for possible issues? In addition, it would be good to have a clear example demonstrating this somewhat unexpected behavior; for instance in the data set documentation?

TuomasBorman commented 1 month ago

Yes, I think it is ok to merge. I also fixed the issue about plotRowTree https://github.com/microbiome/miaViz/pull/124

TuomasBorman commented 1 month ago

We have not been able to fix rworkflows yet, but I think it is ok to bypass check since we are only modifying the dataset