Fix tree merging - Githubissues

TuomasBorman commented 2 weeks ago

1.

mergeSEs did not merge trees, it just added the trees to rowTree slot. If the datasets included multiple trees that did not share taxa, the output included multiple trees. That is not what user expects when the dataset is merged. Now the trees are merged into single tree.

2.

splitOn added hiereachy tree instead of merging the trees when update.tree = TRUE. That is also not what user expects. Now, the trees are combined (or if all taxa is shared between trees, single unchanged tree is outputted).

3.

I also fixed a bug in unsplitOn() function. If the objects being merged did not have names, this lead to an error in rowData merging step.

TuomasBorman commented 2 weeks ago

https://github.com/microbiome/mia/issues/569

TuomasBorman commented 2 weeks ago

Unit tests should be updated since they expect that multiple trees are returned

TuomasBorman commented 2 weeks ago

@Daenarys8 Can you check unit tests?

Daenarys8 commented 2 weeks ago

The following changes have been made. 1- The test expect_warning(estimateDiversity(tse, index = c("shannon", "faith"))) is removed as it is no longer valid. Before, one of the rowTree did not include all the rows which resulted in a warning, but now the tree is combined, so we remove the test.

2- In merge tests, as a result of combined tree, expect_equal(rowTree(merged), rowTree(merged3)) and expect_true( rowTree(merged, "phylo.1")$Nnode < rowTree(merged2, "phylo.1")$Nnode ) are invalid as one large tree is returned compared to multiple trees added in object as before.

3- In mergeSE tests, left and right join won't merge with different rowLinks. Could be a bug but I'm not sure. Also, failing test of nature expect_equal( rowTree(tse), rowTree(tse1)) become invalid as returned tree used for comparison is one large combined tree compared to multiple added trees before change. We also get warning when multiple nodes are found to have the same label.

TuomasBorman commented 2 weeks ago

The following changes have been made. 1- The test expect_warning(estimateDiversity(tse, index = c("shannon", "faith"))) is removed as it is no longer valid. Before, one of the rowTree did not include all the rows which resulted in a warning, but now the tree is combined, so we remove the test.

2- In merge tests, as a result of combined tree, expect_equal(rowTree(merged), rowTree(merged3)) and expect_true( rowTree(merged, "phylo.1")$Nnode < rowTree(merged2, "phylo.1")$Nnode ) are invalid as one large tree is returned compared to multiple trees added in object as before.

3- In mergeSE tests, left and right join won't merge with different rowLinks. Could be a bug but I'm not sure. Also, failing test of nature expect_equal( rowTree(tse), rowTree(tse1)) become invalid as returned tree used for comparison is one large combined tree compared to multiple added trees before change. We also get warning when multiple nodes are found to have the same label.

3 is a bug. Can you investigate the reason? I can also check later

TuomasBorman commented 2 weeks ago

I fixed the bug. I forgot to order the link data to match TreeSE.
Multiple rows can be linked to single node in tree, so that warning is not a problem.
Can you create unit tests that check that the tree merging is done correctly. The tree includes the nodes of input trees (those that correspond rows). Also you could check that the tree is pruned so that each tip should be found from rows.

TuomasBorman commented 1 week ago

Btw, as there are those warnings ("multiple nodes..."), can you add expect_warning() to those?

Daenarys8 commented 1 week ago

After merging, rows do not match with tips.

Daenarys8 commented 1 week ago

I think I have a problemin calculateUnifrac It's test won't run and tend to crash my rstudio. This goes same for rbiom::unifrac() .
Apparently it is bug in rstudio.

Daenarys8 commented 1 week ago

The test to calculate unweighted unifrac is failing because:

After changes, when the combined tree is parsed, the tree root is randomly assigned to any rowname which gives mismatches each time the distance is calculated.

TuomasBorman commented 1 week ago

Good catch, I will be back to work on Tuesday, let's check that then

Daenarys8 commented 1 week ago

Concerning this random allocation of root, what is the implication if we specify the rowname? it will fix the error but I am not sure of the bugs it will create.

TuomasBorman commented 1 week ago

Concerning this random allocation of root, what is the implication if we specify the rowname? it will fix the error but I am not sure of the bugs it will create.

The problem is not directly related to tree merging. When trees are merged, the root is added if the trees being merged have roots. I have to still check this tree binding

TuomasBorman commented 1 week ago

I found a problem with ape::bind_tree. It does not merge the trees, it just binds them. The following illustrates the problem

data("GlobalPatterns")
tse <- GlobalPatterns[1:10, ]
tse2 <- GlobalPatterns[11:20, ]

# Add hiearchy trees
tree1 <- getHierarchyTree(tse)
tree2 <- getHierarchyTree(tse2)

# Bind
tree <- ape::bind.tree(tree1, tree2)

plot(tree, show.node.label = TRUE)

As you can see, certain phylum is there twice.

The latest commit takes this into account. After binding, the tree is merged based on labels. Nodes with same names are interpreted as same, and they are combined.

TuomasBorman commented 1 week ago

One additional thing that I found was that .prune_tree prunes singletons, i.e. nodes that have only one descendant node.

Below is same tree without singletons

tree |> ape::collapse.singles() |> plot(show.node.labels = TRUE)

Removing singletons might be harmful sometimes. If a node that links to certain row is dropped, that removes also the row, when the tree is put back to TreeSE. However, if singletons are not dropped while pruning, there might be lots of extra nodes (that should be pruned?).

I will think about this but do you @antagomir have idea for this? Should singletons be dropped when tree is pruned?

TuomasBorman commented 1 week ago

Concerning this random allocation of root, what is the implication if we specify the rowname? it will fix the error but I am not sure of the bugs it will create.

Related to this unit test. Can you modify the test so that you just check that the tree.name parameter (if that was the parameter name) works. You can rename the rowTree in TreeSE and the test. No need to merge TreeSEs

Daenarys8 commented 1 week ago

I meant when is.rooted(tse) is FALSE, calculateUnifrac randomly assigns a root and calculates the distance matrix. I am not sure if rbiom::unifrac does something similar but it's not logged in messages.

Contrarily, when we root the tree before calculating the matrices, the results is the same. Should I check in detail rbiom::unifrac

TuomasBorman commented 1 week ago

I meant when is.rooted(tse) is FALSE, calculateUnifrac randomly assigns a root and calculates the distance matrix. I am not sure if rbiom::unifrac does something similar but it's not logged in messages.

Contrarily, when we root the tree before calculating the matrices, the results is the same. Should I check in detail rbiom::unifrac

My comment was unclear, and I mixed things. No need to check rbiom::unifrac(). The unifrac tests tested initially that tree.name parameter worked. My point was to modify the test so that it still tests tree.name parameter. (It can be tested without merging datasets)

About the rooting, does set.seed() work to reproduce the behavior? With that, you could check that rooting is done correctly. (Also test that warning message is thrown)

Daenarys8 commented 6 days ago

Yeah in a way, set.seed() in this case makes sure we choose the same root each time we calculate the distance. The warning message is also thrown. However, the matrices can still be potentially different from that of rbiom::unifrac and I think it is because when we CalculateUnifrac of an objects' tree with no root, we change the root (e.g the new root could be the rowname starting from the point by set.seed()) rbiom::unifrac doesn't raise warnings if the tree is not rooted. Now the only concern is; is there a benchmark to very that the calculations are done correctly? (or specifically which calculation should be the correct one). when tree is parsed in functionality unrooted, we have the distance is same for both methods. So another question is; why do we need to root(or change) the tree during calculation, ?(this is a followup assuming that we are using rbiom::unifrac(doesn't change tree) as benchmark to very our calculations.)

TuomasBorman commented 6 days ago

Alright, thanks! The previous implementation of unifrac required rooted tree. We did not check if the case is also for rbiom::unifrac (apparently not). However, this is not related to this PR.

Can you create new PR that removes rooting in unifrac (should be simple)? You can probably keep the rooting function, if that is needed at some point. Link this PR to that since the discussion were made here.

antagomir commented 6 days ago

Btw is there any potential for problems if the tree is not rooted? i.e. should a warning be thrown

TuomasBorman commented 6 days ago

Btw is there any potential for problems if the tree is not rooted? i.e. should a warning be thrown

Should be investigated. User should be able to run rbiom::unifrac even though the tree is unrooted. I am not sure how the function handles unrooted trees. If the function roots the tree and if the process includes randomness, then I think it would be good to give warning. Otherwise I think warning is not required. Can you check @Daenarys8

Daenarys8 commented 6 days ago

rbiom does not root the tree. There's no warning message when calculating the distance with unrooted trees either.

Daenarys8 commented 6 days ago

e.g load some datasets

data(esophagus, package="mia")
data(GlobalPatterns, package="mia")

before changes on this branch, both methods return same distance in checks because;

> tse <- mergeSEs(GlobalPatterns, esophagus, assay.type="counts", missing.values = 0)
Merging with full join...
2/2
Adding rowTree(s)...
> is.rooted(rowTree(tse))
[1] TRUE

with new changes, we can have unrooted trees.

> tse <- mergeSEs(GlobalPatterns, esophagus, assay.type="counts", missing.values = 0)
Merging with full join...
2/2
Merging rowTree...
> is.rooted(rowTree(tse))
[1] FALSE
> unifrac_mia <- as.matrix(calculateUnifrac(tse, weighted = TRUE))
Warning message:
Randomly assigning root as -- 193852 -- in the phylogenetic tree in the data you provided. 

> unifrac_rbiom <- as.matrix(rbiom::unifrac(assay(tse), weighted = TRUE,
+                                           rowTree(tse)))

from above, rbiom takes unrooted tree and it does not change during calculation while mia roots tree if it is unrooted.

> expect_equal(unifrac_mia, unifrac_rbiom)
Error: `unifrac_mia` not equal to `unifrac_rbiom`.
110/841 mismatches (average diff: 0.000218)
[6]   0.567 - 0.566 == 4.63e-04
[7]   0.580 - 0.580 == 2.64e-06
[35]  0.605 - 0.604 == 4.63e-04
[36]  0.616 - 0.616 == 2.64e-06
[64]  0.552 - 0.551 == 4.64e-04
[65]  0.565 - 0.565 == 2.64e-06
[93]  0.707 - 0.706 == 4.64e-04
[94]  0.713 - 0.713 == 2.64e-06
[122] 0.718 - 0.717 == 4.64e-04
...

remove rooting of unrooted tree in calculatedUnifrac and the result is the same for both methods

> tse <- mergeSEs(GlobalPatterns, esophagus, assay.type="counts", missing.values = 0)
Merging with full join...
2/2
Merging rowTree...
> is.rooted(rowTree(tse))
[1] FALSE
> unifrac_mia <- as.matrix(calculateUnifrac(tse, weighted = TRUE))
> unifrac_rbiom <- as.matrix(rbiom::unifrac(assay(tse), weighted = TRUE,
+                                           rowTree(tse)))
> expect_equal(unifrac_mia, unifrac_rbiom)
>

TuomasBorman commented 6 days ago

In that case, I think the function should not give warnings and removing rooting is enough

codecov[bot] commented 6 days ago

Codecov Report

Attention: Patch coverage is 75.36232% with 17 lines in your changes missing coverage. Please review.

Please upload report for BASE (devel@3b1e907). Learn more about missing BASE report.

:exclamation: Current head 7ab34f0 differs from pull request most recent head c681778

Please upload reports for the commit c681778 to get more accurate results.

Files	Patch %	Lines
R/splitOn.R	42.85%	16 Missing :warning:
R/agglomerate.R	90.00%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## devel #588 +/- ## ======================================== Coverage ? 67.22% ======================================== Files ? 41 Lines ? 4973 Branches ? 0 ======================================== Hits ? 3343 Misses ? 1630 Partials ? 0 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

TuomasBorman commented 6 days ago

@Daenarys8 Can you check that these changes work https://github.com/microbiome/mia/pull/588/commits/3db5b1ddaefd252b2057c75c924dfa4b021c1df0? collapse.singles is passed through and splitOn and unsplitOn works when trees are merged?

I made some tests but would be good to have more.

Just manually test that everything works as expected

Daenarys8 commented 3 days ago

Yes, collapse.singles is passed. splitOn and unsplitOn both return list and tse respectively on a merged tse.

TuomasBorman commented 2 days ago

Yes, collapse.singles is passed. splitOn and unsplitOn both return list and tse respectively on a merged tse.

Thanks!

TuomasBorman commented 2 days ago

Ready after passing tests

microbiome / mia

Fix tree merging #588

Codecov Report