Closed TuomasBorman closed 2 days ago
Unit tests should be updated since they expect that multiple trees are returned
@Daenarys8 Can you check unit tests?
The following changes have been made.
1- The test expect_warning(estimateDiversity(tse, index = c("shannon", "faith")))
is removed as it is no longer valid. Before, one of the rowTree did not include all the rows which resulted in a warning, but now the tree is combined, so we remove the test.
2- In merge tests, as a result of combined tree, expect_equal(rowTree(merged), rowTree(merged3))
and expect_true( rowTree(merged, "phylo.1")$Nnode < rowTree(merged2, "phylo.1")$Nnode )
are invalid as one large tree is returned compared to multiple trees added in object as before.
3- In mergeSE tests, left
and right
join won't merge with different rowLinks. Could be a bug but I'm not sure. Also, failing test of nature expect_equal( rowTree(tse), rowTree(tse1))
become invalid as returned tree used for comparison is one large combined tree compared to multiple added trees before change.
We also get warning when multiple nodes are found to have the same label.
The following changes have been made. 1- The test
expect_warning(estimateDiversity(tse, index = c("shannon", "faith")))
is removed as it is no longer valid. Before, one of the rowTree did not include all the rows which resulted in a warning, but now the tree is combined, so we remove the test.2- In merge tests, as a result of combined tree,
expect_equal(rowTree(merged), rowTree(merged3))
andexpect_true( rowTree(merged, "phylo.1")$Nnode < rowTree(merged2, "phylo.1")$Nnode )
are invalid as one large tree is returned compared to multiple trees added in object as before.3- In mergeSE tests,
left
andright
join won't merge with different rowLinks. Could be a bug but I'm not sure. Also, failing test of natureexpect_equal( rowTree(tse), rowTree(tse1))
become invalid as returned tree used for comparison is one large combined tree compared to multiple added trees before change. We also get warning when multiple nodes are found to have the same label.
3 is a bug. Can you investigate the reason? I can also check later
I fixed the bug. I forgot to order the link data to match TreeSE.
Multiple rows can be linked to single node in tree, so that warning is not a problem.
Can you create unit tests that check that the tree merging is done correctly. The tree includes the nodes of input trees (those that correspond rows). Also you could check that the tree is pruned so that each tip should be found from rows.
Btw, as there are those warnings ("multiple nodes..."), can you add expect_warning() to those?
After merging, rows do not match with tips.
I think I have a problemin calculateUnifrac
It's test won't run and tend to crash my rstudio. This goes same for rbiom::unifrac()
.
Apparently it is bug in rstudio.
The test to calculate unweighted unifrac is failing because:
Good catch, I will be back to work on Tuesday, let's check that then
Concerning this random allocation of root, what is the implication if we specify the rowname? it will fix the error but I am not sure of the bugs it will create.
Concerning this random allocation of root, what is the implication if we specify the rowname? it will fix the error but I am not sure of the bugs it will create.
The problem is not directly related to tree merging. When trees are merged, the root is added if the trees being merged have roots. I have to still check this tree binding
I found a problem with ape::bind_tree. It does not merge the trees, it just binds them. The following illustrates the problem
data("GlobalPatterns")
tse <- GlobalPatterns[1:10, ]
tse2 <- GlobalPatterns[11:20, ]
# Add hiearchy trees
tree1 <- getHierarchyTree(tse)
tree2 <- getHierarchyTree(tse2)
# Bind
tree <- ape::bind.tree(tree1, tree2)
plot(tree, show.node.label = TRUE)
As you can see, certain phylum is there twice.
The latest commit takes this into account. After binding, the tree is merged based on labels. Nodes with same names are interpreted as same, and they are combined.
One additional thing that I found was that .prune_tree prunes singletons, i.e. nodes that have only one descendant node.
Below is same tree without singletons
tree |> ape::collapse.singles() |> plot(show.node.labels = TRUE)
Removing singletons might be harmful sometimes. If a node that links to certain row is dropped, that removes also the row, when the tree is put back to TreeSE. However, if singletons are not dropped while pruning, there might be lots of extra nodes (that should be pruned?).
I will think about this but do you @antagomir have idea for this? Should singletons be dropped when tree is pruned?
Concerning this random allocation of root, what is the implication if we specify the rowname? it will fix the error but I am not sure of the bugs it will create.
Related to this unit test. Can you modify the test so that you just check that the tree.name parameter (if that was the parameter name) works. You can rename the rowTree in TreeSE and the test. No need to merge TreeSEs
I meant when is.rooted(tse) is FALSE, calculateUnifrac
randomly assigns a root and calculates the distance matrix. I am not sure if rbiom::unifrac
does something similar but it's not logged in messages.
Contrarily, when we root the tree before calculating the matrices, the results is the same.
Should I check in detail rbiom::unifrac
I meant when is.rooted(tse) is FALSE,
calculateUnifrac
randomly assigns a root and calculates the distance matrix. I am not sure ifrbiom::unifrac
does something similar but it's not logged in messages.Contrarily, when we root the tree before calculating the matrices, the results is the same. Should I check in detail
rbiom::unifrac
My comment was unclear, and I mixed things. No need to check rbiom::unifrac(). The unifrac tests tested initially that tree.name parameter worked. My point was to modify the test so that it still tests tree.name parameter. (It can be tested without merging datasets)
About the rooting, does set.seed() work to reproduce the behavior? With that, you could check that rooting is done correctly. (Also test that warning message is thrown)
Yeah in a way, set.seed()
in this case makes sure we choose the same root each time we calculate the distance. The warning message is also thrown.
However, the matrices can still be potentially different from that of rbiom::unifrac
and I think it is because when we CalculateUnifrac
of an objects' tree with no root, we change the root (e.g the new root could be the rowname
starting from the point by set.seed()
)
rbiom::unifrac doesn't raise warnings if the tree is not rooted. Now the only concern is; is there a benchmark to very that the calculations are done correctly? (or specifically which calculation should be the correct one).
when tree is parsed in functionality unrooted, we have the distance is same for both methods. So another question is; why do we need to root(or change) the tree during calculation, ?(this is a followup assuming that we are using rbiom::unifrac
(doesn't change tree) as benchmark to very our calculations.)
Alright, thanks! The previous implementation of unifrac required rooted tree. We did not check if the case is also for rbiom::unifrac (apparently not). However, this is not related to this PR.
Can you create new PR that removes rooting in unifrac (should be simple)? You can probably keep the rooting function, if that is needed at some point. Link this PR to that since the discussion were made here.
Btw is there any potential for problems if the tree is not rooted? i.e. should a warning be thrown
Btw is there any potential for problems if the tree is not rooted? i.e. should a warning be thrown
Should be investigated. User should be able to run rbiom::unifrac even though the tree is unrooted. I am not sure how the function handles unrooted trees. If the function roots the tree and if the process includes randomness, then I think it would be good to give warning. Otherwise I think warning is not required. Can you check @Daenarys8
rbiom does not root the tree. There's no warning message when calculating the distance with unrooted trees either.
e.g load some datasets
data(esophagus, package="mia")
data(GlobalPatterns, package="mia")
before changes on this branch, both methods return same distance in checks because;
> tse <- mergeSEs(GlobalPatterns, esophagus, assay.type="counts", missing.values = 0)
Merging with full join...
2/2
Adding rowTree(s)...
> is.rooted(rowTree(tse))
[1] TRUE
with new changes, we can have unrooted trees.
> tse <- mergeSEs(GlobalPatterns, esophagus, assay.type="counts", missing.values = 0)
Merging with full join...
2/2
Merging rowTree...
> is.rooted(rowTree(tse))
[1] FALSE
> unifrac_mia <- as.matrix(calculateUnifrac(tse, weighted = TRUE))
Warning message:
Randomly assigning root as -- 193852 -- in the phylogenetic tree in the data you provided.
> unifrac_rbiom <- as.matrix(rbiom::unifrac(assay(tse), weighted = TRUE,
+ rowTree(tse)))
from above, rbiom takes unrooted tree and it does not change during calculation while mia roots tree if it is unrooted.
> expect_equal(unifrac_mia, unifrac_rbiom)
Error: `unifrac_mia` not equal to `unifrac_rbiom`.
110/841 mismatches (average diff: 0.000218)
[6] 0.567 - 0.566 == 4.63e-04
[7] 0.580 - 0.580 == 2.64e-06
[35] 0.605 - 0.604 == 4.63e-04
[36] 0.616 - 0.616 == 2.64e-06
[64] 0.552 - 0.551 == 4.64e-04
[65] 0.565 - 0.565 == 2.64e-06
[93] 0.707 - 0.706 == 4.64e-04
[94] 0.713 - 0.713 == 2.64e-06
[122] 0.718 - 0.717 == 4.64e-04
...
remove rooting of unrooted tree in calculatedUnifrac and the result is the same for both methods
> tse <- mergeSEs(GlobalPatterns, esophagus, assay.type="counts", missing.values = 0)
Merging with full join...
2/2
Merging rowTree...
> is.rooted(rowTree(tse))
[1] FALSE
> unifrac_mia <- as.matrix(calculateUnifrac(tse, weighted = TRUE))
> unifrac_rbiom <- as.matrix(rbiom::unifrac(assay(tse), weighted = TRUE,
+ rowTree(tse)))
> expect_equal(unifrac_mia, unifrac_rbiom)
>
In that case, I think the function should not give warnings and removing rooting is enough
Attention: Patch coverage is 75.36232%
with 17 lines
in your changes missing coverage. Please review.
Please upload report for BASE (
devel@3b1e907
). Learn more about missing BASE report.:exclamation: Current head 7ab34f0 differs from pull request most recent head c681778
Please upload reports for the commit c681778 to get more accurate results.
Files | Patch % | Lines |
---|---|---|
R/splitOn.R | 42.85% | 16 Missing :warning: |
R/agglomerate.R | 90.00% | 1 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@Daenarys8 Can you check that these changes work https://github.com/microbiome/mia/pull/588/commits/3db5b1ddaefd252b2057c75c924dfa4b021c1df0? collapse.singles is passed through and splitOn and unsplitOn works when trees are merged?
I made some tests but would be good to have more.
Just manually test that everything works as expected
Yes, collapse.singles is passed. splitOn and unsplitOn both return list and tse respectively on a merged tse.
Yes, collapse.singles is passed. splitOn and unsplitOn both return list and tse respectively on a merged tse.
Thanks!
Ready after passing tests
1.
mergeSEs did not merge trees, it just added the trees to rowTree slot. If the datasets included multiple trees that did not share taxa, the output included multiple trees. That is not what user expects when the dataset is merged. Now the trees are merged into single tree.
2.
splitOn added hiereachy tree instead of merging the trees when update.tree = TRUE. That is also not what user expects. Now, the trees are combined (or if all taxa is shared between trees, single unchanged tree is outputted).
3.
I also fixed a bug in unsplitOn() function. If the objects being merged did not have names, this lead to an error in rowData merging step.