Closed thpralas closed 3 weeks ago
Great! It might be good to summarize also in OMA the basis for selecting this particular implementation with just 1-3 sentences, when you update relevant part there.
Without nodeLab
parameter, calculateUnifrac
could not handle TreeSummarizedExperiments with tip.label
values differing from their rownames. Even though the function worked well on GlobalPatterns
and esophagus
TSEs, it did not work for aggregated TSEs for instance.
I selected the lines from the runUnifrac
function that modify the community matrix and the tree with nodeLab parameter and moved them to calculateUnifrac
. Now calculateUnifrac
uses nodeLab
parameter and rbiom::unifrac
instead of using runUnifrac.
Looks nice! However, I think it would be simpler to modify runUnifrac function --> keep the beginning of the function and replace the end with rbiom::unifrac
Is that possible?
One thing to note. Our (fast) Unifrac implementation includes two versions of weighted unifrac; the other one is normalized and the other not. In this context, normalization refers to adjusting weight difference between sample A and B in edge x with total weights in sample A and B in edge x. https://github.com/microbiome/mia/blob/b627edce620807af20d3ed85c6667f4b8ef8f2ea/R/calculateUnifrac.R#L446
rbiom implementation calculates non-normalized weighted unifrac https://github.com/cmmr/rbiom/blob/64f1f12839b32406ff99e0b36bbb271913259295/alt/rcpp_unifrac.cpp#L154
This means that we cannot just replace our implementation with rbiom, since they are not exactly same. One option is to add hidden fast = FALSE parameter that controls whether to call rbiom::unifrac or our own implementation.
@antagomir
That's a good point.
Would it be good to check what options are available for the other one, too, and choose the optimal one if there are relevant differences?
It should be also clear for user if they use the weighted or unweighted version. Ideally, both will be available.
Weighted and normalized are different things. We have now
rbiom::unifrac is written with c++ which makes it fast to compute
We have this https://www.nature.com/articles/ismej200997 And rbiom has this https://journals.asm.org/doi/10.1128/aem.71.12.8228-8235.2005
Ok.. should we then provide all available options by using arguments weighted
(TRUE/FALSE) and normalized
(TRUE/FALSE)? Then the wrapper will internally choose the right method for the given combination (and throw warning for unweighted, unnormalized unless that exists).
What do the published benchmarks suggest about a good default? Ideally the default would be fast as well, if the performance is not too much compromised.
At least this PR https://github.com/joey711/phyloseq/issues/956 that you linked before shows that rbiom gives similar results than others (only phyloseq differs). phyloseq uses fast unifrac. I assume that Felix used phyloseq as a template for our implementation.
It should be checked that rbiom::unifrac
gives similar results than our old fast unifrac implementation. (@thpralas can you check when time allows, not top priority).
I tend to think that fast = FALSE
might be good solution. If FALSE
, then rbiom::unifrac
is called. Otherwise the function calls our fast unifrac implementation that could have normalized = TRUE
as hidden parameter. That would be simple solution, I think
In that link phyloseq seems to yield systematically different methods than others. We can do more benchmarks but I am not sure if that is needed at this point. If those are done and this is still being considered, then the unit tests should cover many cases (aggreated ranks, trees with even & odd number of nodes and perhaps other nonconventional features).
Not sure if I follow the reasoning for having the argument fast
instead or weighted
& normalized
as the latter ones are more direct descriptions of the type of algorithm, and if rbiom is also rather fast and includes the normalized=TRUE as an option.
If the implementation of fast unifrac derived from phyloseq is not reliable based on earlier benchmarks, then what would be the reasoning for including it with the option fast=TRUE
?
1.
The reasoning for having fast
instead of weighted
and & normalized
is that they are different algorithms. Also if we decide that rbiom
is used if wieghted=TRUE & normalized=FALSE
is specified, then user cannot choose to use fast algorithm anymore.
2.
If we decide to switch torbiom
, it means that we drop off some of the functionality without deprecation. I am not sure how widely the wieghted=TRUE & normalized=TRUE
option is used.
The only thing is that our implementation gives possibility to choose whether to normalize or not. rbiom does not have normalized
argument. It always calculates unnormalized weighted unifrac or normalized unweighted unifrac.
No, but then we need option to choose between these algorithms. They are not working exactly the same, and they give different results apparently. So there should be option to choose between these algorithms.
User might want to calculate unnormalized weighted unifrac with our implementation rather than rbiom.
We decided to completely switch using rbiom::unifrac, and remove our own implementation. Our own implementation is based on phyloseq implementation of Fast Unifrac, and phyloseq's implementation has been showed to give different results than other implementations (see the discussion above).
This means that normalized
option is not supported anymore (it is removed without deprecation). I think the code is finished, but tests are failing because they are testing that unifrac values equal to values that are calculated based on phyloseq's implementation. Moreover, tests include tests for normalized
option which is now removed.
@thpralas Could you be able to fix the unit tests. Test that calculations match with manually calculated values (use rbiom::unifrac
manually).
Tests fail because weighted rbiom::unifrac gives different result every time (<1e-5). Check why and add tolerance to tests
It seems that rownames does not atch with tip labels. They should match and can be matched. Check if that is the case by debugging the lines just before rbiom::unifrac call
Also, check what is wrong with the other error (that is not even related to this PR?)
Alright, now the tests should be ok. I believe the stochastic nature comes from multithreading and complex tree (loops); distances between tips can be calculated multiple ways which slightly differs. After pruning, the results are the same every time.
This PR is related to issue https://github.com/microbiome/mia/issues/336. It replaces the function
runUnifrac
with the unifrac implementation from rbiom.The rbiom implementation does not require to aggregate the community matrix based on
nodeLab
, thus thecalculateUnifrac
method for TreeSummarizedExperiment becomes simpler.I have deleted the arguments
transposed
,BPPARAM,
andnormalized
for now because they were used byrunUnifrac
and they are not handled byrbiom::unifrac
. It may be possible to handletransposed
while using the rbiom implementation though.When running the following code on branch devel and this branch, I obtain the same unifrac values.