microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0
45 stars 25 forks source link

Use Rbiom unifrac implementation instead of runUnifrac #523

Closed thpralas closed 3 weeks ago

thpralas commented 2 months ago

This PR is related to issue https://github.com/microbiome/mia/issues/336. It replaces the function runUnifrac with the unifrac implementation from rbiom.

The rbiom implementation does not require to aggregate the community matrix based on nodeLab, thus the calculateUnifrac method for TreeSummarizedExperiment becomes simpler.

I have deleted the arguments transposed, BPPARAM, and normalized for now because they were used by runUnifrac and they are not handled by rbiom::unifrac. It may be possible to handle transposed while using the rbiom implementation though.

When running the following code on branch devel and this branch, I obtain the same unifrac values.

data("esophagus")
data("GlobalPatterns")
calculateUnifrac(esophagus)
calculateUnifrac(GlobalPatterns)
antagomir commented 2 months ago

Great! It might be good to summarize also in OMA the basis for selecting this particular implementation with just 1-3 sentences, when you update relevant part there.

thpralas commented 2 months ago

Without nodeLab parameter, calculateUnifrac could not handle TreeSummarizedExperiments with tip.label values differing from their rownames. Even though the function worked well on GlobalPatterns and esophagus TSEs, it did not work for aggregated TSEs for instance.

I selected the lines from the runUnifrac function that modify the community matrix and the tree with nodeLab parameter and moved them to calculateUnifrac. Now calculateUnifrac uses nodeLab parameter and rbiom::unifrac instead of using runUnifrac.

TuomasBorman commented 2 months ago

Looks nice! However, I think it would be simpler to modify runUnifrac function --> keep the beginning of the function and replace the end with rbiom::unifrac

Is that possible?

TuomasBorman commented 1 month ago

One thing to note. Our (fast) Unifrac implementation includes two versions of weighted unifrac; the other one is normalized and the other not. In this context, normalization refers to adjusting weight difference between sample A and B in edge x with total weights in sample A and B in edge x. https://github.com/microbiome/mia/blob/b627edce620807af20d3ed85c6667f4b8ef8f2ea/R/calculateUnifrac.R#L446

rbiom implementation calculates non-normalized weighted unifrac https://github.com/cmmr/rbiom/blob/64f1f12839b32406ff99e0b36bbb271913259295/alt/rcpp_unifrac.cpp#L154

This means that we cannot just replace our implementation with rbiom, since they are not exactly same. One option is to add hidden fast = FALSE parameter that controls whether to call rbiom::unifrac or our own implementation.

TuomasBorman commented 1 month ago

@antagomir

antagomir commented 1 month ago

That's a good point.

Would it be good to check what options are available for the other one, too, and choose the optimal one if there are relevant differences?

antagomir commented 1 month ago

It should be also clear for user if they use the weighted or unweighted version. Ideally, both will be available.

TuomasBorman commented 1 month ago

Weighted and normalized are different things. We have now

rbiom::unifrac is written with c++ which makes it fast to compute

TuomasBorman commented 1 month ago

We have this https://www.nature.com/articles/ismej200997 And rbiom has this https://journals.asm.org/doi/10.1128/aem.71.12.8228-8235.2005

antagomir commented 1 month ago

Ok.. should we then provide all available options by using arguments weighted (TRUE/FALSE) and normalized (TRUE/FALSE)? Then the wrapper will internally choose the right method for the given combination (and throw warning for unweighted, unnormalized unless that exists).

What do the published benchmarks suggest about a good default? Ideally the default would be fast as well, if the performance is not too much compromised.

TuomasBorman commented 1 month ago

At least this PR https://github.com/joey711/phyloseq/issues/956 that you linked before shows that rbiom gives similar results than others (only phyloseq differs). phyloseq uses fast unifrac. I assume that Felix used phyloseq as a template for our implementation.

It should be checked that rbiom::unifrac gives similar results than our old fast unifrac implementation. (@thpralas can you check when time allows, not top priority).

I tend to think that fast = FALSE might be good solution. If FALSE, then rbiom::unifrac is called. Otherwise the function calls our fast unifrac implementation that could have normalized = TRUE as hidden parameter. That would be simple solution, I think

antagomir commented 1 month ago

In that link phyloseq seems to yield systematically different methods than others. We can do more benchmarks but I am not sure if that is needed at this point. If those are done and this is still being considered, then the unit tests should cover many cases (aggreated ranks, trees with even & odd number of nodes and perhaps other nonconventional features).

Not sure if I follow the reasoning for having the argument fast instead or weighted & normalized as the latter ones are more direct descriptions of the type of algorithm, and if rbiom is also rather fast and includes the normalized=TRUE as an option.

If the implementation of fast unifrac derived from phyloseq is not reliable based on earlier benchmarks, then what would be the reasoning for including it with the option fast=TRUE?

TuomasBorman commented 1 month ago

1.

The reasoning for having fast instead of weighted and & normalized is that they are different algorithms. Also if we decide that rbiom is used if wieghted=TRUE & normalized=FALSE is specified, then user cannot choose to use fast algorithm anymore.

2.

If we decide to switch torbiom, it means that we drop off some of the functionality without deprecation. I am not sure how widely the wieghted=TRUE & normalized=TRUE option is used.

antagomir commented 1 month ago
  1. if the fast algorithm yields the same results than slow algorithm then the only thing with practical importance is the choices on normalization & weighting? Why should we support the slow option in the first place?
  2. it would be good to have also the other options available. Is it because of too many dependencies that they should be dropped immediately then?
TuomasBorman commented 1 month ago
  1. The only thing is that our implementation gives possibility to choose whether to normalize or not. rbiom does not have normalized argument. It always calculates unnormalized weighted unifrac or normalized unweighted unifrac.

  2. No, but then we need option to choose between these algorithms. They are not working exactly the same, and they give different results apparently. So there should be option to choose between these algorithms.

User might want to calculate unnormalized weighted unifrac with our implementation rather than rbiom.

TuomasBorman commented 1 month ago

We decided to completely switch using rbiom::unifrac, and remove our own implementation. Our own implementation is based on phyloseq implementation of Fast Unifrac, and phyloseq's implementation has been showed to give different results than other implementations (see the discussion above).

This means that normalized option is not supported anymore (it is removed without deprecation). I think the code is finished, but tests are failing because they are testing that unifrac values equal to values that are calculated based on phyloseq's implementation. Moreover, tests include tests for normalized option which is now removed.

@thpralas Could you be able to fix the unit tests. Test that calculations match with manually calculated values (use rbiom::unifrac manually).

TuomasBorman commented 1 month ago

Tests fail because weighted rbiom::unifrac gives different result every time (<1e-5). Check why and add tolerance to tests

TuomasBorman commented 1 month ago

It seems that rownames does not atch with tip labels. They should match and can be matched. Check if that is the case by debugging the lines just before rbiom::unifrac call

TuomasBorman commented 1 month ago

Also, check what is wrong with the other error (that is not even related to this PR?)

TuomasBorman commented 3 weeks ago

Alright, now the tests should be ok. I believe the stochastic nature comes from multithreading and complex tree (loops); distances between tips can be calculated multiple ways which slightly differs. After pruning, the results are the same every time.