ms609 / TreeTools

Create, modify and analyse phylogenetic trees in R
https://ms609.github.io/TreeTools/

Error: This many leaves cannot be supported #141

Open noranekonobokkusu opened 1 year ago

noranekonobokkusu commented 1 year ago

Hi, I am trying to measure the distance between two trees, and I get this error message:

    > TreeDistance(t1, t2)
    Error: This many leaves cannot be supported. Please contact the TreeTools maintainer if you need to use more!

I tried decreasing the number of tips to 4096 (as mentioned in some part of the TreeDist manual), but I still get this error. Is there a workaround for this, and how many tips are allowed by default? I cannot find it in the documentation. Thank you!

ms609 commented 1 year ago

Thanks for getting in touch. The present limit is 2048 leaves (which I'll document); I'm looking into a workaround but it's less straightforward than I'd hoped. I'll post an update once I get somewhere with this.

ms609 commented 1 year ago

To calculate distances between trees with <8192 tips, you can now:

  1. Uninstall TreeDist and TreeTools

    remove.packages("TreeDist")
    remove.packages("TreeTools")

    • Check the console output to be sure that the packages are fully uninstalled.
  2. Install a modified TreeTools

    devtools::install_github("ms609/TreeTools", ref = "more-leaves")

  3. Re-install TreeDist from source

    devtools::install_github("ms609/TreeDist", ref = "more-leaves")

    • Do not use install.packages("TreeDist"), which installs pre-compiled binaries that will not link to the customized TreeTools.

Note that distance computation scales with the square of the number of tips. As a result, comparing two 8000-leaf trees will take a couple of minutes.
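
As a rough illustration of the quadratic scaling noted above, the sketch below estimates how much more work a larger tree needs relative to a smaller one. The function name `relative_cost` is hypothetical and is not part of TreeTools or TreeDist; it only encodes the assumption that cost is proportional to the square of the number of tips.

```cpp
#include <cstddef>

// Hypothetical helper, not part of TreeTools or TreeDist.
// Assumes cost is proportional to n^2, as stated in the note above.
// Returns how many times more work n_tips requires than base_tips.
constexpr double relative_cost(std::size_t n_tips, std::size_t base_tips) {
    const double ratio =
        static_cast<double>(n_tips) / static_cast<double>(base_tips);
    return ratio * ratio;
}
```

Under that assumption, `relative_cost(8192, 2048)` evaluates to 16: quadrupling the number of tips costs roughly sixteen times as much.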

I've updated the documentation with this information. Please let me know how you get on; I had a bit of trouble getting this running locally, but hopefully the above instructions will avoid these problems.

noranekonobokkusu commented 1 year ago

Hi Martin,

thanks a lot for such a rapid reply! It aborts my RStudio session the moment I run this command now 😅 But I guess that means I did re-install it successfully and this will work on a computational cluster!

ms609 commented 1 year ago

Drat – this is the issue I was running into as well. My diagnosis was that the crash occurred when the modified TreeTools was reinstalled without uninstalling and re-installing TreeDist. Could you confirm that you uninstalled both packages before installing both from source, using install_github()? I'll also be interested to hear whether it runs successfully on a cluster!

noranekonobokkusu commented 1 year ago

I can confirm I did all that. When I try running it from the command line, I get:

    > TreeDistance(t_large, t_large)
    Error: segfault from C stack overflow

This happens even for two trees with 10 leaves each!

On a cluster, it works with 8GB (which is less than on my laptop) for 8,000 leaves 🤔

ms609 commented 1 year ago

Weird – sorry it's not proving straightforward! I've reproduced this issue on a second PC. My suspicion is that this is related to the (un)installation of the packages. I'll investigate.

ms609 commented 1 year ago

Okay, I think I've got to the bottom of the issue: the stack overflow error should be taken literally. There is not enough space on the stack to create the two SplitList objects of the required dimensions that are used to compute the distances.

In summary, this means that a significant re-coding will be required for larger trees to be handled – and that the computation for larger trees will be significantly slower (as it will need to make more use of the heap, rather than fast stack memory). That's a bigger job than I am able to attempt right now. Sorry.
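
To make the stack-space argument above concrete, here is a minimal sketch of how the footprint of a fixed-size split list grows with the compile-time leaf limit. The type `FixedSplitList`, the function `stack_bytes_for_pair`, and the layout (one bit per leaf, packed into 64-bit words) are assumptions for illustration, not TreeTools' actual definitions. The code only computes sizes; it never creates such an object.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout, for illustration only; names and dimensions are
// assumptions, not TreeTools' actual definitions.
// Each split stores one bit per leaf, packed into 64-bit words.
template <std::size_t MaxTips>
struct FixedSplitList {
    // Number of 64-bit words needed to hold one bit per leaf.
    static constexpr std::size_t bins = (MaxTips + 63) / 64;
    // An unrooted binary tree with n leaves has n - 3 non-trivial splits.
    static constexpr std::size_t max_splits = MaxTips - 3;
    // Dimensions are fixed at compile time, whatever tree is supplied.
    std::uint64_t state[max_splits][bins];
};

// Bytes of stack needed to hold two such objects as local variables.
template <std::size_t MaxTips>
constexpr std::size_t stack_bytes_for_pair() {
    return 2 * sizeof(FixedSplitList<MaxTips>);
}
```

With these assumed dimensions, `stack_bytes_for_pair<2048>()` is about 1 MB and `stack_bytes_for_pair<8192>()` is about 16 MB. Default stack limits are commonly in the range of 1 to 8 MB, so the larger figure would not fit.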


noranekonobokkusu commented 1 year ago

What I still don't understand is why it stopped working locally even for two tiny trees with 10 leaves each but actually worked on a cluster for a huge tree.

Thanks a lot for looking into this anyhow!

ms609 commented 1 year ago

A fixed amount of memory is allocated as soon as the underlying C++ function is called; because this is allocated on the stack, the amount of memory to allocate is pre-determined and is independent of the variables actually passed. So whatever size of tree is passed, the software requests enough stack memory to compare two 8192-leaf trees.

Differences between a local PC and a cluster will reflect how much memory is available on the stack, which will reflect aspects of memory management that are context-dependent: for instance, I see a crash when using RStudio, but not when running a standalone R session, presumably because Windows allocates memory differently in these contexts.
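
The contrast between a fixed allocation and one sized from the input can be sketched as follows. Both types are hypothetical and are not taken from TreeTools; `kMaxTips` stands in for the compile-time limit, and the bit-packed layout is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch; not TreeTools' actual code.
constexpr std::size_t kMaxTips = 8192;           // assumed compile-time limit
constexpr std::size_t kMaxBins = kMaxTips / 64;  // 64-bit words per split

// Fixed layout: space is reserved for kMaxTips leaves regardless of the
// tree that is actually passed in.
struct FixedSplits {
    std::uint64_t state[kMaxTips - 3][kMaxBins];
};

// Heap-backed layout: storage is sized at run time from the real tip count.
struct DynamicSplits {
    std::size_t bins;                  // 64-bit words per split
    std::vector<std::uint64_t> state;  // (n_tips - 3) * bins words on the heap

    explicit DynamicSplits(std::size_t n_tips)
        : bins((n_tips + 63) / 64),
          state((n_tips > 3 ? n_tips - 3 : 0) * bins, 0) {}

    // Bytes of split data actually held for this tree.
    std::size_t bytes() const { return state.size() * sizeof(std::uint64_t); }
};
```

In this sketch `sizeof(FixedSplits)` is about 8 MB whether the tree has 10 leaves or 8192, whereas `DynamicSplits(10).bytes()` is only 56. The trade-off, as noted earlier in the thread, is that heap-based storage is expected to be slower.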

pterzian commented 11 months ago

Hi Martin,

I've been trying to compare two trees with around 5k leaves (both have the same number of leaves) but I couldn't get past the error: "This many leaves cannot be supported. Please contact the TreeTools maintainer if you need to use more!" I first tried following the above process (uninstalling previous versions) as you suggest, but it still gives me the error on my local computer. Then I built a fresh R conda environment on a remote server with more resources, but I still get the same error. Any idea what could cause this issue?

Also, thanks a lot for your tools, they have been very useful so far! Paul.

ms609 commented 11 months ago

Glad you have been finding the tools useful, @pterzian. It's not clear why you would be seeing the "This many leaves" error with ~5000 leaves if you are using the BigTreeTools and BigTreeDist packages. It may be worth checking that you are using the functions from these modified packages (which have different names, so need loading with e.g. library("BigTreeDist")) rather than from TreeDist.

pterzian commented 11 months ago

You are absolutely right! I saw that the BigTreeTools package was installed. However, I don't see any BigTreeDist package. Should it be installed by the devtools::install_github("ms609/TreeDist") command?

Checking the conda logs:

ms609 commented 11 months ago

Looks like the ref = "more-leaves" argument is missing from your second command. (Note updated above.)