rmcar17 / SpectralClusterSupertree

A supertree method for rooted source trees

BSD 3-Clause "New" or "Revised" License

Bad Clade Deletion #17

Closed rmcar17 closed 11 months ago

rmcar17 commented 1 year ago

Another supertree paper that I found. Tandy Warnow later published an overview of supertree methods and said this one was pretty good (I need to track that overview down again).

Paper: https://academic.oup.com/mbe/article/34/9/2408/3925278?login=false

It also contains more datasets and performs some experiments which look relatively thorough. We could aim to replicate something like that in the paper.

rmcar17 commented 1 year ago

Finished setting it up, though it appears to be semi-consistently generating supertrees with missing tips. Does it do that when there is not enough overlap between the source trees?

rmcar17 commented 1 year ago

The above was an issue with my code. The supertrees had all the tips, just an unexpected number of internal nodes at times (they are multifurcating).
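A quick way to tell these two cases apart is to compare tip sets and count internal nodes: a rooted binary tree with n tips has exactly n - 1 internal nodes, so fewer than that means multifurcations. Below is a minimal sketch, assuming plain Newick strings with unquoted, unique tip labels. The function names are hypothetical and not part of SCS or BCD.

```python
import re

# A tip label directly follows "(" or ","; internal labels and
# branch lengths follow ")" or ":" and are therefore not matched.
_TIP = re.compile(r"[(,]\s*([^(),:;\s]+)")


def tip_names(newick: str) -> set:
    """Return the set of tip labels in an unquoted Newick string."""
    return set(_TIP.findall(newick))


def internal_node_count(newick: str) -> int:
    """Each "(" opens exactly one internal node (the root included)."""
    return newick.count("(")


def check_supertree(supertree: str, source_trees: list) -> dict:
    """Report tips missing from the supertree and whether it is binary."""
    expected = set().union(*(tip_names(t) for t in source_trees))
    found = tip_names(supertree)
    return {
        "missing_tips": expected - found,
        # A rooted binary tree with n tips has n - 1 internal nodes.
        "is_binary": internal_node_count(supertree) == len(found) - 1,
    }


sources = ["((a,b),c);", "((c,d),e);"]
print(check_supertree("((a,b),c,d,e);", sources))
```

Here the example supertree contains every tip but has only two internal nodes for five tips, so it is reported as non-binary rather than as missing taxa.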

rmcar17 commented 1 year ago

BCD ends up scaling quite well for large numbers of taxa, given a relatively low number of input trees. The image below is for 1,800 taxa with on the order of tens of input trees, and BCD performs better than SCS there. However, it does not appear to scale well for large numbers of input trees. I have a folder of birth-death trees with 2,000 taxa (hundreds of source trees): SCS was able to handle it relatively quickly, whereas BCD didn't return anything within approximately an hour.
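For comparisons like this, where one method may not return within the hour, a wall-clock timer with a hard timeout keeps the runs bounded. Below is a minimal sketch using only the standard library. The two commands are stand-ins (assumptions), since the actual SCS and BCD invocations are not given in this thread.

```python
import subprocess
import sys
import time
from typing import Optional


def time_command(cmd: list, timeout_s: float) -> Optional[float]:
    """Return wall-clock seconds taken by cmd, or None if it timed out."""
    start = time.perf_counter()
    try:
        subprocess.run(
            cmd,
            check=True,  # raise if the command exits with an error
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return time.perf_counter() - start


# Stand-in commands; swap in the real supertree invocations here.
quick = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]

print(time_command(quick, timeout_s=30))   # elapsed seconds as a float
print(time_command(slow, timeout_s=0.5))   # None, the timeout was hit
```

Recording `None` for a timed-out run, rather than the timeout value itself, keeps "did not finish" distinct from "finished slowly" when the results are tabulated later.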

rmcar17 commented 1 year ago

Part of the reason for BCD's performance is its use of GSCM as a preprocessing step, which obtains a rough supertree that the method then refines. When GSCM is disabled, BCD usually runs faster but with a much worse matching distance. (Is GSCM the cause, or part of the cause, of BCD's issues with very many trees?) Currently setting up an experiment for 10,000 taxa with source trees of 1,000-2,000 taxa in size.

rmcar17 commented 1 year ago

On second thought, given memory limitations, perhaps 4,000.

rmcar17 commented 12 months ago

See #19 for more results comparing SCS and BCD after adding branch lengths.

Finished setting up the other datasets. Which one performs better seems largely dependent on the dataset itself, in all honesty. For the datasets with up to 1,000 taxa, BCD outperforms SCS in terms of time. However, for the dataset with up to 10,000 taxa (5,500 on average) and approximately 500 trees per test, SCS finishes in a matter of minutes, whereas I've been waiting over an hour for BCD to terminate and it is still going.

rmcar17 commented 12 months ago

I think the number of trees is playing a significant role here.
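To check that hunch against the datasets, it helps to tabulate the number of source trees next to the taxon counts for each test. Below is a rough sketch, assuming each source tree is available as a plain Newick string with unquoted tip labels. The helper name is hypothetical.

```python
import re
from statistics import mean

# A tip label directly follows "(" or "," in an unquoted Newick string.
_TIP = re.compile(r"[(,]\s*([^(),:;\s]+)")


def summarise_sources(newicks: list) -> dict:
    """Count the source trees, distinct taxa and mean tips per tree."""
    sizes = []
    taxa = set()
    for nwk in newicks:
        tips = set(_TIP.findall(nwk))
        sizes.append(len(tips))
        taxa |= tips
    return {"trees": len(sizes), "taxa": len(taxa), "mean_tips": mean(sizes)}


print(summarise_sources(["((a,b),c);", "((b,c),d);"]))
```

Plotting runtime against `trees` and against `taxa` separately would show which of the two drives the slowdown.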

rmcar17 commented 12 months ago

Re-reading the paper, the best parameterisations for BCD on this dataset took 2-16 hours to solve each task, and the worst took almost 3 days. I wonder how the distance from the generated trees to the model compares for BCD vs SCS given the time difference, assuming BCD finishes running (and the matching distance is relatively computationally expensive as well).

rmcar17 commented 12 months ago

And 45 minutes later, BCD is still running.