Closed rmcar17 closed 11 months ago
Finished setting it up - though it appears to be semi-consistently generating supertrees with missing tips? Does it do that when there is not enough overlap?
The above was an issue with my code: the supertree had all the tips, just an unexpected number of internal nodes at times (it was multifurcating).
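For anyone hitting the same confusion, here is a rough sketch of a sanity check that separates "missing tips" from "multifurcating". It is not code from this repo: `count_nodes` and `is_fully_resolved` are hypothetical helper names, and it assumes plain Newick strings with no quoted labels or comments containing `(` or `,`. It relies on the fact that a rooted, fully bifurcating tree with n tips has exactly n - 1 internal nodes, so fewer internal nodes means at least one multifurcation.

```python
def count_nodes(newick: str) -> tuple[int, int]:
    """Return (tips, internal_nodes) for a simple Newick string.

    Assumption: labels are unquoted and contain no '(' or ','.
    Every '(' opens one internal node; an internal node with k children
    contributes k - 1 commas, so the commas sum to tips - 1 overall.
    """
    internal = newick.count("(")
    tips = newick.count(",") + 1
    return tips, internal


def is_fully_resolved(newick: str) -> bool:
    """True if a rooted tree is strictly bifurcating (internal == tips - 1)."""
    tips, internal = count_nodes(newick)
    return internal == tips - 1


binary = "((a,b),(c,d));"
multifurcating = "((a,b),c,d);"

print(count_nodes(binary), is_fully_resolved(binary))
print(count_nodes(multifurcating), is_fully_resolved(multifurcating))
```

Both example trees have four tips, but the second has only two internal nodes instead of three. Note that a tree written in unrooted form (a trifurcation at the root) would also be reported as not fully resolved by this rooted check.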
BCD ends up scaling quite well to large numbers of taxa, given a relatively low number of input trees. The image below is for 1,800 taxa with on the order of tens of input trees, and BCD performs better than SCS there. It does not appear to scale well to large numbers of input trees, however. I have a folder of birth-death trees with 2,000 taxa (hundreds of source trees): SCS handled it relatively quickly, whereas BCD didn't return anything within approximately an hour.
Part of the reason for BCD's performance is its use of GSCM as a preprocessing step. This produces a rough supertree which the method then refines. When GSCM is disabled, BCD usually runs faster, but with a much worse matching distance. (Is GSCM the cause, or part of the cause, of BCD's problems with very large numbers of trees?) Currently setting up an experiment for 10,000 taxa with source trees of 1,000-2,000 taxa in size.
On second thought, given memory limitations, perhaps 4,000 taxa.
Finished setting up the other datasets. Which method performs better seems largely dependent on the dataset itself. For the datasets with up to 1,000 taxa, BCD outperforms SCS in terms of time. However, for the dataset with up to 10,000 taxa (5,500 on average) and approximately 500 trees per test, SCS finishes in a matter of minutes, whereas I've been waiting for BCD to terminate for over an hour and it is still going.
I think the number of trees is playing a significant role here.
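Since BCD can run for hours here, the timing comparison probably wants a hard wall-clock limit. Below is a minimal sketch of how that could be done, assuming each method can be launched as a separate command. `time_command` is a hypothetical helper, and the actual command lines for SCS and BCD are not shown because they depend on the local setup.

```python
import subprocess
import sys
import time
from typing import Optional, Sequence


def time_command(cmd: Sequence[str], timeout_s: float) -> Optional[float]:
    """Run cmd and return its wall-clock time in seconds.

    Returns None if the command did not finish within timeout_s
    (subprocess.run kills the child process in that case).
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    start = time.perf_counter()
    try:
        subprocess.run(
            cmd,
            check=True,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return None
    return time.perf_counter() - start


# Stand-in commands only; replace with the real SCS / BCD invocations.
fast = time_command([sys.executable, "-c", "pass"], timeout_s=30)
slow = time_command([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
print("fast:", fast)
print("slow:", "timed out" if slow is None else slow)
```

Recording `None` for a timeout keeps runs that never finished distinct from slow runs, so they can be reported separately rather than skewing an average.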
Re-reading the paper, the best parameterisations for BCD on this dataset took 2-16 hours to solve each task, and the worst took almost 3 days. I wonder how the distance from the generated trees to the model tree compares for BCD vs SCS considering the time difference, if BCD finishes running (and the matching distance is relatively computationally expensive as well).
See #19 for more results comparing SCS and BCD after adding branch lengths.
And 45 minutes later, BCD is still going.
Another supertree paper that I found (Tandy Warnow later published an overview of supertree methods and said this one was pretty good; I have to find that paper again...).
Paper: https://academic.oup.com/mbe/article/34/9/2408/3925278?login=false
It also contains more datasets, and performs some experiments which look relatively thorough. In our paper we could aim to replicate something like that.