kgori / treeCl

Clustering phylogenetic trees with python
MIT License

Spectral clustering is inconsistent #20

Closed alexweisberg closed 5 years ago

alexweisberg commented 5 years ago

I am analyzing a dataset using geodesic distances and spectral clustering. For each partition size I run multiple tree searches (10 each) and select the highest-likelihood tree from those 10 searches for each partition.

When I run treeCl with Ward's clustering of RF distances this works fine. However, when I use spectral clustering of geodesic distances I get slightly different partitions between replicates. This is probably because beyond a certain number of partitions (i.e. ~5) some gene trees fit equally well in different partitions. I am running into the same problem with bootstrap trees.

My question is: is there a way to set the random seed for spectral clustering, if one exists? Alternatively, would there be a way to adjust the RAxML task manager so that it performs 10 tree searches and selects the best tree, rather than my doing this by hand?

This is an aside, but documentation for running parametric bootstraps would be great as well. I can generate simulated alignments using the simulate() function and a partition, however it is unclear how to proceed from there within treeCL. Thanks!

kgori commented 5 years ago

Hi, I use the spectral clustering implementation from scikit-learn. This uses numpy for its random number generation, so setting a seed via numpy.random.seed(N) will give reproducible cluster assignments.

RAxML has the "-#/-N" option for running multiple tree searches. The treeCl RAxML wrapper is very much simplified and doesn't give access to this option right now. However, it would be very simple to add, so I will put it in the next release.

Adding documentation for the parametric and nonparametric bootstraps is on my to-do list, and I will try to get to it in the next couple of days. Best, Kevin
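For illustration, here is a minimal sketch of the seeding approach using scikit-learn directly rather than through treeCl. The helper `cluster_with_seed`, the toy 2-D points standing in for a tree distance matrix, and the Gaussian kernel with `sigma=1.0` are all assumptions made for the example; treeCl's own conversion from distances to affinities may differ.

```python
import numpy as np
from sklearn.cluster import SpectralClustering


def cluster_with_seed(distances, n_clusters, seed, sigma=1.0):
    """Spectral clustering of a distance matrix with a fixed numpy seed.

    Hypothetical helper: the Gaussian kernel used here is an assumption,
    not necessarily how treeCl builds its affinity matrix.
    """
    # Seed numpy's global generator, which scikit-learn falls back on
    # when no random_state is passed to the estimator.
    np.random.seed(seed)
    affinity = np.exp(-distances ** 2 / (2.0 * sigma ** 2))
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(affinity)


# Toy data: three groups of 2-D points standing in for gene trees.
rng = np.random.RandomState(0)
points = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(10, 2)) for c in (0.0, 3.0, 6.0)]
)
# Pairwise Euclidean distances play the role of the tree distance matrix.
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

labels_a = cluster_with_seed(distances, n_clusters=3, seed=42)
labels_b = cluster_with_seed(distances, n_clusters=3, seed=42)
print(np.array_equal(labels_a, labels_b))
```

The seed has to be set immediately before each clustering call, because any other use of numpy's global generator in between changes the sequence of random numbers. When calling scikit-learn directly, passing `random_state=seed` to `SpectralClustering` is an alternative that avoids touching global state.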

kgori commented 5 years ago

From version 0.1.35 you can run RAxML (e.g. through Collection.calc_trees(task_interface=RaxmlTaskInterface(), ...)) with the option n_starts=N to have RAxML search for the best tree from N random starting points. The best result among the N will be returned.
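The "best of N starts" idea can be sketched in plain Python as follows. This is not treeCl or RAxML code: `best_of_n_starts`, `toy_search`, and the `lnl` key are hypothetical names, and the toy search simply returns a made-up log-likelihood so that the selection logic can be run on its own.

```python
import random


def best_of_n_starts(search, n_starts, seed=0):
    """Run `search` from n_starts random starting seeds and keep the
    result with the highest log-likelihood (hypothetical helper)."""
    rng = random.Random(seed)
    results = [search(rng.randrange(2 ** 31)) for _ in range(n_starts)]
    return max(results, key=lambda result: result["lnl"])


def toy_search(start_seed):
    """Stand-in for a single tree search; the score is invented."""
    rng = random.Random(start_seed)
    return {"seed": start_seed, "lnl": -1000.0 - 50.0 * rng.random()}


best = best_of_n_starts(toy_search, n_starts=10, seed=1)
print(best["seed"], best["lnl"])
```

Fixing the outer seed makes the whole selection repeatable, which is the same property the seeded clustering gives for partition assignments.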

alexweisberg commented 5 years ago

Thank you! I will update and try those things. TreeCL has been a great help in my research and I am currently writing several manuscripts using data from it.