niemasd / TreeCluster

Efficient phylogenetic clustering of viral sequences
GNU General Public License v3.0

New functionality requested for TreeCluster #11

Closed: drdna closed this issue 8 months ago

drdna commented 8 months ago

Please add functionality that reports how DIFFERENT the clusters are from one another. Currently, the program only identifies taxa that are within a certain distance of one another; it does not reveal whether other groups would be pulled in by just a small change in the bootstrap/distance threshold.

niemasd commented 8 months ago

Thanks for reaching out! We don't plan to add clustering comparison functionality directly into TreeCluster (the goal of TreeCluster is to go from Input = Phylogeny to Output = Clusters), but if you have two different clusterings, you can compare them with standard clustering comparison metrics. I highly recommend the ones implemented in scikit-learn:

https://scikit-learn.org/stable/modules/classes.html#clustering-metrics

The following script I wrote for another project should work for your purposes:

https://github.com/niemasd/FAVITES/blob/master/helper_scripts/score_clusters.py

You give it two clustering files (via the -q and -r arguments) and a clustering comparison metric (via the -m argument), and it calculates that metric for you. I think it should support the TreeCluster file format out-of-the-box, but if not, it should at least serve as example code for how to use the clustering comparison metrics in scikit-learn.
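For readers who want to see the general idea without opening the script, here is a minimal sketch of comparing two clusterings with scikit-learn. It is not the code in score_clusters.py. The file layout (a header row, then tab-separated name and cluster number, with -1 marking an unclustered sequence), the header names, the helper names read_clustering and compare, and the choice to treat the reference as the "true" labels are all assumptions made for illustration.

```python
import csv
import io

from sklearn.metrics import adjusted_rand_score, homogeneity_completeness_v_measure


def read_clustering(handle):
    """Read a tab-delimited clustering into {sequence_name: cluster_label}.

    Assumed layout: one header row, then `name<TAB>cluster` per sequence,
    where a cluster of -1 means the sequence was left unclustered.
    """
    labels = {}
    rows = csv.reader(handle, delimiter="\t")
    next(rows, None)  # skip the header row
    for row in rows:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        name, cluster = row[0], row[1]
        # Give every unclustered sequence its own label so that all the
        # -1 rows are not scored as one large cluster.
        labels[name] = f"singleton:{name}" if cluster == "-1" else cluster
    return labels


def compare(query, reference):
    """Score `query` against `reference` on the sequences they share."""
    shared = sorted(query.keys() & reference.keys())
    # Encode cluster labels as small integers before handing them to scikit-learn.
    ref_ids, qry_ids = {}, {}
    ref = [ref_ids.setdefault(reference[n], len(ref_ids)) for n in shared]
    qry = [qry_ids.setdefault(query[n], len(qry_ids)) for n in shared]
    # Assumption: the reference plays the role of labels_true.
    hom, com, vm = homogeneity_completeness_v_measure(ref, qry)
    return {"HOM": hom, "COM": com, "VM": vm, "ARI": adjusted_rand_score(ref, qry)}


# Toy data standing in for two clustering files (made up, not real output).
loose = io.StringIO("SequenceName\tClusterNumber\nA\t1\nB\t1\nC\t1\nD\t1\nE\t-1\n")
strict = io.StringIO("SequenceName\tClusterNumber\nA\t1\nB\t1\nC\t2\nD\t2\nE\t-1\n")
scores = compare(read_clustering(loose), read_clustering(strict))
for metric, value in scores.items():
    print(f"{metric}: {value:.6f}")
```

Swapping the query and the reference swaps homogeneity and completeness, while V-measure and the adjusted Rand index are symmetric, so it is worth being deliberate about which clustering goes where.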

niemasd commented 8 months ago

I went ahead and copied that script to this repository and made some minor cosmetic updates:

https://github.com/niemasd/TreeCluster/blob/master/helper_scripts/score_clusters.py

I also added an example dataset and tested it successfully on the example:

$ ./TreeCluster.py -i example/example_hiv.nwk -t 0.05 -m max -o clustering1.tsv
$ ./TreeCluster.py -i example/example_hiv.nwk -t 0.01 -m max -o clustering2.tsv
$ ./helper_scripts/score_clusters.py -q clustering1.tsv -r clustering2.tsv -m HCV
HOM: 0.914424
COM: 1.000000
VM: 0.955299
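If the goal is to see which groups get pulled together when the threshold is relaxed, a summary score may not be enough on its own. The sketch below is a hypothetical helper, not part of TreeCluster or score_clusters.py: it assumes each clustering has already been loaded into a dict mapping sequence name to cluster label, and it lists every cluster of the looser clustering that spans more than one cluster of the stricter one.

```python
from collections import defaultdict


def merged_groups(strict, loose):
    """Map each loose cluster to the strict clusters whose members it contains.

    Both arguments are dicts of sequence name -> cluster label. Sequences
    missing from either clustering are ignored.
    """
    overlap = defaultdict(set)
    for name in strict.keys() & loose.keys():
        overlap[loose[name]].add(strict[name])
    # Keep only the loose clusters that span more than one strict cluster.
    return {c: sorted(parts) for c, parts in overlap.items() if len(parts) > 1}


# Toy labels (made up): strict clusters 1 and 2 merge under the looser threshold.
strict = {"A": "1", "B": "1", "C": "2", "D": "2", "E": "3", "F": "3"}
loose = {"A": "1", "B": "1", "C": "1", "D": "1", "E": "2", "F": "2"}
print(merged_groups(strict, loose))  # {'1': ['1', '2']}
```

If unclustered sequences share a placeholder label such as -1, give each one a unique label before calling the helper; otherwise they will be reported as a single group.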

You'll still want to read through the scikit-learn documentation to make sure you use an appropriate clustering metric, but once you've picked the metric(s) that make the most sense, you should be able to use this script to compare your clusterings.