agitter opened this issue 6 years ago
Here are some visualizations of the hierarchical clustering with different distance functions and the corresponding silhouette coefficients. `yule` gives the highest score. efbd7d50ec65077399c503a20511f23689c57923 also adds an example that checks the hierarchical clustering against our UMAP visualization.
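A minimal sketch of this kind of metric comparison, with toy boolean features standing in for the real image data (the actual features live in the notebook):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy stand-in for the image feature matrix; boolean so metrics like 'yule' are defined.
X = rng.random((60, 16)) > 0.5

scores = {}
for metric in ["yule", "cosine", "jaccard"]:
    D = pdist(X, metric=metric)                      # condensed distance matrix
    Z = linkage(D, method="average")                 # hierarchical clustering
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    scores[metric] = silhouette_score(squareform(D), labels, metric="precomputed")
print(scores)
```

Because the distances are precomputed, `silhouette_score` needs `metric="precomputed"` and the square form of the condensed matrix.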
Nice plots.
For the final paper/report/presentation, I would suggest clustering all data points but plotting only a subset for visualization. Make sure you draw that subset of points i.i.d. once, and reuse the same subset when comparing different metrics.
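A sketch of fixing the plotting subset once so it stays identical across metrics (the sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the subsample is reproducible
n_points, n_plot = 10_000, 500    # hypothetical dataset / plot sizes

# Draw the plotting subset ONCE, then reuse it for every metric's figure, e.g.
# plt.scatter(embedding[plot_idx, 0], embedding[plot_idx, 1], c=labels[plot_idx])
plot_idx = rng.choice(n_points, size=n_plot, replace=False)
```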
Besides, how to evaluate the clustering is a separate issue. I think what @agitter suggests now is to see which metric best aligns with the fingerprint-based clustering. The best evaluation method is always to put the clustering back into the problem setting and see which metric/algorithm best fits the goal.
There are also cases where people don't have a specific problem setting and just want to check clustering quality. sklearn has some useful functions for this, like the silhouette coefficient and the Calinski-Harabasz index.
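For example, both scores are one call each in sklearn (synthetic blobs standing in for the real features):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Synthetic, well-separated blobs as a stand-in for the real data.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1]; higher is better
ch = calinski_harabasz_score(X, labels)  # unbounded; higher is better
print(sil, ch)
```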
We agree that the metrics that produce >= 2 clusters and have a silhouette score > 0.37 look reasonable for the most part. There are some exceptions (e.g. `sokalsneath`, which produces many clusters). We can use the adjusted Rand index to assess whether the other metrics actually produce the same clusters, that is, whether the purple cells in `yule` are the same as the red cells in `cosine`.
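The adjusted Rand index is label-invariant, so it handles exactly this purple-vs-red situation. A sketch with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels from two metrics; only the grouping matters,
# not the label names (purple in one plot vs. red in another is fine).
yule_labels = [0, 0, 0, 1, 1, 2, 2, 2]
cosine_labels = [1, 1, 1, 0, 0, 2, 2, 2]

ari = adjusted_rand_score(yule_labels, cosine_labels)
print(ari)  # 1.0: the two partitions are identical up to relabeling
```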
Once we choose an image clustering and distance metric, we can compare that clustering to the clustering of chemicals. Computational chemists traditionally cluster chemicals by computing the ECFP fingerprint (bit vector) and using the Tanimoto similarity, which is either similar to or equivalent to the Jaccard index. `jaccard` is an option in `scipy.spatial.distance`.
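On bit vectors, scipy's `jaccard` distance is 1 minus the Tanimoto similarity, so the fingerprint clustering can reuse the same `pdist` machinery. Toy bit vectors here stand in for real ECFP fingerprints (which would come from a cheminformatics toolkit such as RDKit):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy fingerprint bit vectors standing in for ECFP.
fps = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
], dtype=bool)

# For bit vectors, Jaccard distance = 1 - Tanimoto similarity.
D = squareform(pdist(fps, metric="jaccard"))
print(D)
```

Here the first two fingerprints share 2 of 3 set bits, so their distance is 1 - 2/3 = 1/3, while the disjoint third fingerprint is at distance 1 from both.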
The chemicals will cluster into more, and smaller, groups. We may not be able to directly compare the two clusterings with the adjusted rand index.
cfa90779c46fcda5f43e00d512f80d755515cfda adds the adjusted rand index comparison of some potential distance functions.
With `braycurtis`, all smaller clusters are subsets of the smaller cluster of `cosine`. We will use `cosine` as our final distance function. See the notebook for more details.
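One way to verify a subset relationship like this is the contingency matrix: if every cluster from one metric falls entirely inside a single cluster from the other, each row of the matrix has exactly one nonzero entry. A sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Made-up labels where braycurtis splits cosine's cluster 0 into two subclusters.
braycurtis_labels = [0, 0, 1, 1, 2, 2, 2]
cosine_labels = [0, 0, 0, 0, 1, 1, 1]

C = contingency_matrix(braycurtis_labels, cosine_labels)
# True when every braycurtis cluster is a subset of one cosine cluster.
is_refinement = (np.count_nonzero(C, axis=1) == 1).all()
print(is_refinement)
```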
0a70e259a2a21e8c3005c1d12d8bd24493adbd1b adds the distance function comparison for the ECFP of all compounds in the dataset.
Some metrics give small groups while others give bigger ones. `jaccard` looks reasonable here.
Start by clustering the transformed images with http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Check whether compounds cluster together. Then try comparing image clustering with fingerprint clustering.
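A minimal starting point with `AgglomerativeClustering` on synthetic features standing in for the transformed images:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the transformed image features.
X, _ = make_blobs(n_samples=100, centers=4, random_state=0)

# 'average' linkage also supports non-Euclidean distances (the parameter is
# named 'metric' or 'affinity' depending on the sklearn version), including
# a precomputed distance matrix for custom metrics like yule.
model = AgglomerativeClustering(n_clusters=4, linkage="average")
labels = model.fit_predict(X)
print(labels[:10])
```

With a precomputed matrix, the same call works for any of the `scipy.spatial.distance` metrics discussed above.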