agitter opened this issue 6 years ago
Here are some visualizations of the hierarchical clustering with different distance functions and the corresponding silhouette coefficients. `yule` gives the highest score. efbd7d50ec65077399c503a20511f23689c57923 also adds an example that checks the hierarchical clustering against our UMAP visualization.
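A minimal sketch of this kind of metric comparison, with toy boolean features standing in for the real image data (the actual features live in the notebook):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy stand-in for the image feature matrix; boolean so metrics like 'yule' are defined.
X = rng.random((60, 16)) > 0.5

scores = {}
for metric in ["yule", "cosine", "jaccard"]:
    D = pdist(X, metric=metric)                      # condensed distance matrix
    Z = linkage(D, method="average")                 # hierarchical clustering
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    scores[metric] = silhouette_score(squareform(D), labels, metric="precomputed")
print(scores)
```

Because the distances are precomputed, `silhouette_score` needs `metric="precomputed"` and the square form of the condensed matrix.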
Nice plots.
For the final paper/report/presentation, I would suggest clustering all data points but plotting only a subset for visualization. Make sure you draw that subset of points i.i.d. once, and reuse the same subset when comparing different metrics.
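A sketch of fixing the plotting subset once so it stays identical across metrics (the sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the subsample is reproducible
n_points, n_plot = 10_000, 500    # hypothetical dataset / plot sizes

# Draw the plotting subset ONCE, then reuse it for every metric's figure, e.g.
# plt.scatter(embedding[plot_idx, 0], embedding[plot_idx, 1], c=labels[plot_idx])
plot_idx = rng.choice(n_points, size=n_plot, replace=False)
```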
Besides, how to evaluate the clustering is a separate issue. I think what @agitter suggests now is to see which metric best aligns with the fingerprint-based clustering. The best evaluation method is always to put the clustering back into the problem setting and see which metric/algorithm best fits the goal.
There are also cases where people don't have a specific problem setting and just want to check clustering quality. sklearn has some useful functions for this, like the silhouette coefficient and the Calinski-Harabasz index.
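For example, both scores are one call each in sklearn (synthetic blobs standing in for the real features):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Synthetic, well-separated blobs as a stand-in for the real data.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1]; higher is better
ch = calinski_harabasz_score(X, labels)  # unbounded; higher is better
print(sil, ch)
```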
We agree that the metrics that produce >= 2 clusters and have a silhouette score > 0.37 look reasonable for the most part. There are some exceptions (e.g. `sokalsneath`, which produces many clusters). We can use the adjusted Rand index to assess whether the other metrics actually produce the same clusters, that is, whether the purple cells in `yule` are the same as the red cells in `cosine`.
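The adjusted Rand index is label-invariant, so it handles exactly this purple-vs-red situation. A sketch with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels from two metrics; only the grouping matters,
# not the label names (purple in one plot vs. red in another is fine).
yule_labels = [0, 0, 0, 1, 1, 2, 2, 2]
cosine_labels = [1, 1, 1, 0, 0, 2, 2, 2]

ari = adjusted_rand_score(yule_labels, cosine_labels)
print(ari)  # 1.0: the two partitions are identical up to relabeling
```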
Once we choose an image clustering and distance metric, we can compare that clustering to the clustering of chemicals. Computational chemists traditionally cluster chemicals by computing the ECFP fingerprint (bit vector) and using the Tanimoto similarity, which is either similar to or equivalent to the Jaccard index. `jaccard` is an option in `scipy.spatial.distance`.
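On bit vectors, scipy's `jaccard` distance is 1 minus the Tanimoto similarity, so the fingerprint clustering can reuse the same `pdist` machinery. Toy bit vectors here stand in for real ECFP fingerprints (which would come from a cheminformatics toolkit such as RDKit):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy fingerprint bit vectors standing in for ECFP.
fps = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
], dtype=bool)

# For bit vectors, Jaccard distance = 1 - Tanimoto similarity.
D = squareform(pdist(fps, metric="jaccard"))
print(D)
```

Here the first two fingerprints share 2 of 3 set bits, so their distance is 1 - 2/3 = 1/3, while the disjoint third fingerprint is at distance 1 from both.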
The chemicals will cluster into more, and smaller, groups. We may not be able to directly compare the two clusterings with the adjusted rand index.
cfa90779c46fcda5f43e00d512f80d755515cfda adds the adjusted rand index comparison of some potential distance functions.
With `braycurtis`, all smaller clusters are subsets of the smaller cluster of `cosine`. We will use `cosine` as our final distance function. See the notebook for more details.
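One way to verify a subset relationship like this is the contingency matrix: if every cluster from one metric falls entirely inside a single cluster from the other, each row of the matrix has exactly one nonzero entry. A sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Made-up labels where braycurtis splits cosine's cluster 0 into two subclusters.
braycurtis_labels = [0, 0, 1, 1, 2, 2, 2]
cosine_labels = [0, 0, 0, 0, 1, 1, 1]

C = contingency_matrix(braycurtis_labels, cosine_labels)
# True when every braycurtis cluster is a subset of one cosine cluster.
is_refinement = (np.count_nonzero(C, axis=1) == 1).all()
print(is_refinement)
```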
0a70e259a2a21e8c3005c1d12d8bd24493adbd1b adds the distance function comparison for the ECFP of all compounds in the dataset.
Some metrics give small groups while others give bigger ones. `jaccard` looks reasonable here.
Start by clustering the transformed images with http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Check whether compounds cluster together. Then try comparing image clustering with fingerprint clustering.
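A minimal starting point with `AgglomerativeClustering` on synthetic features standing in for the transformed images:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the transformed image features.
X, _ = make_blobs(n_samples=100, centers=4, random_state=0)

# 'average' linkage also supports non-Euclidean distances (the parameter is
# named 'metric' or 'affinity' depending on the sklearn version), including
# a precomputed distance matrix for custom metrics like yule.
model = AgglomerativeClustering(n_clusters=4, linkage="average")
labels = model.fit_predict(X)
print(labels[:10])
```

With a precomputed matrix, the same call works for any of the `scipy.spatial.distance` metrics discussed above.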