lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Lackluster clustering performance #706

Open joaogui1 opened 3 years ago

joaogui1 commented 3 years ago

Hi! I tried following the clustering guide, comparing t-SNE vs. UMAP and k-means vs. HDBSCAN, but in the end UMAP and HDBSCAN performed pretty badly. Can you help me understand why? The dataset is Imagenette (an easier subset of ImageNet with only 10 classes), and I passed it through a pretrained ResNet to get the features.

Link to the colab

Thanks in advance for the help!

lmcinnes commented 3 years ago

I think the short answer is that, unfortunately, this approach isn't magic, and in this case it seems it can't match the labelling. It should also be noted that there is no guarantee that clusters will match labels. Passing the images through a ResNet to generate features should mean that the cluster structure at least somewhat resembles the class/label structure, but that doesn't seem to be the case here: none of the techniques are doing that well. I would suggest looking at a 2D or 3D UMAP embedding coloured by the labels to see how much correlation there actually is between the qualitative clusters and the labels. That may give you some ideas about how things could be improved.
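A minimal sketch of that check, assuming `features` is an `(n_samples, n_features)` array of ResNet outputs and `labels` holds integer class ids. The names `features`, `labels`, `cluster_label_table`, `purity`, and `plot_umap_by_label` are hypothetical, not taken from the colab. The two helpers use only the standard library; the plotting function assumes `umap-learn` and `matplotlib` are installed.

```python
from collections import Counter, defaultdict


def cluster_label_table(cluster_ids, labels):
    """Count how many points of each true label fall in each cluster."""
    table = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        table[cluster][label] += 1
    return {cluster: dict(counts) for cluster, counts in table.items()}


def purity(cluster_ids, labels):
    """Fraction of points whose label is the majority label of their cluster.

    Note: HDBSCAN marks noise points with cluster id -1; here they are
    simply treated as one more cluster (an assumption of this sketch).
    """
    table = cluster_label_table(cluster_ids, labels)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(labels)


def plot_umap_by_label(features, labels, random_state=42):
    """Embed features into 2D with UMAP and scatter-plot them coloured by label.

    Imports are local so the helpers above work without these packages.
    """
    import matplotlib.pyplot as plt
    import umap

    embedding = umap.UMAP(n_components=2, random_state=random_state).fit_transform(features)
    fig, ax = plt.subplots(figsize=(8, 8))
    # labels must be numeric so matplotlib can map them to colours
    scatter = ax.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=3)
    ax.legend(*scatter.legend_elements(), title="label", loc="best")
    ax.set_title("UMAP embedding coloured by label")
    return embedding, fig
```

If the colours are already heavily mixed in the plot, no clustering algorithm run on that embedding is likely to recover the labels. `cluster_label_table` then shows which classes end up merged into the same cluster, and `purity` gives a rough single number for comparing the different pipelines.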