pleonard212 / pix-plot

A WebGL viewer for UMAP or TSNE-clustered images
MIT License
597 stars, 139 forks

select "best" UMAP layout for clustering #199

Open kruus opened 3 years ago

kruus commented 3 years ago

Suggestion

Current pixplot clustering uses features from ...['variants][0][... Often this is the layout that looks the worst (it is often the lowest n_neighbors embedding), and in practice it rarely agrees with the clusters I'd like to lasso.

The UMAP clustering docs and examples (and experience) suggest a reasonable approach would be to cluster based on

(*) the layout with the highest n_neighbors (and then the lowest min_dist)

So what I do is

  default_hotspots = ""
  umap_vecs = best_umap_clustering_json(layouts=layouts,**kwargs)
  if umap_vecs is not None:
    default_hotspots = get_hotspots(vecs=read_json(umap_vecs, **kwargs), **kwargs)

where best_umap_clustering_json simply performs the search (*) and returns the filename

https://github.com/kruus/pix-plot/blob/8a1cd231ce20cc075b9cb72c8ebeda97fdfb335c/pixplot/pixplot.py#L1219-L1244
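For readers who don't want to follow the link, the search (*) might look roughly like the sketch below. This is not the code at the URL above; it assumes each UMAP variant is a dict with `n_neighbors`, `min_dist` and `layout` (filename) keys, and `pick_clustering_variant` is a hypothetical helper name.

```python
from typing import Dict, List, Optional


def pick_clustering_variant(variants: List[Dict]) -> Optional[Dict]:
    """Pick the variant with the highest n_neighbors, then the lowest min_dist.

    Assumption: every usable variant carries 'n_neighbors' and 'min_dist' keys.
    Returns None when no variant has both, so the caller can skip clustering.
    """
    candidates = [v for v in variants if "n_neighbors" in v and "min_dist" in v]
    if not candidates:
        return None
    # Sort key: larger n_neighbors wins; among ties, smaller min_dist wins.
    return max(candidates, key=lambda v: (v["n_neighbors"], -v["min_dist"]))


# Illustrative data only; the filenames are made up.
variants = [
    {"n_neighbors": 2, "min_dist": 0.01, "layout": "umap-2-0.01.json"},
    {"n_neighbors": 50, "min_dist": 0.5, "layout": "umap-50-0.5.json"},
    {"n_neighbors": 50, "min_dist": 0.01, "layout": "umap-50-0.01.json"},
    {"n_neighbors": 15, "min_dist": 0.1, "layout": "umap-15-0.1.json"},
]
best = pick_clustering_variant(variants)
```

Returning None instead of raising mirrors the `if umap_vecs is not None` guard in the snippet above.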

Erik

kruus commented 3 years ago

P.S. hotspot scrollbar was not appearing in cases where it should. But my styles.css hacks to get the scrollbar to reappear nicely are kinda' ugly :(

pleonard212 commented 3 years ago

Hi Erik, thanks for your thoughtful comment! This is in line with a discussion we've been having internally about where and when to cluster, given the newly-landed optional hyperparameter arrays (n_neighbors and min_dist) that you can pass at analysis time.

One idea we were kicking around was doing the clustering in the original high-dimensional space (2048 dimensions). Then each resulting UMAP projection would visually represent, via the hover-on-mouseover affordance, how well that layout captured the clustering that hdbscan saw in the original space. The user could then determine, via the two hyperparameter sliders, which projection worked best as a basis to start editing and curating.

I have to admit the above is typed without any personal experience in clustering in such a high-dimensional space, and so it's possible this would take way too long, or would produce nonsense in any 2d projection, etc. A possible hybrid model would be to run a special umap reduction purely for the purposes of clustering, to give hdbscan something like 10 dims to work with... the points McInnes makes about the differing needs of visualization vs clustering in the documentation you point out are great ones and we should really take that into consideration too. I'm sure @duhaime will chime in here shortly!

kruus commented 3 years ago

Clustering the hi-D space might be a good option to have available, at least just for comparisons. So far using the highest-D UMAP space "works well for me", as I'm actually interested in seeing the connectivity of the hi-D manifold. For novice users, auto-adding a "special umap reduction", in case a reasonable one cannot be found, seems a nice touch @pleonard212! My version just emits a warning (one that doesn't even check min_dist), which is a less good idea. Related: Issue #36.

I guess a visual consequence of adding a potentially off-grid umap layout would be amalgamating the neighbors and min_dist info into a single layouts slider. The slider would report the 2 values of the (somehow sorted) layout index.
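A minimal sketch of that single-slider idea, assuming the layouts are sorted by (n_neighbors, min_dist) and each slider position reports both values. The `slider_entries` name and the dict keys are hypothetical, not part of pixplot, and the real viewer would do this in JavaScript.

```python
from typing import Dict, List


def slider_entries(variants: List[Dict]) -> List[Dict]:
    """Flatten (n_neighbors, min_dist) variants into one ordered slider.

    Assumption: sorting by n_neighbors, then min_dist, is an acceptable
    order; an off-grid layout simply lands wherever its values sort.
    """
    ordered = sorted(variants, key=lambda v: (v["n_neighbors"], v["min_dist"]))
    return [
        {
            "index": i,
            "label": f"n_neighbors={v['n_neighbors']}, min_dist={v['min_dist']}",
            "variant": v,
        }
        for i, v in enumerate(ordered)
    ]


# A 2x2 hyperparameter grid plus one off-grid layout added only for clustering.
grid = [{"n_neighbors": n, "min_dist": d} for n in (5, 15) for d in (0.1, 0.5)]
special = {"n_neighbors": 30, "min_dist": 0.0}
entries = slider_entries(grid + [special])
labels = [e["label"] for e in entries]
```

With this ordering the off-grid layout becomes the last slider stop, and the label tells the user which pair of values that stop corresponds to.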