Open kruus opened 3 years ago
P.S. hotspot scrollbar was not appearing in cases where it should. But my styles.css
hacks to get the scrollbar to reappear nicely are kinda' ugly :(
Hi Erik, thanks for your thoughtful comment! This is in line with a discussion we've been having internally about where and when to cluster, given the newly-landed optional hyperparameter arrays (n_neighbors and min_dist) that you can pass at analysis time.
One idea we were kicking around was doing the clustering in the original high-dimensional space (2048). Then each resulting UMAP projection would visually represent, via the hover-on-mouseover affordance, how well that layout captured the clustering that hdbscan saw in the original space. The user could then make a determination, via the two hyperparameter sliders, which projection worked best as a basis to start editing and curating.
I have to admit the above is typed without any personal experience in clustering in such a high-dimensional space, and so it's possible this would take way too long, or would produce nonsense in any 2d projection, etc. A possible hybrid model would be to run a special umap reduction purely for the purposes of clustering, to give hdbscan like 10dims to work with... the points McInnes makes about the differing needs of visualization vs clustering in the documentation you point out are great ones and we should really take that into consideration too. I'm sure @duhaime will chime in here shortly!
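The hybrid idea could look roughly like the sketch below: a separate UMAP reduction to about 10 dimensions that is used only for clustering, followed by hdbscan. It assumes the umap-learn and hdbscan packages are installed. The function names and the parameter values (n_neighbors=30, min_dist=0.0, min_cluster_size=20) are illustrative choices in the spirit of the UMAP clustering docs, not pixplot's actual settings.

```python
def clustering_umap_params(n_components=10, n_neighbors=30):
    """Hypothetical helper: UMAP settings aimed at clustering, not display."""
    return {
        "n_components": n_components,  # more than 2 dims gives hdbscan more to work with
        "n_neighbors": n_neighbors,    # larger values favor global structure
        "min_dist": 0.0,               # pack points densely so clusters stay tight
    }


def cluster_via_umap(features, min_cluster_size=20, **umap_overrides):
    """Reduce high-dimensional features with UMAP, then cluster with hdbscan.

    `features` is an (n_samples, n_dims) array, e.g. 2048-dim image vectors.
    Returns one integer label per sample; -1 marks points hdbscan calls noise.
    """
    # Imported lazily so this module loads even without the optional packages.
    import hdbscan
    import umap

    params = {**clustering_umap_params(), **umap_overrides}
    embedding = umap.UMAP(**params).fit_transform(features)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embedding)
```

The 2d layouts shown to the user would stay untouched; only the cluster labels would come from this extra reduction.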
Clustering the hi-D space might be a good option to have available, at least just for comparisons. So far using the highest-D UMAP space "works well for me", as I'm actually interested in seeing the connectivity of the hi-D manifold. For novice users, auto-adding a "special umap reduction", in case a reasonable one cannot be found, seems a nice touch @pleonard212! I only emit a warning (one that doesn't even check min_dist), which is a less good idea. Related: Issue #36.
I guess a visual consequence of adding a potentially off-grid umap layout would be amalgamating the neighbors and min_dist info into a single layouts slider. The slider would report the 2 values of the (somehow sorted) layout index.
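One way the single layouts slider might work, as a rough sketch: sort the available (n_neighbors, min_dist) pairs once, and let the slider position index into that list. The sort order here (n_neighbors ascending, then min_dist ascending) and the function names are assumptions for illustration. The pixplot viewer itself is JavaScript, so this Python only shows the indexing idea.

```python
def build_layout_index(layouts):
    """Sort (n_neighbors, min_dist) pairs into one slider-friendly order.

    Duplicates are dropped, so an off-grid layout can be added freely.
    """
    return sorted(set(layouts))


def slider_label(layout_index, position):
    """Return the text the slider would show at a given position."""
    n_neighbors, min_dist = layout_index[position]
    return f"n_neighbors={n_neighbors}, min_dist={min_dist}"


# A 2x2 grid of layouts plus one off-grid layout added just for clustering.
grid = [(5, 0.1), (5, 0.5), (15, 0.1), (15, 0.5)]
index = build_layout_index(grid + [(30, 0.0)])
print(slider_label(index, len(index) - 1))
```

With this ordering the off-grid (30, 0.0) layout lands at the far end of the slider rather than breaking the grid.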
Suggestion
Current pixplot clustering uses features from ... ['variants][0][ ... Often this is the clustering that looks the worst (often the lowest n_neighbors embedding), and in practice it rarely agrees with the clusters I'd like to lasso.
UMAP clustering docs and examples (and experience) suggest a reasonable approach would be to cluster based on ...
So what I do is ... where best_umap_clustering_json simply does the search (*) and returns the filename:
https://github.com/kruus/pix-plot/blob/8a1cd231ce20cc075b9cb72c8ebeda97fdfb335c/pixplot/pixplot.py#L1219-L1244
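For readers who don't want to follow the link, here is a rough sketch of what such a search could look like. It is not the linked code: the function name pick_clustering_layout, the dictionary keys, and the ranking rule (prefer the largest n_neighbors, break ties with the smallest min_dist, in line with the UMAP clustering docs) are all assumptions.

```python
def pick_clustering_layout(variants):
    """Pick the UMAP variant that looks most suitable for clustering.

    `variants` is a list of dicts with hypothetical keys 'n_neighbors',
    'min_dist' and 'filename'. Returns the chosen filename, or None when
    the list is empty.
    """
    if not variants:
        return None
    # Largest n_neighbors first; among those, the smallest min_dist.
    best = max(variants, key=lambda v: (v["n_neighbors"], -v["min_dist"]))
    return best["filename"]


variants = [
    {"n_neighbors": 5, "min_dist": 0.1, "filename": "umap-a.json"},
    {"n_neighbors": 15, "min_dist": 0.5, "filename": "umap-b.json"},
    {"n_neighbors": 15, "min_dist": 0.1, "filename": "umap-c.json"},
]
print(pick_clustering_layout(variants))
```

Returning None for an empty list leaves room for a fallback, such as computing a dedicated clustering layout when no suitable variant exists.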
Erik