huggingface / text-clustering

Easily embed, cluster and semantically label text datasets
Apache License 2.0

Datamap plot for plotting; HDBSCAN for clustering #6

Open lmcinnes opened 7 months ago

lmcinnes commented 7 months ago

The DataMapPlot library provides both static and interactive plots (backed by matplotlib + datashader and deck.gl, respectively). It makes rich plotting very easy and takes care of many of the details handled here, as well as several other aspects (palette handling, label placement, interactive search, titles, etc.).

Using HDBSCAN instead of DBSCAN for clustering allows for a single relatively intuitive clustering parameter (min_cluster_size) while still producing good clusterings.

Still cleaning up a few things, but I wanted to open the PR for discussion purposes. Would this be of interest?

lvwerra commented 7 months ago

Awesome, yes, very open to improvements! Happy to replace DBSCAN with HDBSCAN. For the plotting, maybe we can just have three backends? Something like a show(PLOT_LIB, **lib_kwargs) method interface where PLOT_LIB is any of ["mpl", "plotly", "dmp"]?

At the moment the repo is mostly intended as a template people can copy and modify rather than a full library, so the different plotting methods would also serve to show how to customize the plots.

What do you think?

lmcinnes commented 7 months ago

That's a reasonable option. I'll see if I can rework this to work that way.

lmcinnes commented 7 months ago

Sorry for the delay, I had to step away from this for a little bit. I've moved things around so we now support multiple backends. Potentially we could set a default, but forcing the user to decide is also a reasonable option.