Functions and utilities

verri / sledge

SLEDge: semantic evaluation of clustering results

MIT License


Functions and utilities #1

Open verri opened 3 years ago

verri commented 3 years ago

Let's discuss here which functions the package should provide.

Based on the paper, I think the following are a must-have:

[ ] sledge_descriptors(X, labels, minimum_support=0): given a data frame X, where each column corresponds to a pattern and each value x_ij indicates the presence or absence of pattern j in sample i, and the clustering result labels, this function returns a list of sets, each containing the 1-itemsets that describe a cluster (after particularization).
[ ] sledge_score(X, labels, minimum_support=0): same inputs; returns a list containing the SLEDge score for each cluster.

Another optional function I have in mind is something like sledge_matrix, which returns the values of S, L, E, and D for each cluster (that is, before aggregation with the harmonic mean). However, this functionality could instead be incorporated into sledge_score behind some flag.
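To make the "before vs. after aggregation" distinction concrete, here is a minimal sketch. It assumes the four indicators arrive as one (S, L, E, D) tuple per cluster; the names harmonic_mean, aggregate_indicators, and the aggregate flag are hypothetical, and the numbers are made up for illustration. How S, L, E, and D themselves are computed is not covered here.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical layout: one (S, L, E, D) tuple per cluster.
Indicators = Tuple[float, float, float, float]


def harmonic_mean(values: Sequence[float]) -> float:
    """Harmonic mean of non-negative values; zero if any value is zero."""
    if len(values) == 0:
        raise ValueError("need at least one value")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def aggregate_indicators(
    matrix: Sequence[Indicators], aggregate: bool = True
) -> Union[List[float], List[Indicators]]:
    """Return one score per cluster, or the raw rows if aggregate is False."""
    if not aggregate:
        return [tuple(row) for row in matrix]
    return [harmonic_mean(row) for row in matrix]


# Made-up indicator values for three clusters.
example = [
    (1.0, 1.0, 1.0, 1.0),
    (0.5, 0.5, 0.5, 0.5),
    (1.0, 0.5, 0.25, 0.0),
]
scores = aggregate_indicators(example)
raw = aggregate_indicators(example, aggregate=False)
```

Note that a single zero indicator drives the whole cluster score to zero, which is the main practical difference from an arithmetic mean.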

verri commented 3 years ago

Since, conceptually, sledge_score would call sledge_descriptors, it also makes sense to provide a variant of sledge_score that takes the output of sledge_descriptors as its argument. Maybe something like:

sledge_descriptors(X, labels, ...) -> [ [...], [...], ... ]
sledge_score_from_descriptors([ [...], [...], ... ], ...) -> [ ... ]
sledge_score(X, labels, ...) == sledge_score_from_descriptors(sledge_descriptors(X, labels, ...))

verri commented 3 years ago

Moreover, since we want smooth integration with sklearn, both the implementation and the documentation should expect arguments compatible with the outputs of the clustering techniques implemented there. It is also wise to use similar parameter names.

verri commented 3 years ago

I think we should provide at least auxiliary functions to plot the “SLEDge curve” for a given cluster.

I vote for producing output that can be fed directly to matplotlib. Then we write some tutorials that create the plot itself (as sklearn does).

verri commented 3 years ago

Taking a look at sklearn's documentation, I suggest a minor change:

sledge_descriptors: returns a sparse support matrix for clusters vs patterns.
sledge_clusters: returns the SLEDge score for each cluster.
sledge_score: returns the average (or min, or max too?) SLEDge score.
sledge_curve: returns the SLEDge curve.

Rationale:

All *_score functions in sklearn return a single value.
sklearn.metrics.silhouette_samples returns a silhouette score per sample. It makes sense to provide a SLEDge variation per cluster with similar naming.
There are some *_curve functions in sklearn.metrics.
All sklearn.metrics.*_matrix functions return a square matrix, so although sledge_descriptors returns a matrix, the name descriptors is more informative and may avoid misinterpretations.
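As a rough illustration of what a "support matrix for clusters vs patterns" could look like, the sketch below computes, for each cluster, the fraction of its samples that exhibit each pattern. The function name cluster_pattern_support is hypothetical, the data is made up, and the result is a dense pandas DataFrame rather than a sparse matrix. It does not apply minimum_support or the particularization step.

```python
import pandas as pd


def cluster_pattern_support(X: pd.DataFrame, labels) -> pd.DataFrame:
    """Fraction of samples in each cluster that exhibit each pattern.

    X is a binary data frame (rows are samples, columns are patterns);
    labels holds one cluster label per row, e.g. the output of a
    clustering estimator's fit_predict.
    """
    labels = pd.Series(list(labels), index=X.index, name="cluster")
    if len(labels) != len(X):
        raise ValueError("labels must have one entry per row of X")
    # Mean of a 0/1 column within a cluster is that pattern's support there.
    return X.astype(float).groupby(labels).mean()


# Made-up example: four samples, three patterns, two clusters.
X = pd.DataFrame({"a": [1, 1, 0, 0], "b": [1, 0, 0, 0], "c": [0, 0, 1, 1]})
labels = [0, 0, 1, 1]
support = cluster_pattern_support(X, labels)
```

The result has one row per cluster and one column per pattern, so it is generally not square, which matches the point about avoiding the *_matrix name.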

verri commented 3 years ago

Even better:

sledge.semantic_descriptors
sledge.sledge_score_clusters
sledge.sledge_score
sledge.sledge_curve