Open verri opened 3 years ago
Since, conceptually, `sledge_score` would call `sledge_descriptors`, it also makes sense to provide a variation of the `sledge_score` function that takes the output of `sledge_descriptors` as its argument. Maybe something like:
sledge_descriptors(X, labels, ...) -> [ [...], [...], ... ]
sledge_score_from_descriptors([ [...], [...], ... ], ...) -> [ ... ]
sledge_score(X, labels, ...) == sledge_score_from_descriptors(sledge_descriptors(X, labels, ...))
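A minimal sketch of that composition in plain Python, in case it helps the discussion. The descriptor format (per-cluster support of each pattern), the handling of `minimum_support`, and especially the scoring rule are placeholders I am assuming for illustration; the scoring rule is not the actual SLEDge formula.

```python
from collections import defaultdict


def sledge_descriptors(X, labels, minimum_support=0.0):
    """Per-cluster support of each pattern (assumed descriptor format)."""
    rows_by_cluster = defaultdict(list)
    for row, label in zip(X, labels):
        rows_by_cluster[label].append(row)

    descriptors = []
    for label in sorted(rows_by_cluster):
        rows = rows_by_cluster[label]
        # Fraction of the cluster's samples in which each pattern is present.
        support = [sum(column) / len(rows) for column in zip(*rows)]
        # Assumption: supports below the threshold are zeroed out.
        descriptors.append([s if s >= minimum_support else 0.0 for s in support])
    return descriptors


def sledge_score_from_descriptors(descriptors):
    """Placeholder per-cluster score (mean support), NOT the SLEDge formula."""
    return [sum(d) / len(d) for d in descriptors]


def sledge_score(X, labels, minimum_support=0.0):
    """Convenience wrapper: descriptors first, then the score."""
    return sledge_score_from_descriptors(
        sledge_descriptors(X, labels, minimum_support)
    )


# Toy binary data: 4 samples, 3 patterns, 2 clusters.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 1]]
labels = [0, 0, 1, 1]

descriptors = sledge_descriptors(X, labels)
scores = sledge_score(X, labels)
```

The point is only the shape of the API: users who already have the descriptors can skip recomputing them, and the wrapper stays a one-liner.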
Moreover, given the desired smooth integration with sklearn, while implementing/documenting we should expect arguments compatible with the outputs of the clustering techniques implemented there. Also, it is wise to use similar parameter names.
I think we should provide at least auxiliary functions to plot the “SLEDge curve” for a given cluster.
I vote for producing output that can be fed easily to matplotlib. Then we write some tutorials showing how to create the plot itself (as sklearn does).
Taking a look at sklearn's documentation, I suggest a minor change:
- `sledge_descriptors`: returns a sparse support matrix for clusters vs patterns.
- `sledge_clusters`: returns the SLEDge score for each cluster.
- `sledge_score`: returns the average (or min, or max too?) SLEDge score.
- `sledge_curve`: returns the SLEDge curve.

Rationale:
- `*_score` functions in sklearn return a single value.
- `sklearn.metrics.silhouette_samples` returns a silhouette score per sample. It makes sense to provide a SLEDge variation per cluster with similar naming.
- `*_curve` functions exist in `sklearn.metrics`.
- `sklearn.metrics.*_matrix` functions return a square matrix, so although `sledge_descriptors` returns a matrix, the name descriptors is more informative and may avoid misinterpretations.

Even better:
- `sledge.semantic_descriptors`
- `sledge.sledge_score_clusters`
- `sledge.sledge_score`
- `sledge.sledge_curve`
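To make the per-cluster vs. single-value split concrete, here is a rough sketch of how the single value could be derived from the per-cluster scores, mirroring how `silhouette_score` relates to `silhouette_samples` in sklearn. The helper name `aggregate_cluster_scores` and the `aggregation` parameter are hypothetical, and the input scores are made-up numbers.

```python
from statistics import mean

# Supported reductions over the per-cluster scores (assumed options).
_REDUCERS = {"mean": mean, "min": min, "max": max}


def aggregate_cluster_scores(cluster_scores, aggregation="mean"):
    """Reduce per-cluster scores to one value, as a *_score function would."""
    try:
        reducer = _REDUCERS[aggregation]
    except KeyError:
        raise ValueError(f"unknown aggregation: {aggregation!r}") from None
    return reducer(cluster_scores)


# Illustrative per-cluster scores, e.g. what sledge_score_clusters might return.
cluster_scores = [0.75, 0.5, 0.25]

average = aggregate_cluster_scores(cluster_scores)
worst = aggregate_cluster_scores(cluster_scores, aggregation="min")
best = aggregate_cluster_scores(cluster_scores, aggregation="max")
```

With this layout `sledge_score` would stay a thin wrapper around `sledge_score_clusters`, so the question of average vs. min vs. max becomes a parameter rather than a separate function.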
Let's discuss here which functions the package should provide.
Based on the paper, I think the following are a must-have:
- `sledge_descriptors(X, labels, minimum_support=0)`: given a data frame `X`, where each column corresponds to a pattern and each value `x_ij` indicates the presence/absence of pattern `j` in sample `i`, and the clustering results `labels`, this function returns a list of sets, each containing the 1-itemsets that describe a cluster (after particularization).
- `sledge_score(X, labels, minimum_support=0)`: same inputs, returns a list containing the SLEDge score for each cluster.

Another optional function I can think of is something like `sledge_matrix`, which returns the values of S, L, E, and D for each cluster (that is, before aggregation with the harmonic mean). However, this functionality could be incorporated into `sledge_score` with some trigger.
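As a rough illustration of the "trigger" idea, the sketch below takes already-computed S, L, E, and D values per cluster and either returns them untouched or aggregates them with the harmonic mean. The function name `score_from_components`, the `aggregate` flag, and the input numbers are all assumptions made for the example; how S, L, E, and D are computed is left out entirely.

```python
from statistics import harmonic_mean


def score_from_components(components, aggregate=True):
    """components: one (S, L, E, D) tuple per cluster.

    aggregate=True  -> one harmonic-mean score per cluster.
    aggregate=False -> the raw (S, L, E, D) tuples (the 'sledge_matrix' view).
    """
    if not aggregate:
        return [tuple(c) for c in components]
    return [harmonic_mean(c) for c in components]


# Made-up component values for two clusters.
components = [(1.0, 0.5, 1.0, 0.5), (0.8, 0.4, 0.0, 0.6)]

scores = score_from_components(components)
matrix = score_from_components(components, aggregate=False)
```

Note that the harmonic mean is zero as soon as any component is zero, which is why the second cluster scores 0 here; that is one reason users may want access to the unaggregated values.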