Open mikhailsirenko opened 1 year ago
Can you give a slightly more elaborate example showing both the current and the desired behavior?
E.g.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score, silhouette_score


def plot_score(data: pd.DataFrame, metric: str, linkage: str, max_clusters: int, score: str = 'silhouette'):
    """Plot clustering performance score for different numbers of clusters.

    Args:
        data (pd.DataFrame): Data to cluster.
        metric (str): Metric to use for clustering.
        linkage (str): Linkage method to use for clustering.
        max_clusters (int): Maximum number of clusters to try.
        score (str, optional): Score to use. Defaults to 'silhouette'.

    Raises:
        ValueError: If the score is unknown.

    Returns:
        None
    """
    # Only silhouette_score accepts a `metric` argument; the other two scores do not.
    score_kwargs = {}
    if score == 'silhouette':
        score_function = silhouette_score
        score_kwargs = {'metric': metric}
    elif score == 'calinski_harabasz':
        score_function = calinski_harabasz_score
    elif score == 'davies_bouldin':
        score_function = davies_bouldin_score
    else:
        raise ValueError(f'Unknown score: {score}')

    scores = {}
    for i in range(2, max_clusters + 1):
        # `clusterer` is defined elsewhere (not shown in this snippet).
        labels = clusterer.apply_agglomerative_clustering(data, i, metric=metric, linkage=linkage)
        scores[i] = score_function(data, labels, **score_kwargs)

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(list(scores.keys()), list(scores.values()))
    ax.set_xlabel('Number of clusters')
    ax.set_ylabel(f'{score.capitalize()} score')
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
Maybe it would be a good addition to allow printing/plotting performance metrics, e.g. silhouette_score or any other that is more applicable in the case of AgglomerativeClustering. Or do you prefer to keep those apart?
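For the "printing" half of this request, here is a minimal self-contained sketch that relies only on scikit-learn. It is not a proposal for the final API: the helper name silhouette_by_n_clusters and the synthetic blobs are hypothetical, and it assumes Ward linkage with the default Euclidean distance.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_by_n_clusters(data, max_clusters, linkage="ward"):
    """Return {n_clusters: silhouette score} for n_clusters in 2..max_clusters.

    Hypothetical helper for illustration; uses Euclidean distance throughout.
    """
    scores = {}
    for n_clusters in range(2, max_clusters + 1):
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
        labels = model.fit_predict(data)
        scores[n_clusters] = silhouette_score(data, labels)
    return scores


# Synthetic data with three well-separated blobs (illustration only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = silhouette_by_n_clusters(X, max_clusters=6)
for n_clusters, value in scores.items():
    print(f"{n_clusters} clusters: silhouette = {value:.3f}")
```

Returning a plain dict keeps the computation separate from the presentation, so the same result could be printed as above or handed to a plotting function.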