scikit-learn-contrib / scikit-learn-extra

scikit-learn contrib estimators
https://scikit-learn-extra.readthedocs.io
BSD 3-Clause "New" or "Revised" License

CLARA - wrong medoid_indices_ ? #126

Closed mglowacki100 closed 2 years ago

mglowacki100 commented 2 years ago

For KMedoids, medoid_indices_ works fine: every medoid is assigned to a separate cluster. With CLARA, however, all medoids end up assigned to a single cluster. In the example below, see z: every medoid is assigned to cluster 1 (the medoid indices I get are 3, 7, 8, 42, 43).

To reproduce the issue, I use the following input, saved as a CSV file:

https://gist.github.com/netj/8836201

with code:

import pandas as pd
import numpy as np
from sklearn_extra.cluster import CLARA, KMedoids

df = pd.read_csv('iris.csv')
df = df.drop(columns='variety')

mdl = CLARA(n_clusters=5, random_state=42)
#mdl = KMedoids(n_clusters=5, random_state=42)

mdl.fit(df)

df['cluster'] = mdl.labels_
df['medoid'] = np.where(df.index.isin(mdl.medoid_indices_),1, 0)

z = df.loc[df['medoid'] == 1, 'cluster']

Versions: sklearn_extra 0.2.0, pandas 1.3.1, numpy 1.22.1.

The problem also occurs on a much larger dataset (90k rows) that I can't share, and for which KMedoids is too slow.

TimotheeMathieu commented 2 years ago

Thanks for reporting this. Yes, this is a bug. CLARA uses sub-sampling, and the medoid_indices_ it returns are indices into the sub-sample, not into the whole dataset. I will make a PR to correct this.
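
To illustrate the mismatch described above, here is a minimal NumPy-only sketch. It is not the library's code or the eventual fix. The helper names `to_global_indices` and `recover_medoid_indices`, the random data, and the medoid positions are all hypothetical. The first helper shows how positions within a sub-sample relate to rows of the full dataset. The second shows a possible workaround: match each medoid's coordinates back to the nearest row of the full data.

```python
import numpy as np


def to_global_indices(sample_idx, local_medoid_idx):
    """Map medoid positions within a sub-sample to row indices of the full data."""
    return np.asarray(sample_idx)[np.asarray(local_medoid_idx)]


def recover_medoid_indices(X, centers):
    """For each center, return the index of the closest row of X.

    If a center is an actual row of X, its distance is zero, so that row
    (or one of its exact duplicates) is returned.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_centers, n_samples).
    d2 = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))  # stand-in for the full dataset
sample_idx = rng.choice(len(X), size=40, replace=False)  # rows drawn into the sub-sample
X_sub = X[sample_idx]

local_medoids = np.array([3, 7, 8])  # hypothetical positions within X_sub
global_medoids = to_global_indices(sample_idx, local_medoids)

# Workaround: recover the same rows from the medoid coordinates alone.
recovered = recover_medoid_indices(X, X_sub[local_medoids])
```

With a fitted model, something like `recover_medoid_indices(features, mdl.cluster_centers_)` might serve as a stopgap. This assumes that `cluster_centers_` holds actual rows of the training data and that `features` contains only the columns passed to `fit`. If the data contains duplicate rows, any of the duplicates may be returned.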