Open qetdr opened 1 year ago
Thanks for the reproducible example @qetdr. Assigning @hoanganhngo610, who owns the cluster module.
Thank you so much for raising the issue @qetdr. Would you mind sharing some details about the data you're working with or, if possible, providing the full data set so that I can test it on my system?
Of course. The data set is the S1 set from the S-Sets from the Clustering basic benchmark (http://cs.uef.fi/sipu/datasets/). In the code I posted, you can also just use all of the data (set n_samples = 5000).
@qetdr Thank you so much! I will have a look at it as soon as I am available. Sorry for such a late response; I have recently been relocating back to Vietnam and am also currently on a business trip.
A temporary workaround is to skip the metric update while there are fewer than 2 clusters:
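from river import stream
from river.cluster import DBSTREAM
from river.metrics import Silhouette

# `s1` is the pandas DataFrame holding the S1 data set from the issue's example
# Taking a random sample for a smaller batch of the data
n_samples = 500
df_first_batch = s1.sample(n_samples).reset_index(drop=True)

clusterer = DBSTREAM()
metric = Silhouette()

for i, (x, _) in enumerate(stream.iter_pandas(df_first_batch)):
    # learn_one and update modify the objects in place, so no reassignment is needed
    clusterer.learn_one(x)
    y_pred = clusterer.predict_one(x)
    # Silhouette needs at least two cluster centers; skip the update until then
    if len(clusterer.centers) < 2:
        continue
    metric.update(x=x, y_pred=y_pred, centers=clusterer.centers)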
Hi @qetdr. First of all, I'm really sorry that it's been quite some time since this issue has been raised, and I totally forgot about this until now.
Regarding your issue: Silhouette is an internal metric that can only be calculated once at least two clusters have formed, since it needs the distance from the point to the closest cluster and the distance to the second closest cluster, and both must be $>0$.
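To make the two-cluster requirement concrete, here is a minimal sketch in plain Python. It is not river's actual implementation: the function name `two_nearest_center_distances` is hypothetical, and it only assumes that centers are stored as a dict mapping a cluster id to a feature dict, as `clusterer.centers` appears to be in the workaround above.

```python
import math


def two_nearest_center_distances(x, centers):
    """Return (closest, second_closest) Euclidean distances from x to the centers.

    Returns None when fewer than two centers exist, because the second
    closest distance is then undefined.
    """
    if len(centers) < 2:
        return None
    distances = sorted(
        math.sqrt(sum((x[k] - center[k]) ** 2 for k in x))
        for center in centers.values()
    )
    return distances[0], distances[1]


x = {"a": 0.0, "b": 0.0}
one_center = {0: {"a": 3.0, "b": 4.0}}
two_centers = {0: {"a": 3.0, "b": 4.0}, 1: {"a": 6.0, "b": 8.0}}

print(two_nearest_center_distances(x, one_center))   # None
print(two_nearest_center_distances(x, two_centers))  # (5.0, 10.0)
```

With a single center there is simply no second distance to look up, which is the situation the metric cannot handle.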
As such, what we usually do in this case is use the first few hundred (or even a thousand) samples as a burn-in, and only start calculating the metric after that. This prevents errors of this type from happening.
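A burn-in of this kind could be sketched roughly as follows. The helper name `evaluate_with_burn_in` and the default of 500 samples are assumptions made for illustration, not part of river. The helper is duck-typed: it only assumes the clusterer exposes `learn_one`, `predict_one` and `centers`, and that the metric's `update` accepts `x`, `y_pred` and `centers`, as in the snippet earlier in this thread.

```python
def evaluate_with_burn_in(samples, clusterer, metric, burn_in=500):
    """Train on every sample, but only score after the burn-in period.

    `burn_in` is a hypothetical parameter: the number of initial samples
    used for learning only. The two-center check is kept as an extra guard
    in case the burn-in was too short for a second cluster to form.
    """
    for i, x in enumerate(samples):
        clusterer.learn_one(x)
        # Predict before inspecting the centers, mirroring the order used
        # in the workaround posted in this thread.
        y_pred = clusterer.predict_one(x)
        if i < burn_in or len(clusterer.centers) < 2:
            continue
        metric.update(x=x, y_pred=y_pred, centers=clusterer.centers)
    return metric
```

With river, `samples` could be something like `(x for x, _ in stream.iter_pandas(df_first_batch))`, with `DBSTREAM()` and `Silhouette()` passed as the clusterer and the metric. A suitable burn-in length depends on the data and the clusterer's parameters.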
Regarding the issue itself, I'm afraid there is no way for us to fully resolve this. It can be bypassed with something similar to what @MaxHalford has done, or with the burn-in I described above.
I hope this answers your concerns. If anything persists, please do not hesitate to let me know!
Versions
river version: 0.15.0
Python version: 3.10.4
Operating system: macOS Ventura 13.2
Describe the bug
Getting an IndexError when running DBSTREAM and trying to update the Silhouette score (it does not seem to matter whether I change the parameter values or not; the included example uses the default parameter values).
Steps/code to reproduce
Example code:
The output: