online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

IndexError: list index out of range with DBSTREAM when updating the metric (Silhouette) #1186

Open qetdr opened 1 year ago

qetdr commented 1 year ago

Versions

river version: 0.15.0
Python version: 3.10.4
Operating system: macOS Ventura 13.2

Describe the bug

I get an IndexError when running DBSTREAM and updating the Silhouette score. Changing the parameter values does not seem to matter; the example below uses the defaults.

Steps/code to reproduce

Example code:

import pandas as pd
from river.cluster import DBSTREAM
from river import stream
from river.metrics import Silhouette

# Import the data
s1 = pd.read_table('http://cs.uef.fi/sipu/datasets/s1.txt', 
                   sep = r"\s+",  # raw string avoids an invalid-escape warning
                   names = ['x1', 'x2']).sample(5000, random_state = 42).reset_index(drop = True)

# Taking a random sample for a smaller batch of the data
n_samples = 500
df_first_batch = s1.sample(n_samples).reset_index(drop = True)

clusterer = DBSTREAM()
metric = Silhouette()

for x, _ in stream.iter_pandas(df_first_batch):
    clusterer = clusterer.learn_one(x)
    y_pred = clusterer.predict_one(x)
    metric = metric.update(x = x, 
                           y_pred = y_pred, 
                           centers = clusterer.centers)

The output:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[1], line 21
     19 clusterer = clusterer.learn_one(x)
     20 y_pred = clusterer.predict_one(x)
---> 21 metric = metric.update(x = x, 
     22                        y_pred = y_pred, 
     23                        centers = clusterer.centers)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/river/metrics/silhouette.py:74, in Silhouette.update(self, x, y_pred, centers, sample_weight)
     71 distance_closest_centroid = math.sqrt(utils.math.minkowski_distance(centers[y_pred], x, 2))
     72 self._sum_distance_closest_centroid += distance_closest_centroid
---> 74 distance_second_closest_centroid = self._find_distance_second_closest_center(centers, x)
     75 self._sum_distance_second_closest_centroid += distance_second_closest_centroid
     77 return self

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/river/metrics/silhouette.py:67, in Silhouette._find_distance_second_closest_center(centers, x)
     64 @staticmethod
     65 def _find_distance_second_closest_center(centers, x):
     66     distances = {i: math.sqrt(utils.math.minkowski_distance(centers[i], x, 2)) for i in centers}
---> 67     return sorted(distances.values())[-2]

IndexError: list index out of range

MaxHalford commented 1 year ago

Thanks for the reproducible example @qetdr. Assigning @hoanganhngo610 who owns the cluster module.

hoanganhngo610 commented 1 year ago

Thank you so much for raising the issue @qetdr. Would you mind sharing some details about the data you're working with or, if possible, the data itself, so that I can test it on my system?

qetdr commented 1 year ago

Of course. The data set is the S1 set from the S-Sets from the Clustering basic benchmark (http://cs.uef.fi/sipu/datasets/). In the code I posted, you can also just use all of the data (set n_samples = 5000).

hoanganhngo610 commented 1 year ago

@qetdr Thank you so much! I will have a look at it as soon as I am available. Sorry for the late response; I have recently relocated back to Vietnam and am currently on a business trip.

MaxHalford commented 1 year ago

A temporary workaround is to skip the metric update while there are fewer than two clusters:

from river import stream
from river.cluster import DBSTREAM
from river.metrics import Silhouette

# s1 is the DataFrame loaded in the original example above.
# Taking a random sample for a smaller batch of the data
n_samples = 500
df_first_batch = s1.sample(n_samples).reset_index(drop = True)

clusterer = DBSTREAM()
metric = Silhouette()

for x, _ in stream.iter_pandas(df_first_batch):
    clusterer = clusterer.learn_one(x)
    y_pred = clusterer.predict_one(x)
    # Silhouette needs at least two cluster centers; skip until they exist
    if len(clusterer.centers) < 2:
        continue
    metric = metric.update(x = x, 
                           y_pred = y_pred, 
                           centers = clusterer.centers)

hoanganhngo610 commented 9 months ago

Hi @qetdr. First of all, I'm really sorry that so much time has passed since this issue was raised; I completely forgot about it until now.

Regarding your issue, Silhouette is an internal metric that can only be computed once at least two clusters have formed, because it relies on both the distance to the closest cluster center and the distance to the second-closest one, and both must be greater than 0.
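To make the failure mode concrete, here is a minimal sketch that reuses only the indexing pattern from the line flagged in the traceback (`sorted(distances.values())[-2]`). It does not import river; the helper names `euclidean` and `second_entry_from_end` and the sample points are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points stored as {feature: value} dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def second_entry_from_end(centers, x):
    """Hypothetical stand-in using the same indexing as the traceback line."""
    distances = {i: euclidean(centers[i], x) for i in centers}
    return sorted(distances.values())[-2]


x = {"x1": 1.0, "x2": 2.0}
one_center = {0: {"x1": 0.0, "x2": 0.0}}
two_centers = {0: {"x1": 0.0, "x2": 0.0}, 1: {"x1": 1.0, "x2": 5.0}}

try:
    second_entry_from_end(one_center, x)
    failed = False
except IndexError:
    # A one-element list has no element at index -2.
    failed = True

# With two centers the lookup succeeds.
value_with_two = second_entry_from_end(two_centers, x)
```

With a single center the sorted list has one element, so index `-2` does not exist, which is the same `IndexError` as in the report.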

For this reason, we usually treat the first few hundred (or even a thousand) samples as a burn-in and only start computing the metric afterwards. This prevents errors of this type.
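One way to sketch the burn-in idea is a small wrapper that skips updates until enough samples have been seen and at least two centers exist. The class name `GuardedClusterMetric`, its parameters, and the `RecordingMetric` stub are hypothetical and not part of river; the wrapper only assumes the wrapped object exposes `update(x, y_pred, centers)`, as in the signature shown in the traceback.

```python
class GuardedClusterMetric:
    """Hypothetical wrapper: forwards update() only after a burn-in period
    and only when enough cluster centers exist."""

    def __init__(self, metric, burn_in=200, min_centers=2):
        self.metric = metric
        self.burn_in = burn_in
        self.min_centers = min_centers
        self.n_seen = 0
        self.n_skipped = 0

    def update(self, x, y_pred, centers):
        self.n_seen += 1
        # Skip during burn-in or while too few centers have formed.
        if self.n_seen <= self.burn_in or len(centers) < self.min_centers:
            self.n_skipped += 1
            return self
        self.metric.update(x=x, y_pred=y_pred, centers=centers)
        return self


class RecordingMetric:
    """Stub standing in for a real metric; it only counts forwarded calls."""

    def __init__(self):
        self.calls = 0

    def update(self, x, y_pred, centers):
        self.calls += 1
        return self


inner = RecordingMetric()
guarded = GuardedClusterMetric(inner, burn_in=2)

one = {0: {"x1": 0.0}}
two = {0: {"x1": 0.0}, 1: {"x1": 3.0}}

# Updates 1-2 fall in the burn-in, update 3 has one center, 4-5 are forwarded.
for centers in (one, one, one, two, two):
    guarded.update(x={"x1": 1.0}, y_pred=0, centers=centers)
```

In the loop from the report, the stub would be replaced by `Silhouette()` and the call would become `guarded.update(x=x, y_pred=y_pred, centers=clusterer.centers)`. The burn-in length is a tuning choice; the center-count check is kept as well because a fixed burn-in alone does not guarantee that two clusters have formed.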

As for the issue itself, I'm afraid we cannot fully resolve it. You can work around it with @MaxHalford's approach or with the burn-in described above.

I hope this answers your question. If any problem persists, please do not hesitate to let me know!