scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

silhouette_score returning NaN #960

Closed aflag closed 12 years ago

aflag commented 12 years ago

Hello, the following code makes silhouette_score return NaN. That looks like a bug to me. If it's impossible to compute a score, I would expect silhouette_score to raise a meaningful exception instead of returning NaN. The following code prints -nan.

import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

data = np.array([
    [ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0., 1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1., 0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,],
    [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0., 0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0., 2.,  0.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  2.,  0.,  0.,  0.,],
    [ 0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1., 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  0., 0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,],
    [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0., 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  2.,  0.,  0.,  1.,],
    [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0., 0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,]
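], dtype=float)

# The cluster-count parameter was spelled `k` when this was reported; it is `n_clusters` in current scikit-learn.
kmeans = KMeans(init='k-means++', n_clusters=2)
kmeans.fit(data)

print('%f' % metrics.silhouette_score(data, kmeans.labels_, metric='euclidean'))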

The first matrix I tried came from a bigger sample of 14k items (with many more features per sample). I reduced the number of samples to 5 and k to 2 so that it's easier to test. On the original sample, silhouette_score would fail whenever k was greater than 80. In that test it would print nan (instead of -nan).
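One plausible way to end up with NaN here, which this thread does not confirm as the actual cause, is a cluster that contains a single sample: the mean distance from that sample to the "other members" of its own cluster is then 0/0. The sketch below is a naive NumPy-only silhouette computation, not scikit-learn's implementation. The helper `naive_silhouette` and its `singleton_score` parameter are hypothetical names introduced for illustration, and the sketch assumes at least two distinct labels.

```python
import numpy as np


def naive_silhouette(X, labels, singleton_score=None):
    """Mean silhouette coefficient, written naively (hypothetical helper).

    If singleton_score is None, samples in one-element clusters are not
    special-cased, which lets NaN leak into the result.
    Assumes at least two distinct labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any() and singleton_score is not None:
            # Guarded path: assign a fixed score to singleton clusters.
            scores[i] = singleton_score
            continue
        with np.errstate(invalid="ignore"):
            # a: mean distance to the other members of the sample's cluster.
            # For a singleton cluster this is 0.0 / 0, which is NaN.
            a = dist[i, same].sum() / np.float64(same.sum())
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            dist[i, labels == c].mean()
            for c in np.unique(labels)
            if c != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Toy data (made up for this sketch): three nearby points and one outlier
# that ends up alone in its cluster.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]]
labels = [0, 0, 0, 1]

print(naive_silhouette(X, labels))                       # nan
print(naive_silhouette(X, labels, singleton_score=0.0))  # finite value
```

Passing `singleton_score=0.0` follows the common convention of scoring a lone sample as 0. A caller who prefers an exception over a silent NaN could instead check the result with `np.isnan` and raise `ValueError`.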

ogrisel commented 12 years ago

Thanks for the report @aflag.

@robertlayton can you please have a look?

robertlayton commented 12 years ago

Will do.

FYI, this code does indeed produce nan on my system as well. Thanks @aflag!

robertlayton commented 12 years ago

The fix is ready to merge.

amueller commented 12 years ago

Fixed by @robertlayton in #962. Thanks :)