scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Cluster comparison example point coloring issue #10874

Closed amueller closed 6 years ago

amueller commented 6 years ago

http://scikit-learn.org/dev/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

It looks to me like, in the third row, the DBSCAN outliers are assigned the same color as one of the clusters (orange). That seems pretty confusing and is likely an error.

maskani-moh commented 6 years ago

On that dataset, DBSCAN finds two clusters and some outliers: set(dbscan.labels_) = {-1, 0, 1}. To color the data, we define:

colors = np.array(list(islice(cycle([col1, col2, col3, ...]), int(max(y_pred) + 1))))

and then do:

plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred])

So here we have max(y_pred) = 1, which implies that colors has length 2.

Because the outliers are labeled -1, colors[y_pred] picks the last entry of colors for them (index -1 wraps around to the end of the array), so the outliers get the same color as one of the clusters found.

Using int(max(y_pred) + 2) when building colors adds one extra entry for the outliers, so index -1 lands on a color that no cluster uses. That solves the problem 👍
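A minimal sketch of the indexing behaviour described above, using only NumPy and itertools. The label array and the three color names are made-up placeholders, not the values used in the example script, and build_colors is a hypothetical helper introduced only for this illustration.

```python
from itertools import cycle, islice

import numpy as np

# Hypothetical stand-ins: DBSCAN-style labels with -1 marking outliers.
y_pred = np.array([0, 0, 1, 1, -1])
base_colors = ["blue", "orange", "green"]  # placeholder palette


def build_colors(y_pred, extra):
    """Return an array with max(y_pred) + extra colors taken from the palette."""
    n = int(max(y_pred) + extra)
    return np.array(list(islice(cycle(base_colors), n)))


# Original behaviour: 2 entries, so index -1 wraps to the last cluster color.
buggy = build_colors(y_pred, 1)[y_pred]

# Proposed fix: 3 entries, so index -1 lands on a color no cluster uses.
fixed = build_colors(y_pred, 2)[y_pred]
```

Note that this only separates the outliers while the palette has more entries than there are clusters; once cycle wraps around, the extra slot would reuse an earlier color.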

jmloyola commented 6 years ago

Another way to do the same thing, which I personally like a bit more, is this:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

palette = sns.color_palette('deep', np.unique(y_pred).max() + 1)
colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in y_pred]

plt.scatter(X[:, 0], X[:, 1], color=colors, s=10)
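For anyone who would rather not depend on seaborn, the same idea can be sketched with NumPy alone. The RGB values in PALETTE are arbitrary placeholders, and label_colors is a hypothetical helper, not something from scikit-learn or the example script.

```python
import numpy as np

# Placeholder RGB palette; any list of RGB tuples would do.
PALETTE = np.array([(0.30, 0.45, 0.69), (0.87, 0.52, 0.32), (0.33, 0.66, 0.41)])
NOISE_COLOR = (0.0, 0.0, 0.0)  # black for points labelled -1


def label_colors(y_pred, palette=PALETTE, noise_color=NOISE_COLOR):
    """Map cluster labels to RGB rows, giving negative labels a fixed color."""
    y_pred = np.asarray(y_pred)
    # Wrap with modulo so more clusters than palette entries does not raise.
    colors = palette[y_pred % len(palette)].astype(float)
    # Overwrite the rows of noise points with the dedicated noise color.
    colors[y_pred < 0] = noise_color
    return colors


colors = label_colors([0, 1, -1, 0])
```

The resulting array has one RGB row per point and could be passed to plt.scatter through its color argument, in the same way as the list built in the snippet above.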

HDBSCAN - Comparing Python Clustering Algorithms