Cluster comparison example point coloring issue

amueller commented 6 years ago

http://scikit-learn.org/dev/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

It looks to me like, in the third row, the DBSCAN outliers are assigned the same color as one of the clusters (orange). That seems pretty confusing and is likely an error.

maskani-moh commented 6 years ago

On that dataset, DBSCAN finds two clusters and some outliers: set(dbscan.labels_) = {-1, 0, 1}. To color the data, we define:

colors = np.array(list(islice(cycle([col1, col2, col3, ...]), int(max(y_pred) + 1))))

and then do:

plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred])

So here we have max(y_pred) = 1, which implies that colors has length 2.

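Because the outliers are labeled -1, and a negative index counts from the end of the array, colors[y_pred] picks the last color in colors for them. With only two entries, that last color is also the color of cluster 1, so the outliers are drawn in the same color as one of the clusters found.

Using int(max(y_pred) + 2) as the length of colors, so that there is an extra entry for the outliers, solves the problem 👍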
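
To make the indexing behavior concrete, here is a minimal sketch of the two sizings side by side. The palette and labels are hypothetical stand-ins chosen for illustration, not the values used in the example script, and build_colors is a helper name invented here.

```python
from itertools import cycle, islice

import numpy as np

# Hypothetical stand-ins: a three-color palette and DBSCAN-style labels
# (two clusters plus noise labeled -1).
palette = ["#377eb8", "#ff7f00", "#4daf4a"]
y_pred = np.array([0, 0, 1, 1, -1, -1])


def build_colors(y_pred, extra):
    """Cycle the palette to max(y_pred) + extra entries."""
    return np.array(list(islice(cycle(palette), int(max(y_pred) + extra))))


# Original sizing: two entries, so index -1 wraps to the last cluster's color.
buggy = build_colors(y_pred, 1)
# Proposed sizing: three entries, so index -1 selects a color of its own.
fixed = build_colors(y_pred, 2)

noise = y_pred == -1
cluster_1 = y_pred == 1
print(buggy[y_pred][noise][0] == buggy[y_pred][cluster_1][0])  # True
print(fixed[y_pred][noise][0] == fixed[y_pred][cluster_1][0])  # False
```

One caveat with this approach: because the palette is cycled, the extra entry only stays distinct while the number of clusters plus one does not exceed the number of colors in the palette. Beyond that, the cycle wraps and the outliers share a color with a cluster again.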

jmloyola commented 6 years ago

Another way of doing the same thing, which I personally like a bit more, is this:

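import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# One palette color per cluster; outliers (label -1) are drawn in black.
palette = sns.color_palette('deep', np.unique(y_pred).max() + 1)
colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in y_pred]
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10)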
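
For anyone who wants the same explicit outlier color without the seaborn dependency, a possible NumPy-only sketch is to append black as the last entry of the color array, so that label -1 lands on it by construction. The colors and labels below are hypothetical, for illustration only.

```python
import numpy as np

# Hypothetical cluster colors and labels, for illustration only.
cluster_colors = np.array(["#377eb8", "#ff7f00"])
y_pred = np.array([0, 1, -1, 1, -1])

# Append black as the final entry so that label -1 indexes it.
colors = np.append(cluster_colors, "#000000")
point_colors = colors[y_pred]
```

Here point_colors can be passed to plt.scatter through the color argument, in the same way as colors[y_pred] in the earlier comment.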

Cluster comparison example point coloring issue #10874