Closed amueller closed 6 years ago
On that dataset, DBSCAN finds two clusters and some outliers: set(dbscan.labels_) = {-1, 0, 1}.
To color the data, we define:
colors = np.array(list(islice(cycle([col1, col2, col3, ...]), int(max(y_pred) + 1))))
and then do: plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred])
So here we have max(y_pred) = 1, which implies that colors has length 2.
Since the outliers are labeled -1, colors[y_pred] takes the last color in the array colors for them, and hence the outliers get the same color as one of the clusters found.
Using int(max(y_pred) + 2) when building colors adds one extra entry at the end of the array, which only the outliers (index -1) pick up, and that solves the problem 👍
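The wraparound described above can be reproduced without any plotting. This is a minimal sketch, not the example's actual code: the labels, the palette values, and the helper name build_colors are all assumptions made for illustration.

```python
from itertools import cycle, islice

import numpy as np

# Hypothetical labels standing in for dbscan.labels_:
# two clusters (0 and 1) plus outliers (-1).
y_pred = np.array([0, 0, 1, 1, -1, -1])

# Placeholder palette; the color values are illustrative assumptions.
palette = ["#377eb8", "#ff7f00", "#4daf4a", "#f781bf"]


def build_colors(y_pred, extra):
    """Take the first max(y_pred) + extra colors, cycling through the palette."""
    return np.array(list(islice(cycle(palette), int(max(y_pred) + extra))))


# Original sizing: 2 colors, so index -1 wraps around to the color of cluster 1.
buggy = build_colors(y_pred, 1)
# Adjusted sizing: 3 colors, so index -1 lands on an entry no cluster uses.
fixed = build_colors(y_pred, 2)

buggy_point_colors = buggy[y_pred]
fixed_point_colors = fixed[y_pred]
```

With the original sizing, the last two points share a color with cluster 1. With the extra entry, they get a color of their own.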
Another way of doing the same that I personally liked a bit more was to use this:
palette = sns.color_palette('deep', np.unique(y_pred).max() + 1)
colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in y_pred]
plt.scatter(X[:, 0], X[:, 1], color=colors, s=10)
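The same idea can be written as a small helper that does not depend on seaborn. This is only a sketch: the name label_colors, the two-color palette, and the modulo wraparound for labels beyond the palette size are assumptions, not part of the snippet above.

```python
def label_colors(labels, palette, outlier_color=(0.0, 0.0, 0.0)):
    """Map non-negative labels to palette entries and negative labels to outlier_color."""
    return [
        # The modulo reuses palette entries if there are more clusters than colors.
        palette[int(x) % len(palette)] if x >= 0 else outlier_color
        for x in labels
    ]


# Hypothetical RGB palette and labels, for illustration only.
palette = [(0.30, 0.45, 0.69), (0.87, 0.52, 0.32)]
point_colors = label_colors([0, 1, -1, 0], palette)
```

Because outliers are handled explicitly, the result does not depend on how negative indices behave, and the list can be passed to plt.scatter through its color argument.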
http://scikit-learn.org/dev/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py
It looks to me like, in the third row, the DBSCAN outliers are assigned the same color as one of the clusters (orange). That seems pretty confusing and is likely an error.