rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] cosine distance is broken on DBSCAN #4938

Open royinx opened 1 year ago

royinx commented 1 year ago

Describe the bug

DBSCAN with cosine distance does not perform any clustering: the line print(DBSCAN_cosine.core_sample_indices_) returns []. Please reproduce the issue with the following code.


Also, thanks to #212 for giving me the idea of how to debug this. Since Euclidean distance is used currently, cosine distance can be supported today by normalizing your vectors to unit norm. In the meantime, we can certainly work to add cosine & L1 distance.

However, I don't know which eps is best when unit-norm vectors are used. After several attempts, eps=0.7 for Euclidean seems to give the result closest to DBSCAN_cosine. Can anyone tell me the exact value of eps when converting from cosine distance to Euclidean distance?
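For reference, here is a minimal sketch of the conversion asked about above. It assumes cosine distance is defined as 1 - cosine similarity (the scikit-learn convention) and that every row has a non-zero norm. For unit vectors a and b, ||a - b||^2 = 2 - 2*cos(a, b) = 2 * d_cos, so a cosine radius eps_cos corresponds to a Euclidean radius of sqrt(2 * eps_cos). The helper name cosine_eps_to_euclidean_eps is hypothetical and not part of cuML or scikit-learn.

```python
import numpy as np

def cosine_eps_to_euclidean_eps(eps_cosine: float) -> float:
    """Hypothetical helper: Euclidean radius on unit-norm vectors that
    matches a given cosine-distance radius (d_cos = 1 - cos similarity)."""
    return float(np.sqrt(2.0 * eps_cosine))

# Numerical check of ||a - b||^2 == 2 * d_cos on random unit vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 64))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # normalize rows to unit norm

cos_dist = 1.0 - x @ x.T
sq_euclid = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
identity_holds = bool(np.allclose(sq_euclid, 2.0 * cos_dist))

eps_euclid = cosine_eps_to_euclidean_eps(0.25)
print(identity_holds, round(eps_euclid, 4))
```

Under these assumptions, eps=0.25 for cosine maps to roughly 0.7071 for Euclidean, which is consistent with eps=0.7 being the closest match found by trial. Rows that are all zeros (set to 0 after normalization in the repro code) fall outside this identity.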


Steps/Code to reproduce bug

Download the testing data data.npy: data_(84,512)_3clusters.npy

from sklearn import cluster
import numpy as np
import cuml
import cupy as cp

def main():
    # init data
    arr= np.load("data.npy")
    arr_norm = arr/np.linalg.norm(arr, axis=1, keepdims=True)
    arr_norm[np.isnan(arr_norm)] = 0

    # =================================== DBSCAN ===================================
    DBSCAN_cosine = cluster.DBSCAN(min_samples=5, eps=0.25, metric="cosine").fit(arr)
    labels = DBSCAN_cosine.labels_.tolist()
    print("sklean \t DBSCAN \t cosine \t", labels)

    # =================================== Optics ===================================
    optics_cosine = cluster.OPTICS(min_samples=5, max_eps=0.25, metric="cosine", cluster_method="dbscan").fit(arr)
    labels = optics_cosine.labels_.tolist()
    print("sklean \t optics \t cosine \t", labels)

    optics_euclidean = cluster.OPTICS(min_samples=5, max_eps=0.7, metric="euclidean", cluster_method="dbscan").fit(arr_norm)
    labels = optics_euclidean.labels_.tolist()
    print("sklean \t optics \t euclidean \t", labels)

    # =================================== cuML ===================================

    arr_norm = cp.array(arr_norm)
    DBSCAN_euclidean = cuml.DBSCAN(eps=0.7, min_samples=5, metric="euclidean", output_type="cupy")
    DBSCAN_euclidean.fit(arr_norm)
    labels_ = DBSCAN_euclidean.labels_
    labels_ = list(cp.asnumpy(labels_))
    print("cuML \t DBSCAN \t euclidean \t", labels_)

    arr = cp.asarray(arr)
    DBSCAN_cosine = cuml.DBSCAN(eps=0.25, min_samples=5, metric="cosine", output_type="cupy")
    DBSCAN_cosine.fit(arr)
    labels_ = DBSCAN_cosine.labels_
    labels_ = list(cp.asnumpy(labels_))
    print("cuML \t DBSCAN \t cosine \t", labels_)

    print(DBSCAN_euclidean.core_sample_indices_)
    print(DBSCAN_cosine.core_sample_indices_)

if __name__ == "__main__":
    main()

georgeliu95 commented 1 year ago

Hey @royinx, I think this issue is caused by #5360. I reproduced it as you described:

sklean   DBSCAN      cosine      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sklean   optics      cosine      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sklean   optics      euclidean   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cuML     DBSCAN      euclidean   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cuML     DBSCAN      cosine      [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 20 21 22 23 24 25 26
 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51 52 53 54 55 56 57 58 59]
[]

After fixing it, you will get:

sklean   DBSCAN      cosine      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sklean   optics      cosine      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sklean   optics      euclidean   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cuML     DBSCAN      euclidean   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cuML     DBSCAN      cosine      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 20 21 22 23 24 25 26
 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51 52 53 54 55 56 57 58 59]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 20 21 22 23 24 25 26
 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51 52 53 54 55 56 57 58 59]