rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] DBScan output differs for cosine metric if input vectors are normalized #4930

Open robertclancy opened 2 years ago

robertclancy commented 2 years ago

Describe the bug DBSCAN produces different results with the cosine metric depending on whether the input vectors are normalized.

Steps/Code to reproduce bug

import numpy as np
from cuml.cluster import DBSCAN

x = np.array([[1, 0],[2, 0],[3, 0],[4, 0],[5, 0]]).astype('float64')
normalized_x = x / np.linalg.norm(x, axis=1, keepdims=True)

labels = DBSCAN(eps=1, metric='cosine').fit(x).labels_
normalized_labels = DBSCAN(eps=1, metric='cosine').fit(normalized_x).labels_

assert np.all(labels == normalized_labels)

Expected behavior I expect that the output should not depend on normalization since the cosine similarity does not depend on the length of the two vectors.
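As a quick CPU-side check of that expectation, the sketch below computes pairwise cosine distances with plain NumPy for both arrays from the reproduction snippet. The helper `cosine_distance_matrix` is a hypothetical name introduced only for illustration; it is not part of cuML.

```python
import numpy as np


def cosine_distance_matrix(a):
    """Pairwise cosine distance, 1 - (a_i . a_j) / (|a_i| |a_j|), between rows of a."""
    dots = a @ a.T
    norms = np.linalg.norm(a, axis=1)
    return 1.0 - dots / np.outer(norms, norms)


x = np.array([[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]], dtype=np.float64)
normalized_x = x / np.linalg.norm(x, axis=1, keepdims=True)

d_raw = cosine_distance_matrix(x)
d_unit = cosine_distance_matrix(normalized_x)

# Scaling a row does not change its direction, so both matrices should agree.
print(np.allclose(d_raw, d_unit))
```

Because the two distance matrices agree, any DBSCAN run that uses a true cosine distance should see identical neighbourhoods for `x` and `normalized_x`.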

Environment details (please complete the following information):


     ***git***
     Not inside a git repository

     ***OS Information***
     NAME="Amazon Linux"
     VERSION="2"
     ID="amzn"
     ID_LIKE="centos rhel fedora"
     VERSION_ID="2"
     PRETTY_NAME="Amazon Linux 2"
     ANSI_COLOR="0;33"
     CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
     HOME_URL="https://amazonlinux.com/"
     Amazon Linux release 2 (Karoo)
     Linux ip-172-16-6-159.ec2.internal 5.10.102-99.473.amzn2.x86_64 #1 SMP Wed Mar 2 19:14:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

     ***GPU Information***
     Fri Oct 14 12:12:36 2022
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |                               |                      |               MIG M. |
     |===============================+======================+======================|
     |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
     | N/A   29C    P0    24W /  70W |   1513MiB / 15360MiB |      0%      Default |
     |                               |                      |                  N/A |
     +-------------------------------+----------------------+----------------------+

     +-----------------------------------------------------------------------------+
     | Processes:                                                                  |
     |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
     |        ID   ID                                                   Usage      |
     |=============================================================================|
     |  No running processes found                                                 |
     +-----------------------------------------------------------------------------+

     ***CPU***
     Architecture:        x86_64
     CPU op-mode(s):      32-bit, 64-bit
     Byte Order:          Little Endian
     CPU(s):              8
     On-line CPU(s) list: 0-7
     Thread(s) per core:  2
     Core(s) per socket:  4
     Socket(s):           1
     NUMA node(s):        1
     Vendor ID:           GenuineIntel
     CPU family:          6
     Model:               85
     Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
     Stepping:            7
     CPU MHz:             3100.103
     BogoMIPS:            4999.99
     Hypervisor vendor:   KVM
     Virtualization type: full
     L1d cache:           32K
     L1i cache:           32K
     L2 cache:            1024K
     L3 cache:            36608K
     NUMA node0 CPU(s):   0-7
     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

     ***CMake***
     /usr/local/bin/cmake
     cmake version 3.22.3

     CMake suite maintained and supported by Kitware (kitware.com/cmake).

     ***g++***
     /usr/bin/g++
     g++ (GCC) 7.3.1 20180712 (Red Hat 7.3.1-15)
     Copyright (C) 2017 Free Software Foundation, Inc.
     This is free software; see the source for copying conditions.  There is NO
     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

     ***nvcc***
which: no nvcc in (/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/home/ec2-user/anaconda3/condabin:/home/ec2-user/.dl_binaries/bin:/opt/aws/neuron/bin:/usr/libexec/gcc/x86_64-redhat-linux/7:/opt/aws/bin:/home/ec2-user/SageMaker/persisted_conda_envs/rapids/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/home/ec2-user/anaconda3/condabin:/home/ec2-user/.dl_binaries/bin:/opt/aws/neuron/bin:/usr/libexec/gcc/x86_64-redhat-linux/7:/opt/aws/bin:/home/ec2-user/anaconda3/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin)

     ***Python***
     /home/ec2-user/SageMaker/persisted_conda_envs/rapids/bin/python
     Python 3.8.13

     ***Environment Variables***
     PATH                            : /opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/home/ec2-user/anaconda3/condabin:/home/ec2-user/.dl_binaries/bin:/opt/aws/neuron/bin:/usr/libexec/gcc/x86_64-redhat-linux/7:/opt/aws/bin:/home/ec2-user/SageMaker/persisted_conda_envs/rapids/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/home/ec2-user/anaconda3/condabin:/home/ec2-user/.dl_binaries/bin:/opt/aws/neuron/bin:/usr/libexec/gcc/x86_64-redhat-linux/7:/opt/aws/bin:/home/ec2-user/anaconda3/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ec2-user/.local/bin:/home/ec2-user/bin
     LD_LIBRARY_PATH                 : /opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/local/lib:/usr/lib:/lib:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/local/lib:/usr/lib:/lib:
     NUMBAPRO_NVVM                   :
     NUMBAPRO_LIBDEVICE              :
     CONDA_PREFIX                    : /home/ec2-user/SageMaker/persisted_conda_envs/rapids
     PYTHON_PATH                     :

     ***conda packages***
     /home/ec2-user/anaconda3/condabin/conda
     # packages in environment at /home/ec2-user/SageMaker/persisted_conda_envs/rapids:
     #
     # Name                    Version                   Build  Channel
     _libgcc_mutex             0.1                 conda_forge    conda-forge
     _openmp_mutex             4.5                       2_gnu    conda-forge
     arrow-cpp                 8.0.1           py38h998ac4b_2_cpu    conda-forge
     aws-c-cal                 0.5.11               h95a6274_0    conda-forge
     aws-c-common              0.6.2                h7f98852_0    conda-forge
     aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
     aws-c-io                  0.10.5               hfb6a706_0    conda-forge
     aws-checksums             0.1.11               ha31a3da_7    conda-forge
     aws-sdk-cpp               1.8.186              hb4091e7_3    conda-forge
     bokeh                     2.4.3              pyhd8ed1ab_3    conda-forge
     brotlipy                  0.7.0           py38h0a891b7_1004    conda-forge
     bzip2                     1.0.8                h7f98852_4    conda-forge
     c-ares                    1.18.1               h7f98852_0    conda-forge
     ca-certificates           2022.9.24            ha878542_0    conda-forge
     cachetools                5.2.0              pyhd8ed1ab_0    conda-forge
     certifi                   2022.9.24          pyhd8ed1ab_0    conda-forge
     cffi                      1.15.1           py38h4a40e3a_0    conda-forge
     click                     8.1.3            py38h578d9bd_0    conda-forge
     cloudpickle               2.2.0              pyhd8ed1ab_0    conda-forge
     cryptography              38.0.2           py38h2b5fc30_0    conda-forge
     cuda-python               11.7.0           py38h3fd9d12_0    nvidia
     cudatoolkit               11.5.1               hcf5317a_9    nvidia
     cudf                      22.08.00        cuda_11_py38_gb71873c701_0    rapidsai
     cuml                      22.08.00        cuda11_py38_g1e2f8a9aa_0    rapidsai
     cupy                      10.6.0           py38h405e1b6_0    conda-forge
     cytoolz                   0.12.0           py38h0a891b7_0    conda-forge
     dask                      2022.7.1           pyhd8ed1ab_0    conda-forge
     dask-core                 2022.7.1           pyhd8ed1ab_0    conda-forge
     dask-cuda                 22.08.00        py38_g9a61ce5_0    rapidsai
     dask-cudf                 22.08.00        cuda_11_py38_gb71873c701_0    rapidsai
     distributed               2022.7.1           pyhd8ed1ab_0    conda-forge
     dlpack                    0.5                  h9c3ff4c_0    conda-forge
     faiss-proc                1.0.0                      cuda    rapidsai
     fastavro                  1.6.1            py38h0a891b7_0    conda-forge
     fastrlock                 0.8              py38hfa26641_2    conda-forge
     freetype                  2.12.1               hca18f0e_0    conda-forge
     fsspec                    2022.8.2           pyhd8ed1ab_0    conda-forge
     gflags                    2.2.2             he1b5a44_1004    conda-forge
     glog                      0.6.0                h6f12383_0    conda-forge
     grpc-cpp                  1.46.4               hbad87ad_7    conda-forge
     heapdict                  1.0.1                      py_0    conda-forge
     idna                      3.4                pyhd8ed1ab_0    conda-forge
     importlib-metadata        4.11.4           py38h578d9bd_0    conda-forge
     jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
     joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
     jpeg                      9e                   h166bdaf_2    conda-forge
     keyutils                  1.6.1                h166bdaf_0    conda-forge
     krb5                      1.19.3               h3790be6_0    conda-forge
     lcms2                     2.12                 hddcbb42_0    conda-forge
     ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
     lerc                      4.0.0                h27087fc_0    conda-forge
     libabseil                 20220623.0      cxx17_h48a1fff_4    conda-forge
     libblas                   3.9.0           16_linux64_openblas    conda-forge
     libbrotlicommon           1.0.9                h166bdaf_7    conda-forge
     libbrotlidec              1.0.9                h166bdaf_7    conda-forge
     libbrotlienc              1.0.9                h166bdaf_7    conda-forge
     libcblas                  3.9.0           16_linux64_openblas    conda-forge
     libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
     libcudf                   22.08.00        cuda11_gb71873c701_0    rapidsai
     libcuml                   22.08.00        cuda11_g1e2f8a9aa_0    rapidsai
     libcumlprims              22.08.00        cuda11_g1770e60_0    nvidia
     libcurl                   7.85.0               h7bff187_0    conda-forge
     libcusolver               11.4.1.48                     0    nvidia
     libcusparse               11.7.5.86                     0    nvidia
     libdeflate                1.14                 h166bdaf_0    conda-forge
     libedit                   3.1.20191231         he28a2e2_2    conda-forge
     libev                     4.33                 h516909a_1    conda-forge
     libevent                  2.1.10               h9b69904_4    conda-forge
     libfaiss                  1.7.0           cuda112h5bea7ad_8_cuda    conda-forge
     libffi                    3.4.2                h7f98852_5    conda-forge
     libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
     libgfortran-ng            12.1.0              h69a702a_16    conda-forge
     libgfortran5              12.1.0              hdcd56e2_16    conda-forge
     libgomp                   12.1.0              h8d9b700_16    conda-forge
     libgoogle-cloud           2.1.0                hf2e47f9_1    conda-forge
     liblapack                 3.9.0           16_linux64_openblas    conda-forge
     libllvm11                 11.1.0               he0ac6c6_4    conda-forge
     libnghttp2                1.47.0               hdcd2b5c_1    conda-forge
     libnsl                    2.0.0                h7f98852_0    conda-forge
     libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
     libpng                    1.6.38               h753d276_0    conda-forge
     libprotobuf               3.20.1               h6239696_4    conda-forge
     libraft-distance          22.08.00        cuda11_g87a7d16c_0    rapidsai
     libraft-headers           22.08.00        cuda11_g87a7d16c_0    rapidsai
     libraft-nn                22.08.00        cuda11_g87a7d16c_0    rapidsai
     librmm                    22.08.00        cuda11_gd212232c_0    rapidsai
     libsqlite                 3.39.4               h753d276_0    conda-forge
     libssh2                   1.10.0               haa6b8db_3    conda-forge
     libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
     libthrift                 0.16.0               h491838f_2    conda-forge
     libtiff                   4.4.0                h55922b4_4    conda-forge
     libutf8proc               2.7.0                h7f98852_0    conda-forge
     libuuid                   2.32.1            h7f98852_1000    conda-forge
     libwebp-base              1.2.4                h166bdaf_0    conda-forge
     libxcb                    1.13              h7f98852_1004    conda-forge
     libzlib                   1.2.12               h166bdaf_4    conda-forge
     llvmlite                  0.39.1           py38h38d86a4_0    conda-forge
     locket                    1.0.0              pyhd8ed1ab_0    conda-forge
     lz4                       4.0.0            py38h1bf946c_2    conda-forge
     lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
     markupsafe                2.1.1            py38h0a891b7_1    conda-forge
     msgpack-python            1.0.4            py38h43d8883_0    conda-forge
     nccl                      2.14.3.1             h0800d71_0    conda-forge
     ncurses                   6.3                  h27087fc_1    conda-forge
     numba                     0.56.2           py38h9a4aae9_1    conda-forge
     numpy                     1.23.3           py38h3a7f9d9_0    conda-forge
     nvtx                      0.2.3            py38h497a2fe_1    conda-forge
     openjpeg                  2.5.0                h7d73246_1    conda-forge
     openssl                   1.1.1q               h166bdaf_0    conda-forge
     orc                       1.7.6                h6c59b99_0    conda-forge
     packaging                 21.3               pyhd8ed1ab_0    conda-forge
     pandas                    1.4.4            py38h47df419_0    conda-forge
     parquet-cpp               1.5.1                         2    conda-forge
     partd                     1.3.0              pyhd8ed1ab_0    conda-forge
     pillow                    9.2.0            py38ha3b2c9c_2    conda-forge
     pip                       22.2.2             pyhd8ed1ab_0    conda-forge
     protobuf                  3.20.1           py38hfa26641_0    conda-forge
     psutil                    5.9.2            py38h0a891b7_0    conda-forge
     pthread-stubs             0.4               h36c2ea0_1001    conda-forge
     ptxcompiler               0.6.1            py38h7525318_0    conda-forge
     pyarrow                   8.0.1           py38h097c49a_2_cpu    conda-forge
     pycparser                 2.21               pyhd8ed1ab_0    conda-forge
     pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
     pyopenssl                 22.1.0             pyhd8ed1ab_0    conda-forge
     pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
     pyraft                    22.08.00        cuda11_py38_g87a7d16c_0    rapidsai
     pysocks                   1.7.1            py38h578d9bd_5    conda-forge
     python                    3.8.13          h582c2e5_0_cpython    conda-forge
     python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
     python_abi                3.8                      2_cp38    conda-forge
     pytz                      2022.4             pyhd8ed1ab_0    conda-forge
     pyyaml                    6.0              py38h0a891b7_4    conda-forge
     re2                       2022.06.01           h27087fc_0    conda-forge
     readline                  8.1.2                h0f457ee_0    conda-forge
     rmm                       22.08.00        cuda11_py38_gd212232c_0    rapidsai
     s2n                       1.0.10               h9b69904_0    conda-forge
     scipy                     1.9.1            py38hea3f02b_0    conda-forge
     setuptools                65.4.1             pyhd8ed1ab_0    conda-forge
     six                       1.16.0             pyh6c4a22f_0    conda-forge
     snappy                    1.1.9                hbd366e4_1    conda-forge
     sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
     spdlog                    1.8.5                h4bd325d_1    conda-forge
     sqlite                    3.39.4               h4ff8645_0    conda-forge
     tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
     tk                        8.6.12               h27826a3_0    conda-forge
     toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
     tornado                   6.1              py38h0a891b7_3    conda-forge
     treelite                  2.4.0            py38hdd725b4_1    conda-forge
     treelite-runtime          2.4.0                    pypi_0    pypi
     typing_extensions         4.4.0              pyha770c72_0    conda-forge
     ucx                       1.13.1               h538f049_0    conda-forge
     ucx-proc                  1.0.0                       gpu    rapidsai
     ucx-py                    0.27.00         py38_g9abe3c1_0    rapidsai
     urllib3                   1.26.11            pyhd8ed1ab_0    conda-forge
     wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
     xorg-libxau               1.0.9                h7f98852_0    conda-forge
     xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
     xz                        5.2.6                h166bdaf_0    conda-forge
     yaml                      0.2.5                h7f98852_2    conda-forge
     zict                      2.2.0              pyhd8ed1ab_0    conda-forge
     zipp                      3.9.0              pyhd8ed1ab_0    conda-forge
     zlib                      1.2.12               h166bdaf_4    conda-forge
     zstd                      1.5.2                h6239696_4    conda-forge


royinx commented 2 years ago

In my opinion, please make sure normalized_x is the array you want to pass to the model. If you apply unit normalization to x with axis=1, every row of the output must be [1, 0].

robertclancy commented 2 years ago

If you run the snippet above, you will see normalized_x all have unit norm:

normalized_x = x / np.linalg.norm(x, axis=1, keepdims=True)
normalized_x
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])

royinx commented 2 years ago

First, all 5 points in normalized_x share the same coordinates [1, 0], so the distance between them must be zero regardless of which distance metric or clustering model is used.

Second, the cosine similarity indeed does not depend on the lengths of the two vectors. But there is no angular difference between the points in array x; they all lie on the x-axis.
So even if you set eps to 0.00001, it will still return [0,0,0,0,0], meaning they are all in the same cluster.

You can try to set

x = np.array([[1, 0],[2, 0],[3, 0],[4, 0],[5, 0]]).astype('float64')
normalized_labels = DBSCAN(eps=0.00001, min_samples = 2, metric='cosine').fit(normalized_x).labels_
print(normalized_labels) # [0 0 0 1 1]

robertclancy commented 1 year ago

First, all 5 points in normalized_x share the same coordinates [1, 0], so the distance between them must be zero regardless of which distance metric or clustering model is used.

I completely agree with you, but the bug is that DBSCAN with the cosine metric on the original x produces labels of

array([-1, -1, -1, -1, -1], dtype=int32)

i.e. they are all regarded as outliers, when clearly they should all be in one cluster.

royinx commented 1 year ago

That is also a bug with cosine distance in DBSCAN, see #4938. You can normalize the vectors to unit norm and try Euclidean distance instead.

For eps, you need a grid search to tune it yourself. I have taken references from link 1 and link 2:

  1. X' = X / ||X||, Y' = Y / ||Y||
  2. ||X'|| = ||Y'|| = 1
  3. cos_sim(X, Y) = cos_sim(X', Y') = X'Y'
  4. cos_dis(X', Y') = 1 - cos_sim(X', Y') = 1 - X'Y' / (||X'|| ||Y'||) = 1 - X'Y'
  5. eucli_dis(X', Y')^2 = ||X' - Y'||^2 = 2 - 2 cos_sim(X', Y') = 2 (1 - cos_sim(X', Y')) = 2 cos_dis(X', Y')

thus, eps_eu^2 = 2 * eps_cos, i.e. eps_eu = sqrt(2 * eps_cos)
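The identity in step 5 can be checked numerically. The sketch below uses random unit vectors (the sizes and seed are arbitrary assumptions) and plain NumPy. Note that the identity relates the squared Euclidean distance to the cosine distance, so a threshold on the Euclidean distance itself needs a square root.

```python
import numpy as np

# Arbitrary test data: 6 random vectors in 4 dimensions, scaled to unit norm.
rng = np.random.default_rng(0)
v = rng.normal(size=(6, 4))
unit = v / np.linalg.norm(v, axis=1, keepdims=True)

# Cosine distance between unit vectors reduces to 1 - dot product (step 4).
cos_dist = 1.0 - unit @ unit.T

# Squared Euclidean distance between every pair of rows (step 5).
diff = unit[:, None, :] - unit[None, :, :]
sq_eucl = np.sum(diff ** 2, axis=-1)

identity_holds = np.allclose(sq_eucl, 2.0 * cos_dist)
print(identity_holds)

# A cosine threshold therefore maps to this threshold on the plain Euclidean distance.
eps_cos = 0.5
eps_eucl = np.sqrt(2.0 * eps_cos)
```

This only verifies the algebra on normalized inputs; it says nothing about how a particular library scales eps internally.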

BUT this is only the theory. When I ran a grid search for fine-tuning, the vector length significantly affected the best eps.

So, normalizing the vector and doing a grid search on eps is the fastest way.

robertclancy commented 1 year ago

The eps parameter used for cosine is multiplied by 2 https://github.com/rapidsai/cuml/blob/73b8d00d03edd8f462369fdf5a255cb1fb58a94a/cpp/src/dbscan/vertexdeg/algo.cuh#L85 before being passed into the neighbourhood search method, while the eps parameter for euclidean is squared: https://github.com/rapidsai/cuml/blob/73b8d00d03edd8f462369fdf5a255cb1fb58a94a/cpp/src/dbscan/vertexdeg/algo.cuh#L104

So if the vectors are normalized, you can get the same results as cosine by using euclidean with eps=np.sqrt(2 * eps).

i.e.:

# !! assuming normalized_x is already normalized !!
eps = 0.5
cosine_labels = DBSCAN(eps=eps, metric='cosine').fit(normalized_x).labels_
euclidean_labels = DBSCAN(eps=np.sqrt(2 * eps), metric='euclidean').fit(normalized_x).labels_

assert np.all(cosine_labels == euclidean_labels)

will pass.

However, this still doesn't explain the bug I am seeing when passing in a non-normalized x.

georgeliu95 commented 1 year ago

Hey Robert @robertclancy. This bug is caused by #5360, where the matrixVectorOp gets an incorrect parameter. In your example, the data in the calculation of vertex degree would be:

data_x = [[1, 0],[2, 0],[3, 0],[4, 0],[5, 0]]
rowNorm = [[1, 0],[2, 0],[3, 0],[4, 0],[5, 0]]
data_x_after_norm = [[1, 0],[2, 0],[3, 0],[4, 0],[5, 0]]
data_adj = 
    [[1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1]]

So with the default min_samples = 5 there are actually no core points, which is why you get only outliers. If you set min_samples to 3, you get the expected result. Until the RAPIDS team fixes this in a following release, you can change the related lines to pass the right parameter.
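The adjacency matrix above can be reproduced on the CPU under two assumptions that are mine rather than confirmed cuML internals: the row normalization leaves the data unchanged (the bug described above), and neighbours are then selected by comparing the squared Euclidean distance with 2 * eps. This is a NumPy sketch of that reasoning, not the actual kernel.

```python
import numpy as np

x = np.array([[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]], dtype=np.float64)
eps = 1.0

# Assumption: normalization was a no-op, so distances are taken on the raw x.
diff = x[:, None, :] - x[None, :, :]
sq_dist = np.sum(diff ** 2, axis=-1)

# Assumption: a pair is adjacent when its squared distance is within 2 * eps.
adj = (sq_dist <= 2.0 * eps).astype(int)
print(adj)

# Vertex degree counts each point's neighbours, including the point itself.
degree = adj.sum(axis=1)

core_default = degree >= 5  # min_samples = 5: no point qualifies
core_three = degree >= 3    # min_samples = 3: the three middle points qualify
```

With min_samples = 3 the three middle points are core points and the two end points are each adjacent to one of them, so all five points end up in a single cluster, which matches the expected result.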