robertclancy opened this issue 2 years ago
In my opinion, please make sure that normalized_x is really the array you want to feed into the model. If you apply a unit norm to x along axis=1, every row must come out as [1, 0].
If you run the snippet above, you will see that every row of normalized_x has unit norm:

```python
>>> normalized_x = x / np.linalg.norm(x, axis=1, keepdims=True)
>>> normalized_x
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])
```
First, all 5 points in normalized_x share the same coordinates [1, 0], so their pairwise distance must be zero no matter which distance metric or clustering model is used. Second, the cosine similarity indeed does not depend on the lengths of the two vectors, but there is no angular difference between the rows of x: they are all points on the x-axis. So even if you set eps to 0.00001, DBSCAN will still return [0, 0, 0, 0, 0], meaning they all fall in the same cluster.
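To make the angle point concrete, here is a small illustrative sketch (scikit-learn is used here just for the pairwise helper; it is not part of the original report):

```python
# Illustrative sketch (assumes scikit-learn): all rows of x are collinear,
# so every pairwise cosine distance is zero regardless of vector length.
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

x = np.array([[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]], dtype='float64')
print(np.allclose(cosine_distances(x), 0.0))  # True
```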
You can try, for example:

```python
x = np.array([[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]]).astype('float64')
normalized_labels = DBSCAN(eps=0.00001, min_samples=2, metric='cosine').fit(normalized_x).labels_
print(normalized_labels)  # [0 0 0 1 1]
```
> First, all 5 points in normalized_x share the same coordinates [1, 0], so their pairwise distance must be zero no matter which distance metric or clustering model is used.
I completely agree with you, but the bug is that DBSCAN with the cosine metric on the original x produces labels of `array([-1, -1, -1, -1, -1], dtype=int32)`, i.e. all points are regarded as outliers, when clearly they should all be in one cluster.
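For comparison, a quick sketch of the same data through scikit-learn's DBSCAN (the eps=0.5 value is an assumption, borrowed from later in this thread) shows the expected single cluster:

```python
# Cross-check with scikit-learn (eps value assumed): cosine distance on the
# raw, un-normalized x is zero everywhere, so one cluster is expected.
import numpy as np
from sklearn.cluster import DBSCAN

x = np.array([[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]], dtype='float64')
print(DBSCAN(eps=0.5, metric='cosine').fit(x).labels_)  # [0 0 0 0 0]
```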
That's also a bug when using cosine distance with DBSCAN, see #4938. You can normalize the vectors to unit norm and use the euclidean distance instead. For eps you need a grid search to tune it yourself; I took references from link 1 and link 2. For unit vectors ||u − v||² = 2(1 − cos(u, v)), thus eps_eu² = 2 · eps_cos, i.e. eps_eu = sqrt(2 · eps_cos). BUT this is the theory only: when I implemented a grid search for fine-tuning, the vector length significantly affected the eps variation. So normalizing the vectors and then doing a grid search on eps is the fastest way.
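A minimal NumPy sketch of that identity, assuming unit-norm vectors:

```python
# Minimal sketch: for unit vectors u and v,
# ||u - v||^2 = 2 * (1 - cos(u, v)), so eps_eu = sqrt(2 * eps_cos).
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 8))
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

sq_euclidean = np.sum((u - v) ** 2)
cosine_distance = 1.0 - u @ v  # cosine distance between unit vectors

assert np.isclose(sq_euclidean, 2 * cosine_distance)
```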
The eps parameter used for cosine is multiplied by 2 before being passed into the neighbourhood search method (https://github.com/rapidsai/cuml/blob/73b8d00d03edd8f462369fdf5a255cb1fb58a94a/cpp/src/dbscan/vertexdeg/algo.cuh#L85), while the eps parameter for euclidean is squared (https://github.com/rapidsai/cuml/blob/73b8d00d03edd8f462369fdf5a255cb1fb58a94a/cpp/src/dbscan/vertexdeg/algo.cuh#L104). So if the vectors are normalized, you can get the same results as cosine by using euclidean with eps=np.sqrt(2 * eps).
That is, the following will pass:

```python
# !! assuming normalized_x is already normalized !!
from cuml import DBSCAN
import numpy as np

eps = 0.5
cosine_labels = DBSCAN(eps=eps, metric='cosine').fit(normalized_x).labels_
euclidean_labels = DBSCAN(eps=np.sqrt(2 * eps), metric='euclidean').fit(normalized_x).labels_
assert np.all(cosine_labels == euclidean_labels)
```
This still doesn't explain the bug I am seeing, however, when passing in a non-normalized x.
Hey Robert @robertclancy. This bug is caused by #5360, where the matrixVectorOp gets an incorrect parameter. In your example, the data in the calculation of the vertex degrees would be:

```
data_x            = [[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]]
rowNorm           = [1, 2, 3, 4, 5]
data_x_after_norm = [[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]]   <- unchanged: the row norms are never applied
data_adj =
[[1, 1, 0, 0, 0],
 [1, 1, 1, 0, 0],
 [0, 1, 1, 1, 0],
 [0, 0, 1, 1, 1],
 [0, 0, 0, 1, 1]]
```
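That adjacency can be reproduced with plain NumPy under the assumption that the buggy cosine path ends up thresholding the squared euclidean distances of the raw data at 2 * eps (eps=0.5 assumed, as in the earlier snippets):

```python
# Sketch (assumptions: eps=0.5, and under the bug the cosine path thresholds
# the squared euclidean distances of the *raw*, un-normalized data at 2 * eps).
import numpy as np

x = np.array([[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]], dtype='float64')
eps = 0.5

sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
data_adj = (sq_dists <= 2 * eps).astype(int)
print(data_adj)  # tri-diagonal band: each point only reaches its neighbours
```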
So with the default min_samples = 5 there are actually no core points, which is why you get outliers there. If you set min_samples to 3, you get the expected results. You can change the related lines to pass the right parameter as a workaround until the RAPIDS team fixes it in a following release.
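A sketch of that workaround (eps=0.5 assumed, as in the earlier snippets):

```python
# Sketch of the min_samples workaround (eps value assumed). With the
# band-shaped adjacency above, interior points have degree 3, so
# min_samples=3 makes them core points and all five points join one cluster.
from cuml import DBSCAN
import numpy as np

x = np.array([[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]]).astype('float64')
labels = DBSCAN(eps=0.5, min_samples=3, metric='cosine').fit(x).labels_
print(labels)  # expected: [0 0 0 0 0]
```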
**Describe the bug**
DBSCAN will produce different results when using the cosine metric depending on whether or not the input vectors are normalized.
**Steps/Code to reproduce bug**
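A minimal reproduction, assembled from the snippets quoted in the discussion above:

```python
from cuml import DBSCAN
import numpy as np

x = np.array([[1, 0], [2, 0], [3, 0], [4, 0], [5, 0]]).astype('float64')
normalized_x = x / np.linalg.norm(x, axis=1, keepdims=True)

raw_labels = DBSCAN(eps=0.5, metric='cosine').fit(x).labels_
norm_labels = DBSCAN(eps=0.5, metric='cosine').fit(normalized_x).labels_

print(raw_labels)   # [-1 -1 -1 -1 -1] on affected versions
print(norm_labels)  # [0 0 0 0 0]
```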
**Expected behavior**
I expect that the output should not depend on normalization, since the cosine similarity does not depend on the lengths of the two vectors.