mhahsler / dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
GNU General Public License v3.0

Discrepancies in outlier scores between HDBSCAN in R and Python #38

Closed matteo-1980 closed 4 years ago

matteo-1980 commented 5 years ago

Dear author,

I have started a detailed examination of the hdbscan function in the dbscan package because I observed an anomalous distribution of outlier scores for my dataset: the scores tended to group around two values, 0 and 1.

To understand the reasons behind this problem, I applied the hdbscan function in both R and Python to the same dataset.

The result is that the distribution of outlier scores produced by Python makes much more sense than the one produced by R.

Is there a known issue on this?

Would you be able to fix this?

Thanks in advance for your feedback

Best regards

Matteo

peekxc commented 5 years ago

Do you have a reproducible example by chance? A reprex or something would help a lot.

matteo-1980 commented 5 years ago

Okay. I will produce it. Where shall I then store code and data?

peekxc commented 5 years ago

The reprex package should do most of this for you, actually. Example:

data("iris")
model <- dbscan::hdbscan(iris[,-5], 5)
print(model)
#> HDBSCAN clustering for 150 objects.
#> Parameters: minPts = 5
#> The clustering contains 2 cluster(s) and 0 noise points.
#> 
#>   1   2 
#> 100  50 
#> 
#> Available fields: cluster, minPts, cluster_scores,
#>                   membership_prob, outlier_scores, hc
head(model$outlier_scores)
#> [1] 0.0000000 0.0000000 0.4654775 0.4226497 0.1835034 0.6220355

Created on 2019-11-01 by the reprex package (v0.3.0)

matteo-1980 commented 5 years ago

Here it is :-) Only the R part for the moment (I have issues with Python, which I hope to solve on Monday).

As you can see, most of the observations get NA as their outlier score, and the remaining observations take either 0 or 1, with very few in between. I guess this is related to overlapping points: I noticed it when replicating the data. With artificial data I did not get the problem; I could reproduce it only by reducing the variety among the points. Thanks for the help.

# Libraries
library(dbscan)
#> Warning: package 'dbscan' was built under R version 3.5.3
library(DescTools)
#> Warning: package 'DescTools' was built under R version 3.5.3
library(reprex)
#> Warning: package 'reprex' was built under R version 3.5.3

# Generate data
dset1 <- data.frame(
  "id1" = c(rep(1, 500), rep(0, 500)),
  "id2" = c(rep(0, 500), rep(1, 500)),
  "val" = c(round(rnorm(250, 1, 0.5), 1), round(rnorm(250, 2, 0.5), 1),
            round(rnorm(250, 2, 0.25), 2), round(rnorm(250, 3, 0.25), 1))
)

# Check repeated observations
print(paste0('Repeated observations over total observations: ',round(1-dim(unique(dset1))[1]/dim(dset1)[1],4)*100,'%'))
#> [1] "Repeated observations over total observations: 85.8%"

# Generate outlying observation
dset1 <- rbind(dset1, data.frame("id1" = 0, "id2" = 1, "val" = -3))

# Launch hdbscan
hdb_res1 <- hdbscan(dset1, minPts = 5)

# Check na scores
print(paste0('Scores equal to NA over total observations: ',round(sum(is.na(hdb_res1$outlier_scores))/dim(dset1)[1],4)*100,'%'))
#> [1] "Scores equal to NA over total observations: 82.02%"

# Plot histogram of outlier scores
Freq(hdb_res1$outlier_scores)
#>         level  freq   perc  cumfreq  cumperc
#> 1     [0,0.1]    71  39.4%       71    39.4%
#> 2   (0.1,0.2]     0   0.0%       71    39.4%
#> 3   (0.2,0.3]     0   0.0%       71    39.4%
#> 4   (0.3,0.4]     0   0.0%       71    39.4%
#> 5   (0.4,0.5]     3   1.7%       74    41.1%
#> 6   (0.5,0.6]     0   0.0%       74    41.1%
#> 7   (0.6,0.7]     5   2.8%       79    43.9%
#> 8   (0.7,0.8]     0   0.0%       79    43.9%
#> 9   (0.8,0.9]     0   0.0%       79    43.9%
#> 10    (0.9,1]   101  56.1%      180   100.0%

Created on 2019-11-02 by the reprex package (v0.3.0)

peekxc commented 5 years ago

Just some initial thoughts, without going too far into the code.

NaNs are very often caused by divisions like 0/0. The NAs are likely caused by the core distance being 0.0 for most of the observations, which happens because most of the data is redundant.

dset1 <- data.frame(
    "id1"= c(rep(1,500),rep(0,500)),
    "id2"= c(rep(0,500),rep(1,500)),
    "val"= c(round(rnorm(250,1,0.5),1),round(rnorm(250,2,0.5),1),round(rnorm(250,2,0.25),2),round(rnorm(250,3,0.25),1)) 
)
nrow(dset1)
#> [1] 1000
nrow(unique(dset1))
#> [1] 140

Created on 2019-11-02 by the reprex package (v0.3.0)
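
To see the zero core distances directly, here is a quick sketch reusing the dset1 from above (computed in base R to stay version-independent): with minPts = 5, the core distance is the distance to the 4th nearest neighbor, which is 0 for any row that occurs at least 5 times.

d <- as.matrix(dist(dset1))
diag(d) <- Inf                                     # exclude each point itself
core_dist <- apply(d, 1, function(r) sort(r)[4])   # 4th-nearest-neighbor distance
mean(core_dist == 0)                               # fraction with core distance 0; most of the data here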

This becomes worse in HDBSCAN's specific internals, actually. Redundant data is already a nice pathological way to mess with outlier detection (the distance between identical points is 0, so notions of outlierness based on distributions of distances get skewed), and in HDBSCAN recall that the principal distance to consider is the mutual reachability distance:

d_mrd(x_i, x_j) = max( d_core(x_i), d_core(x_j), d(x_i, x_j) )
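
As a toy check of the formula (made-up 1-d data; with minPts = 2 the core distance is just the nearest-neighbor distance):

pts <- c(0, 0, 0, 1)
d <- as.matrix(dist(pts))
core <- apply(d + diag(Inf, 4), 1, min)  # nearest-neighbor distance per point
core
#> [1] 0 0 0 1
max(core[1], core[2], d[1, 2])           # d_mrd between two of the duplicated points
#> [1] 0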

Indeed, in the above example, less than half a percent of the distances are actually unique:

library(dbscan)

# Generate data
n <- 500
dset1 <- data.frame(
  "id1" = c(rep(1, n), rep(0, n)),
  "id2" = c(rep(0, n), rep(1, n)),
  # note: val has n values and is recycled to match the 2n rows of id1/id2
  "val" = round(c(rnorm(n/4, 1, 0.5), rnorm(n/4, 2, 0.5),
                  rnorm(n/4, 2, 0.25), rnorm(n/4, 3, 0.25)), 1)
)
dset1 <- rbind(dset1, data.frame("id1" = 0, "id2" = 1, "val" = -3))

x <- dset1 
minPts <- 5
core_dist <- kNNdist(x, k = minPts - 1)[, minPts - 1]
xdist <- dist(x, method = "euclidean")
mrd <- dbscan:::mrd(xdist, core_dist)

length(unique(mrd))/length(mrd)
#> [1] 0.0004295704

Created on 2019-11-02 by the reprex package (v0.3.0)

Eq. 8 of Campello's 2015 paper defines the GLOSH score as 1 - e_max(x_i)/e(x_i), so the way to get NaNs is when e_max(x_i)/e(x_i) evaluates to 0/0.
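
A minimal arithmetic illustration (the epsilon values below are made up): for a duplicated point, both epsilon values collapse to 0 and the 0/0 propagates through as NaN.

e_max <- c(0.5, 0)  # hypothetical values: a normal point and a duplicated point
e     <- c(1.0, 0)
1 - e_max / e
#> [1] 0.5 NaN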

Some preliminary experiments seem to point toward this as the cause as well. You can generate an arbitrary number of NaNs by shrinking the data until duplicates dominate each point's neighborhood:

library(dbscan)

# Generate data
n <- 24
dset1 <- data.frame(
  "id1" = c(rep(1, n), rep(0, n)),
  "id2" = c(rep(0, n), rep(1, n)),
  # note: val has n values and is recycled to match the 2n rows of id1/id2
  "val" = round(c(rnorm(n/4, 1, 0.5), rnorm(n/4, 2, 0.5),
                  rnorm(n/4, 2, 0.25), rnorm(n/4, 3, 0.25)), 1)
)
dset1 <- rbind(dset1, data.frame("id1" = 0, "id2" = 1, "val" = -3))

## outlier scores fine here
hdb <- hdbscan(dset1, minPts = 5)
hdb$outlier_scores
#>  [1] 6.250000e-01 7.771561e-16 6.666667e-01 2.500000e-01 4.000000e-01
#>  [6] 4.000000e-01 0.000000e+00 0.000000e+00 0.000000e+00 7.771561e-16
#> [11] 7.771561e-16 2.500000e-01 2.500000e-01 0.000000e+00 2.500000e-01
#> [16] 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e-01 0.000000e+00
#> [21] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.250000e-01
#> [26] 7.771561e-16 6.666667e-01 2.500000e-01 4.000000e-01 4.000000e-01
#> [31] 0.000000e+00 0.000000e+00 0.000000e+00 7.771561e-16 7.771561e-16
#> [36] 2.500000e-01 2.500000e-01 0.000000e+00 2.500000e-01 0.000000e+00
#> [41] 0.000000e+00 0.000000e+00 4.000000e-01 0.000000e+00 0.000000e+00
#> [46] 0.000000e+00 0.000000e+00 0.000000e+00 9.164407e-01

## Many neighborhoods end up with score 1 - 0/0, i.e. NaN
hdb <- hdbscan(dset1, minPts = 2)
hdb$outlier_scores
#>  [1]   0   0   0   0   0   0 NaN   0 NaN   0   0   1 NaN NaN NaN NaN NaN
#> [18]   0   1   0   0 NaN NaN NaN   0   0   0   0   0   0 NaN   0 NaN   0
#> [35]   0   1 NaN NaN NaN NaN NaN   0   1   0   0 NaN NaN NaN   1

Created on 2019-11-02 by the reprex package (v0.3.0)

I could add a check to see if the scores are NaN and then reassign them to 0, but that may not be helpful. Rather, I recommend removing redundant data to begin with. You can keep track of the duplicate indices and, in post-processing, assign each duplicate the same outlier score as the point it is identical to; see the sketch below.
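
Something like this (a sketch, assuming the dset1 from above; note that clustering only the unique rows can of course change the hierarchy relative to the full data):

library(dbscan)

# cluster only the unique rows
dset_u <- unique(dset1)
hdb_u <- hdbscan(dset_u, minPts = 5)

# map every original row to its matching unique row via a simple row key
row_key <- function(df) do.call(paste, c(df, sep = "\r"))
idx <- match(row_key(dset1), row_key(dset_u))

# give each duplicate the score of the point it coincides with
scores_full <- hdb_u$outlier_scores[idx]
length(scores_full) == nrow(dset1)
#> [1] TRUE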

I don't know what the Python implementation does. Maybe it has an additional check that assigns NaNs to 0, or maybe it removes redundant data before processing and assigns the scores in a post-processing step.

matteo-1980 commented 5 years ago

Dear author, I have reproduced the case in Python; the result is below. Python handles repeated observations well and identifies the synthetic outlier properly. I think the R package should be fixed: this is a bug. Can we expect a fix? Thanks


# Libraries
import hdbscan as hd
import pandas as pd
import numpy as np

# Generate data
id1 = [1] * 500 + [0] * 500
id2 = [0] * 500 + [1] * 500
va1 = list(np.random.normal(1, 0.5, 250)) + list(np.random.normal(2, 0.5, 250)) + \
      list(np.random.normal(2, 0.25, 250)) + list(np.random.normal(3, 0.25, 250))
val = [round(elem, 1) for elem in va1]
dset1 = pd.DataFrame({'id1': id1, 'id2': id2, 'val': val})

# Add synthetic outlier
out1 = pd.DataFrame({'id1': [0], 'id2': [1], 'val': [-100]})
dset1 = pd.concat([dset1, out1], ignore_index=True)

# Create clustering object
clusterer = hd.HDBSCAN()

# Fit the clustering
clusterer.fit(dset1)

# Get scores in a variable
scores=clusterer.outlier_scores_

# Print frequencies
print('Frequencies of scores:')
print(pd.crosstab(index=pd.cut(scores, 20), columns="count"))

Frequencies of scores:
col_0                count
row_0                     
(-0.000986, 0.0493]    983
(0.0493, 0.0986]         0
(0.0986, 0.148]          0
(0.148, 0.197]           0
(0.197, 0.246]           0
(0.246, 0.296]           0
(0.296, 0.345]           0
(0.345, 0.394]           0
(0.394, 0.444]           0
(0.444, 0.493]           0
(0.493, 0.542]           4
(0.542, 0.592]           0
(0.592, 0.641]           0
(0.641, 0.69]            3
(0.69, 0.739]            0
(0.739, 0.789]           1
(0.789, 0.838]           0
(0.838, 0.887]           0
(0.887, 0.937]           1
(0.937, 0.986]           1

# Compare the non-outlying observations (rows 0..999) with the synthetic outlier (row 1000)
print('Maximum score of the non-outlying observations: ', round(max(scores[0:1000]), 2))
print('Score of the outlying observation: ', round(scores[1000], 2))

Maximum score of the non-outlying observations:  0.89
Score of the outlying observation:  0.99
mhahsler commented 4 years ago

GLOSH (development version on CRAN) now returns 0 instead of NaN if we have k duplicate points in the data. This is consistent: if we have, for example, 3 observations at exactly the same location, then for k <= 3 such a point is not an outlier.
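
A quick way to check the fixed behavior (a sketch with made-up data):

library(dbscan)

# 20 random points plus 5 exact duplicates of the origin
x <- rbind(matrix(rnorm(40), ncol = 2), matrix(0, nrow = 5, ncol = 2))

# with minPts = 3 the duplicated points have core distance 0; their outlier
# scores used to come back NaN, with the fix they are 0
hdb <- hdbscan(x, minPts = 3)
any(is.nan(hdb$outlier_scores))
#> [1] FALSE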