OPTICS detecting the wrong outlier

adrinjalali commented 6 years ago

The following example shows OPTICS on a small dataset, with one outlier. It seems that the outlier is incorrectly detected.

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [80, 70], [80, 80], [250, 800]])
>>> 
>>> 
>>> np.random.seed(0)
>>> n_points_per_cluster = 10
>>> 
>>> C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
>>> C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
>>> X = np.vstack((C1, C2, np.array([[100, 200]])))
>>> 
>>> clustering = OPTICS(metric="euclidean").fit(X)
>>> np.hstack((X, clustering.labels_[:, np.newaxis]))
array([[ -3.58875812,  -1.67987423,   0.        ],
       [ -4.21700961,  -0.20728544,   0.        ],
       [ -3.50595361,  -2.7818223 ,   0.        ],
       [ -4.23992927,  -2.12108577,   0.        ],
       [ -5.08257508,  -1.6715212 ,   0.        ],
       [ -4.88476514,  -0.83658119,   0.        ],
       [ -4.39116982,  -1.90265999,   0.        ],
       [ -4.64490941,  -1.73306054,   0.        ],
       [ -3.80473674,  -2.16412661,   0.        ],
       [ -4.74954584,  -2.68327659,   0.        ],
       [  3.74470102,  -0.93463814,  -1.        ],
       [  4.08644362,  -1.0742165 ,   1.        ],
       [  4.22697546,  -1.14543657,   1.        ],
       [  4.00457585,  -1.01871839,   1.        ],
       [  4.15327792,  -0.85306412,   1.        ],
       [  4.01549474,  -0.96218375,   1.        ],
       [  3.91122143,  -1.19807965,   1.        ],
       [  3.96520879,  -0.9843651 ,   1.        ],
       [  4.12302907,  -0.87976202,   1.        ],
       [  3.96126732,  -1.03023028,   1.        ],
       [100.        , 200.        ,   1.        ]])
>>> clustering # doctest: +NORMALIZE_WHITESPACE
OPTICS(algorithm='ball_tree', leaf_size=30, max_bound=inf, maxima_ratio=0.75,
    metric='euclidean', metric_params=None, min_cluster_size_ratio=0.005,
    min_maxima_ratio=0.001, min_samples=5, n_jobs=1, p=2,
    rejection_ratio=0.7, significant_min=0.003, similarity_threshold=0.4)

adrinjalali commented 6 years ago

ping @GaelVaroquaux

jnothman commented 6 years ago

perhaps our tests weren't strong enough @espg

espg commented 6 years ago

This looks to be a bug in _automatic_cluster. I'm guessing that it occurs when the leaf in the reachability tree is too small, and is merged to the adjacent leaf, instead of being marked as noise / removed. I'll see if I can find a fix (I think around lines 709 or 725).

jnothman commented 6 years ago

Thank you both

adrinjalali commented 6 years ago

(continuing from #3846) @qinhanmin2014 I don't think I can come up with a bug-free example. Check the following:

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> 
>>> # Generate sample data
>>> 
>>> np.random.seed(0)
>>> for _ in range(10):
...     n_points_per_cluster = 250
...     C1 = [-50, -20] + .08 * np.random.randn(n_points_per_cluster, 2)
...     C2 = [40, -10] + .01 * np.random.randn(n_points_per_cluster, 2)
...     C3 = [10, -20] + .02 * np.random.randn(n_points_per_cluster, 2)
...     C4 = [-20, 30] + .03 * np.random.randn(n_points_per_cluster, 2)
...     C5 = [30, -20] + .06 * np.random.randn(n_points_per_cluster, 2)
...     C6 = [50, 60] + .02 * np.random.randn(n_points_per_cluster, 2)
...     X = np.vstack((C1, C2, C3, C4, C5, C6))
...     clustering = OPTICS(min_samples=9, rejection_ratio=0.5)
...     # Run the fit
...     clustering.fit(X)
...     print(np.sum(clustering.labels_ == -1))
5
5
5
5
5
5
5
5
5
5

I really don't think the data keeps having exactly 5 outliers. I also checked the example provided here and the algorithm detects way too many outliers IMO.

jnothman commented 6 years ago

Thanks for investigating, Adrin. @espg are you working on this? I fear we may need to retract OPTICS from 0.20 if this is not resolved soon.

espg commented 6 years ago

@jnothman I'm on travel and won't be able to dedicate time on this until next week. The bug @adrinjalali first flagged in this thread needs to be investigated, but the recent example above doesn't seem to be incorrect to me. The clusters are widely spaced, so OPTICS evaluates all points as belonging to a given label; the 5 outliers are one point per transition to the next cluster (boundary point where the reachability distance spikes). OPTICS calculates reachability distances starting with the first point to all other points, and then removes the current point from future consideration and recursively repeats. That means that the reachability distances will be low for all points in the cluster under consideration except for the last point on the cluster-- since all it's neighbors have been removed, it will have a reachability distance based on distance to the next cluster (instead of inter-cluster distance), which is quite high in the case above. If you run the same test with DBSCAN, you'll get varying outliers depending on what the eps value is set to, but generally much higher in number for eps under 0.08, and then zero outliers if you set eps greater than twice that... the fact that OPTICS is only predicting one outlier per cluster means it's generally mapping the clusters better then DBSCAN (with less information, since eps doesn't need to be set), so this is a feature, not a bug.

I could see the expected behavior for OPTICS would be to have zero outliers for the case(s) above, and that might be achievable with some change to auto_extract...I'll take a look at this next week and see if there is a way to better handle boundary points.

jnothman commented 6 years ago

Thanks!!

qinhanmin2014 commented 6 years ago

Maybe we can compare our implementation with R, or implementation from the original author if available. Tagging blocker.

rth commented 6 years ago

Is this issue worth delaying the RC release for? It doesn't sound like it will be solved in the next few days... Maybe add this bug fix either in the final 0.20.0 or 0.20.1?

adrinjalali commented 6 years ago

Here's a result from the R package (dbscan)

For the code:

set.seed(2)
n <- 400
x <- cbind(
  x = c(0, 2, 0, 2) + rnorm(n, sd=0.1),
  y = c(2, 0, 0, 2) + rnorm(n, sd=0.1)
)

library(dbscan)

res = optics(x)
#predict(res, x)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .17)
cl$cluster
plot(cl)
hullplot(x, cl)

These are the calculated clusters (no outliers):

> cl$cluster
  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
 [60] 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
[119] 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1
[178] 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
[237] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
[296] 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
[355] 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

And the following plots:

order reachability convex_hulls

espg commented 6 years ago

@adrinjalali can you please compare the reachability plot from R with the output from this implementation of OPTICS? I'm 95% sure that the issue is in the auto_extract function, but it would be good to be positive, and I'm not really familiar with R. If the bug is in auto_extract, the calculated and ordered reachability distances should be identical to within machine precision from the two implementations.

adrinjalali commented 6 years ago

@espg unfortunately it doesn't seem to be the case. They're pretty different.

Here's the R code I used to create the data and get the reachability:

set.seed(2)
n <- 100
x <- cbind(
  x = c(0, 2, 0, 2) + rnorm(n, sd=0.1),
  y = c(2, 0, 0, 2) + rnorm(n, sd=0.1)
)

write.csv(x, file='/tmp/data.csv')

library(dbscan)

res = optics(x)
#predict(res, x)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .17)
cl$cluster
plot(cl)
hullplot(x, cl)

write.csv(matrix(c(cl$order, cl$reachdist), ncol=2), file='/tmp/r.csv')

And here's the python code to apply OPTICS on the same data:

import pandas as pd
import numpy as np
from sklearn.cluster import OPTICS
data = pd.read_csv('/tmp/data.csv')
data = data[['x', 'y']]
clustering = OPTICS(min_samples=5)
clustering.fit(data)
np.savetxt("/tmp/py.csv",
           np.transpose(np.array([clustering.ordering_,
                                  clustering.reachability_])))

Putting the two generated files together, here's what you'd see.

python order	python reachability	R order	R reachability
0	inf	1	Inf
52	0.07329448816232	57	Inf
56	0.071566498480906	73	Inf
12	0.086188714113748	53	Inf
72	0.030358668261071	13	0.03035866826107
40	1.58645318232263	89	0.091487092011897
88	0.064431760283397	49	0.064431760283397
48	0.08831105629047	41	0.088311056290458
4	0.115609686828167	25	0.115609686828167
24	0.077812333737916	5	0.077812333737916
68	0.079401452929697	69	0.079401452929697
64	0.088171163308115	65	0.088171163308108
28	0.073131402725988	81	0.073131402725989
80	0.094438516074332	29	0.094438516074335
36	1.57660870727407	37	0.070793201620562
76	1.56178634568536	77	0.108586894358706
44	0.102916640800507	45	0.102916640800511
...	...	...	...

Does this help?

espg commented 6 years ago

This is interesting, especially given that the R version is finding multiple instances of inf reachability past the first point. I'm still unclear on the R code that you posted-- I see you set and eps threshold (0.17), but I don't see you define minPts anywhere. What is R using or defaulting to-- it looks like maybe two points per cluster? Also, there is not max_bound set on our implementation of OPTICS, so the results won't be directly comparable...

Can you rerun with the R version using minPts of 5, and an eps of 10.0, and then do the same for OPTICS but with max_bound set to 10.0? If you provide a link to the R results and generated data, I can spin up OPTICS directly to compare against that.

Thanks @adrinjalali

adrinjalali commented 6 years ago

Trying to come up with a smaller example, here's the R code (it also includes installing the required package, you can just run it if you're curious)

set.seed(2)
n <- 15
x <- cbind(
    x = c(0, 2, 0) + rnorm(n, sd=0.1),
    y = c(2, 0, 0) + rnorm(n, sd=0.1)
)

write.csv(x, file='/tmp/data.csv')

install.packages('dbscan')
library(dbscan)

res = optics(x, eps = 10, minPts = 5)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .5)
cl$cluster
plot(cl)
hullplot(x, cl)

mat = matrix(c(cl$order, cl$reachdist,
               cl$coredist, cl$predecessor,
               cl$cluster), ncol=5)
colnames(mat)<- c('order', 'reachdist', 'coredist', 'predecessor', 'cluster')
write.csv(mat, file='/tmp/r.csv')

And the attributes of the object are now:

> cl
OPTICS ordering/clustering for 15 objects.
Parameters: minPts = 5, eps = 10, eps_cl = 0.5, xi = NA
The clustering contains 3 cluster(s) and 0 noise points.

1 2 3 
5 5 5 

Available fields: order, reachdist, coredist, predecessor, minPts, eps, eps_cl, xi, cluster

The resulting matrix would be:

	order	reachdist	coredist	predecessor	cluster
1	1	Inf	0.333208651272379	NA	1
2	13	0.292663608330619	0.333843791583614	5	3
3	10	0.182346160922536	0.251820141958916	12	2
4	7	0.17871163011086	0.333208651272379	13	1
5	4	0.355622421383634	0.292663608330619	14	3
6	6	1.56320382265559	0.251820141958916	1	2
7	15	0.17871163011086	0.287675880147458	13	1
8	12	0.355622421383634	0.409445643425412	14	3
9	9	0.182346160922536	0.195940809782328	12	2
10	3	0.17871163011086	0.243693355280387	13	1
11	14	0.355622421383634	0.409445643425412	14	3
12	11	0.244256566906985	0.182346160922536	15	2
13	8	0.333208651272379	0.17871163011086	1	1
14	5	1.7015610774674	0.355622421383634	9	3
15	2	0.251820141958916	0.244256566906985	6	2

Here's the data used in this example:

x	y
-0.089691454662498	1.76889309153948
2.01848491846467	0.087860458092127
0.158784533120882	0.003580671801523
-0.113037567424629	2.10128286921271
1.9919748243449	0.043226515453962
0.013242028438109	0.209081920524915
0.070795472927173	1.88000741803561
1.97603019758282	0.158963820029007
0.198447393665293	0.195465164222325
-0.013878701211967	2.00049377768281
2.04176507507926	-0.245170638784613
0.098175277746366	0.047723730261362
-0.039269535550381	1.94034418313686
1.89603310230511	0.079220327029965
0.178222896030858	0.028963671017735

And the python code on this same data:

import pandas as pd
import numpy as np
from sklearn.cluster import OPTICS
data = pd.read_csv('/tmp/data.csv')
data = data[['x', 'y']]
clustering = OPTICS(min_samples=5, max_bound=10)
clustering.fit(data)
np.savetxt("/tmp/py.csv",
           np.transpose(np.array([clustering.ordering_,
                                  clustering.reachability_])))

would result in:

ordering	reachability	core_distances	labels
0	inf	0.333208651272384	0
12	0.29266360833062	0.333843791583615	-1
9	0.182346160922537	0.251820141958916	-1
3	0.178711630110863	0.333208651272384	0
6	0.355622421383636	0.29266360833062	-1
5	1.56320382265559	0.251820141958916	-1
11	0.178711630110863	0.287675880147461	0
2	0.29266360833062	0.409445643425412	-1
14	0.182346160922537	0.195940809782328	-1
8	0.178711630110863	0.243693355280386	0
13	0.292663608330621	0.409445643425412	-1
4	0.251820141958916	0.182346160922537	-1
1	0.333208651272384	0.178711630110863	0
7	1.7015610774674	0.355622421383636	-1
10	0.182346160922537	0.244256566906986	-1

The results kinda seem closer now!

espg commented 6 years ago

Huh, interesting. The core distances all agree. My initial impression is that the R version is wrong here. The starting point has coordinates (−0.0896914546625, 1.76889309154), and the algorithm should jump to the next closest neighbor... R version goes to point 15, with coordinates (0.178222896030858, 0.028963671017735), and a distance of 1.76044. Sklearn version goes to point 13, with coordinates (−0.0392695355504, 1.94034418314), and a much closer distance of 0.17871. Both algorithms agree on the reachability (and core) distance for point 13, but I can't figure out why the R version orders it number 8 in the ordered list instead of number 2. The points should be ordered according to cluster; points 1, 13, 10, 7, and 4 are all identified as a cluster in both R and Sklearn, but I'd expect them to be adjacent in the ordering scheme (they are in sklearn, they are not at all in R...kinda curious how they got the labeling correct since they aren't in contiguous order, I guess something to do with this predecessor label?).

I'm reviewing the source paper to see if there is something I'm missing for the core algorithm here...

jnothman commented 6 years ago

Worth comparing to Weka/ELKI?

I really should think about how we can better facilitate cross-language comparisons for new implementations.

adrinjalali commented 6 years ago

I'm not used to weka, this is what I got out of it:

=== Run information ===

Scheme:       weka.clusterers.OPTICS -E 10.0 -M 5 -A "weka.core.EuclideanDistance -R first-last" -F -no-gui -db-output .
Relation:     data-weka.filters.unsupervised.attribute.Remove-R1
Instances:    15
Attributes:   2
              x
              y
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

OPTICS clustering results
============================================================================================

Clustered DataObjects: 15
Number of attributes: 2
Epsilon: 10.0; minPoints: 5
Write results to file: yes
Distance-type: 
Number of generated clusters: 0
Elapsed time: .0

( 0.) -0.089691,1.768893                        -->  c_dist: 0.142        r_dist: UNDEFINED   
(12.) -0.03927,1.940344                         -->  c_dist: 0.077        r_dist: 0.142
( 3.) -0.113038,2.101283                        -->  c_dist: 0.142        r_dist: 0.077
( 6.) 0.070795,1.880007                         -->  c_dist: 0.127        r_dist: 0.077
( 9.) -0.013879,2.000494                        -->  c_dist: 0.105        r_dist: 0.077
( 5.) 0.013242,0.209082                         -->  c_dist: 0.111        r_dist: 0.666
(11.) 0.098175,0.047724                         -->  c_dist: 0.079        r_dist: 0.111
(14.) 0.178223,0.028964                         -->  c_dist: 0.108        r_dist: 0.079
( 8.) 0.198447,0.195465                         -->  c_dist: 0.086        r_dist: 0.079
( 2.) 0.158785,0.003581                         -->  c_dist: 0.111        r_dist: 0.079
(13.) 1.896033,0.07922                          -->  c_dist: 0.154        r_dist: 0.789
( 1.) 2.018485,0.08786                          -->  c_dist: 0.142        r_dist: 0.154
(10.) 2.041765,-0.245171                        -->  c_dist: 0.175        r_dist: 0.142
( 7.) 1.97603,0.158964                          -->  c_dist: 0.175        r_dist: 0.142
( 4.) 1.991975,0.043227                         -->  c_dist: 0.125        r_dist: 0.142

Time taken to build model (full training data) : 0.01 seconds

=== Model and evaluation on training set ===

Clustered Instances

Unclustered instances : 15

jnothman commented 6 years ago

thanks! but that's not with the same 100 instances is it? it says 15

adrinjalali commented 6 years ago

The latest example I used for R and Python in the results I put here were also with the same 15 points. I guess it shows the difference. I was trying to make the example as small as I can.

jnothman commented 6 years ago

Oh, I must have missed some comments. Weka looks quite different in its distances?

amueller commented 6 years ago

what's the status of this?

espg commented 6 years ago

I took a look at the original code for automatic extraction from an OPTICS ordered list, and the same bug is present. You can decouple the build of the graph from the graph extraction-- so if you use a different OPTICS extraction function to build the graph on the points in this example and pass it to the unmodified auto_extract code from @amyxzhang, the same bug crops up. In other words, I've isolated where the bug is, but I don't really know how to fix it.

espg commented 6 years ago

Following up from @jnothman request in #11838 , here is the reference paper implementation pseudo code.

Converting reachability plot to dendrogram:

dendo

Main algorithm with tree trimming:

main_algo

amyxzhang commented 6 years ago

Hey all, sorry I just got notice of this issue today! It's been a SUPER long time since I looked at this code but I hope this explanation helps.

Basically the way the algorithm works (as described in the above pseudocode) is that, given a reachability plot, it searches for local maximas to iteratively split the plot, starting from the highest local maxima. A local maxima is a point on the reachability plot where the point to the left is lower and the point to the right is lower and the split that would occur does not create two groups that are now too small (according to given thresholds).

So the bug in the first script by @adrinjalali is because that outlier is at the end of the plot and thus doesn't count as a local maxima given the criteria. Some code could be written to consider potentially rejecting points at the ends of the reachability plot based on some threshold but I don't think it's currently part of the algorithm.

You'll notice that in both the 1st and 2nd script by @adrinjalali, there are middle points in the reachability plots that don't get added to a cluster even though they don't seem like outliers in the scatterplot. @espg is right that this is expected behavior of the algorithm as it stands - those points are splitpoints to divide a cluster in two and don't get added to the new right-hand cluster. Again, this could be revisited using some sort of threshold that looks at the average distance of the cluster to the right of the splitpoint.

There was also a point raised about too many outliers. That depends on the various parameters input towards the extraction algorithm. Also currently, the auto-extractor is only returning leaves of the resulting cluster tree as opposed to all nodes, leading to more outliers, since it's only returning the most densely packed clusters. You could if you wanted to, slice the tree at a particular depth and get a different set of clusters (much as you would in DBSCAN).

In the end, I don't think these issues are showstoppers if there's a deployment going out today.

adrinjalali commented 6 years ago

Thanks @amyxzhang.

If I may, I have one worry regarding labelling those split points as outliers. In practice, with large enough amount of data, people are likely to not investigate why a data point is labelled -1, and simply count them as an outlier.

Here's an undesirable scenario: a data scientist working in a fraud detection department is looking for algorithms to find outliers to raise their risk value, and report them as potential fraud. Since the outliers and split points are labelled the same way, the people whose data happen to be the split points, would potentially be treated unfairly as potential fraud.

That's why I would argue that those split points should either be included in the clusters, or be labelled specifically as a split point (-2 as opposed to -1 for instance).

I know in an ideal world the above scenario wouldn't happen, but unfortunately it's likely that once an algorithm gets into scikit-learn, it's used w/o careful consideration of the details of the algorithm.

That's of course just my one cent contribution to the discussion.

amyxzhang commented 6 years ago

Yes @adrinjalali, I agree that something could be done about the issues, maybe the strategies I mentioned above, as a final pass over the clusters perhaps. I think there would need to be testing to consider edge cases though. It may be trickier than anticipated with noisier data (lots of potential splitpoints, some of which are discarded). Back when I was working on this, I was mainly interested in the spatial boundary delineation of highly dense areas, so the issue did not come up.

jnothman commented 6 years ago

I've not yet had the time to understand the details here. But would it be correct to say that the implementation is faithful to the reference paper, but that there is an ambiguity in the algorithm that could be handled in a way that would be friendlier to ?some users.

jnothman commented 6 years ago

?

amyxzhang commented 6 years ago

Right, the implementation follows the algorithm, which doesn't give guidance for splitpoints and outlier points at the very end of the reachability plot. I think some small tweaks, with sufficient testing, can resolve these issues to make these examples conform to expectations.

jnothman commented 6 years ago

well, fwiw it would be very good to get this changed before the imminent release if we can come up with specified functionality ..

adrinjalali commented 6 years ago

I'm reading the original paper, and I'll get into the details of the implementation (for my own curiosity). Then I'll probably be able to fix some of these issues. But if anybody has the time and is already familiar with the code/algorithm, please don't wait for me. Anyhow, I'll try my best.

espg commented 6 years ago

It may be possible to deal with the split points by lumping them into the left-hand objects if their core distance is small. Since OPTICS removes points from consideration once they've been processed, and always picks the next closest point, the last point in a cluster tends to have large reachability distances and often gets marked as noise. Since we know this, and read the graph from left to right, we could have these points marked as cluster members. This should work for cases that have no noise; for cases that have noise (i.e., where the split point is a noise point not close to any cluster), I'd expect that the core distance would be closer to the reachability distance. For the inverse case (where the spit point is part of the cluster), the reachability distance should be large, but the core distance should be much smaller. If we screen this based off of a ratio of those two numbers:

We shouldn't have to calculate any new distances (core and reachability are stored already)
We shouldn't have to post process (we can do in the loop / tree build process)
We don't have to add new label definitions (i.e., -2 for split points)

The other path I see is doing some sort of post processing of the split points after the routine finishes (i.e., by labeling split points as -2, and then doing nearest neighbor lookups to determine new labels as either cluster members or noise)...but my guess is that this second approach will be more error prone, especially in cases where the split point is noise, but has no other noise points nearby to infer correct labeling.

adrinjalali commented 6 years ago

@espg that sounds reasonable. I've put a version of it in the PR #11857 (threshold hard coded as 1.5).

Please excuse all the debugging prints for now, the patch is around line 675.

I'll try to test it in different scenarios and also fix the "last point being a NOISE" issue.

kno10 commented 6 years ago

Boundary points on spikes in OPTICS plots are a bit tricky to handle. That is why ELKI uses the predecessor (and R copied that from ELKI).

Erich Schubert, Michael Gertz: Improving the Cluster Structure Extracted from OPTICS Plots. LWDA 2018: 318-329

adrinjalali commented 6 years ago

Thanks @kno10. Is it the DBSCAN package you're talking about? As far as I know, R only includes the original paper's method (DBSCAN and Xi), with some fixes. Which package are you talking about? I've used the dbscan package as reference.

kno10 commented 6 years ago

@adrinjalali The dbscan package states: "The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation". But the same technique can (and should) be used with any other plot-based extraction technique. The article mentioned above http://ceur-ws.org/Vol-2191/paper37.pdf explains why the "last" point of a valley may or may not belong to the cluster, and how to use the predecessor information to distinguish these two cases. If you always include the last point, you'll get "spike" artifacts.

adrinjalali commented 6 years ago

Update

The issue with the multiple inf reachability values, PR #12029 fixes the issue (needs a second review)
The issue with the metric parameter, almost fixed, it still doesn't support sparse input, issue #12009, #11982, PR #12028 (needs a second review)
Handling the split points borrowing the xi-steep concept from the original paper: PR #12049 (needs review)
Compare floats in quick_scan with isclose: PR #11929 (waiting to be merged)

Related outstanding

Make the reachability_ values consistent with a reference implementation (See https://github.com/scikit-learn/scikit-learn/pull/11929#issuecomment-419301414)
Either add a parameter related to maxima_ratio check, or compare with all members of the cluster (See https://github.com/scikit-learn/scikit-learn/pull/11857#discussion_r215698537)

adrinjalali commented 5 years ago

Closing this as the sqlnk algorithm having these issues is removed.

scikit-learn / scikit-learn

OPTICS detecting the wrong outlier #11677

Update

Related outstanding