Closed adrinjalali closed 5 years ago
ping @GaelVaroquaux
perhaps our tests weren't strong enough @espg
This looks to be a bug in _automatic_cluster
. I'm guessing that it occurs when the leaf in the reachability tree is too small, and is merged to the adjacent leaf, instead of being marked as noise / removed. I'll see if I can find a fix (I think around lines 709 or 725).
Thank you both
(continuing from #3846) @qinhanmin2014 I don't think I can come up with a bug-free example. Check the following:
>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>>
>>> # Generate sample data
>>>
>>> np.random.seed(0)
>>> for _ in range(10):
... n_points_per_cluster = 250
... C1 = [-50, -20] + .08 * np.random.randn(n_points_per_cluster, 2)
... C2 = [40, -10] + .01 * np.random.randn(n_points_per_cluster, 2)
... C3 = [10, -20] + .02 * np.random.randn(n_points_per_cluster, 2)
... C4 = [-20, 30] + .03 * np.random.randn(n_points_per_cluster, 2)
... C5 = [30, -20] + .06 * np.random.randn(n_points_per_cluster, 2)
... C6 = [50, 60] + .02 * np.random.randn(n_points_per_cluster, 2)
... X = np.vstack((C1, C2, C3, C4, C5, C6))
... clustering = OPTICS(min_samples=9, rejection_ratio=0.5)
... # Run the fit
... clustering.fit(X)
... print(np.sum(clustering.labels_ == -1))
5
5
5
5
5
5
5
5
5
5
I really don't think the data keeps having exactly 5 outliers. I also checked the example provided here and the algorithm detects way too many outliers IMO.
Thanks for investigating, Adrin. @espg are you working on this? I fear we may need to retract OPTICS from 0.20 if this is not resolved soon.
@jnothman I'm on travel and won't be able to dedicate time on this until next week. The bug @adrinjalali first flagged in this thread needs to be investigated, but the recent example above doesn't seem to be incorrect to me. The clusters are widely spaced, so OPTICS evaluates all points as belonging to a given label; the 5 outliers are one point per transition to the next cluster (boundary point where the reachability distance spikes). OPTICS calculates reachability distances starting with the first point to all other points, and then removes the current point from future consideration and recursively repeats. That means that the reachability distances will be low for all points in the cluster under consideration except for the last point on the cluster-- since all it's neighbors have been removed, it will have a reachability distance based on distance to the next cluster (instead of inter-cluster distance), which is quite high in the case above. If you run the same test with DBSCAN, you'll get varying outliers depending on what the eps value is set to, but generally much higher in number for eps under 0.08, and then zero outliers if you set eps greater than twice that... the fact that OPTICS is only predicting one outlier per cluster means it's generally mapping the clusters better then DBSCAN (with less information, since eps doesn't need to be set), so this is a feature, not a bug.
I could see the expected behavior for OPTICS would be to have zero outliers for the case(s) above, and that might be achievable with some change to auto_extract...I'll take a look at this next week and see if there is a way to better handle boundary points.
Thanks!!
Maybe we can compare our implementation with R, or implementation from the original author if available. Tagging blocker.
Is this issue worth delaying the RC release for? It doesn't sound like it will be solved in the next few days... Maybe add this bug fix either in the final 0.20.0 or 0.20.1?
Here's a result from the R
package (dbscan
)
For the code:
set.seed(2)
n <- 400
x <- cbind(
x = c(0, 2, 0, 2) + rnorm(n, sd=0.1),
y = c(2, 0, 0, 2) + rnorm(n, sd=0.1)
)
library(dbscan)
res = optics(x)
#predict(res, x)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .17)
cl$cluster
plot(cl)
hullplot(x, cl)
These are the calculated clusters (no outliers):
> cl$cluster
[1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
[60] 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
[119] 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1
[178] 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
[237] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
[296] 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
[355] 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
And the following plots:
@adrinjalali can you please compare the reachability plot from R with the output from this implementation of OPTICS? I'm 95% sure that the issue is in the auto_extract
function, but it would be good to be positive, and I'm not really familiar with R. If the bug is in auto_extract
, the calculated and ordered reachability distances should be identical to within machine precision from the two implementations.
@espg unfortunately it doesn't seem to be the case. They're pretty different.
Here's the R
code I used to create the data and get the reachability:
set.seed(2)
n <- 100
x <- cbind(
x = c(0, 2, 0, 2) + rnorm(n, sd=0.1),
y = c(2, 0, 0, 2) + rnorm(n, sd=0.1)
)
write.csv(x, file='/tmp/data.csv')
library(dbscan)
res = optics(x)
#predict(res, x)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .17)
cl$cluster
plot(cl)
hullplot(x, cl)
write.csv(matrix(c(cl$order, cl$reachdist), ncol=2), file='/tmp/r.csv')
And here's the python
code to apply OPTICS
on the same data:
import pandas as pd
import numpy as np
from sklearn.cluster import OPTICS
data = pd.read_csv('/tmp/data.csv')
data = data[['x', 'y']]
clustering = OPTICS(min_samples=5)
clustering.fit(data)
np.savetxt("/tmp/py.csv",
np.transpose(np.array([clustering.ordering_,
clustering.reachability_])))
Putting the two generated files together, here's what you'd see.
python order | python reachability | R order | R reachability |
---|---|---|---|
0 | inf | 1 | Inf |
52 | 0.07329448816232 | 57 | Inf |
56 | 0.071566498480906 | 73 | Inf |
12 | 0.086188714113748 | 53 | Inf |
72 | 0.030358668261071 | 13 | 0.03035866826107 |
40 | 1.58645318232263 | 89 | 0.091487092011897 |
88 | 0.064431760283397 | 49 | 0.064431760283397 |
48 | 0.08831105629047 | 41 | 0.088311056290458 |
4 | 0.115609686828167 | 25 | 0.115609686828167 |
24 | 0.077812333737916 | 5 | 0.077812333737916 |
68 | 0.079401452929697 | 69 | 0.079401452929697 |
64 | 0.088171163308115 | 65 | 0.088171163308108 |
28 | 0.073131402725988 | 81 | 0.073131402725989 |
80 | 0.094438516074332 | 29 | 0.094438516074335 |
36 | 1.57660870727407 | 37 | 0.070793201620562 |
76 | 1.56178634568536 | 77 | 0.108586894358706 |
44 | 0.102916640800507 | 45 | 0.102916640800511 |
... | ... | ... | ... |
Does this help?
This is interesting, especially given that the R version is finding multiple instances of inf
reachability past the first point. I'm still unclear on the R code that you posted-- I see you set and eps threshold (0.17), but I don't see you define minPts anywhere. What is R using or defaulting to-- it looks like maybe two points per cluster? Also, there is not max_bound set on our implementation of OPTICS, so the results won't be directly comparable...
Can you rerun with the R version using minPts of 5, and an eps of 10.0, and then do the same for OPTICS but with max_bound set to 10.0? If you provide a link to the R results and generated data, I can spin up OPTICS directly to compare against that.
Thanks @adrinjalali
Trying to come up with a smaller example, here's the R
code (it also includes installing the required package, you can just run it if you're curious)
set.seed(2)
n <- 15
x <- cbind(
x = c(0, 2, 0) + rnorm(n, sd=0.1),
y = c(2, 0, 0) + rnorm(n, sd=0.1)
)
write.csv(x, file='/tmp/data.csv')
install.packages('dbscan')
library(dbscan)
res = optics(x, eps = 10, minPts = 5)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .5)
cl$cluster
plot(cl)
hullplot(x, cl)
mat = matrix(c(cl$order, cl$reachdist,
cl$coredist, cl$predecessor,
cl$cluster), ncol=5)
colnames(mat)<- c('order', 'reachdist', 'coredist', 'predecessor', 'cluster')
write.csv(mat, file='/tmp/r.csv')
And the attributes of the object are now:
> cl
OPTICS ordering/clustering for 15 objects.
Parameters: minPts = 5, eps = 10, eps_cl = 0.5, xi = NA
The clustering contains 3 cluster(s) and 0 noise points.
1 2 3
5 5 5
Available fields: order, reachdist, coredist, predecessor, minPts, eps, eps_cl, xi, cluster
The resulting matrix would be:
order | reachdist | coredist | predecessor | cluster | |
---|---|---|---|---|---|
1 | 1 | Inf | 0.333208651272379 | NA | 1 |
2 | 13 | 0.292663608330619 | 0.333843791583614 | 5 | 3 |
3 | 10 | 0.182346160922536 | 0.251820141958916 | 12 | 2 |
4 | 7 | 0.17871163011086 | 0.333208651272379 | 13 | 1 |
5 | 4 | 0.355622421383634 | 0.292663608330619 | 14 | 3 |
6 | 6 | 1.56320382265559 | 0.251820141958916 | 1 | 2 |
7 | 15 | 0.17871163011086 | 0.287675880147458 | 13 | 1 |
8 | 12 | 0.355622421383634 | 0.409445643425412 | 14 | 3 |
9 | 9 | 0.182346160922536 | 0.195940809782328 | 12 | 2 |
10 | 3 | 0.17871163011086 | 0.243693355280387 | 13 | 1 |
11 | 14 | 0.355622421383634 | 0.409445643425412 | 14 | 3 |
12 | 11 | 0.244256566906985 | 0.182346160922536 | 15 | 2 |
13 | 8 | 0.333208651272379 | 0.17871163011086 | 1 | 1 |
14 | 5 | 1.7015610774674 | 0.355622421383634 | 9 | 3 |
15 | 2 | 0.251820141958916 | 0.244256566906985 | 6 | 2 |
Here's the data used in this example:
x | y |
---|---|
-0.089691454662498 | 1.76889309153948 |
2.01848491846467 | 0.087860458092127 |
0.158784533120882 | 0.003580671801523 |
-0.113037567424629 | 2.10128286921271 |
1.9919748243449 | 0.043226515453962 |
0.013242028438109 | 0.209081920524915 |
0.070795472927173 | 1.88000741803561 |
1.97603019758282 | 0.158963820029007 |
0.198447393665293 | 0.195465164222325 |
-0.013878701211967 | 2.00049377768281 |
2.04176507507926 | -0.245170638784613 |
0.098175277746366 | 0.047723730261362 |
-0.039269535550381 | 1.94034418313686 |
1.89603310230511 | 0.079220327029965 |
0.178222896030858 | 0.028963671017735 |
And the python
code on this same data:
import pandas as pd
import numpy as np
from sklearn.cluster import OPTICS
data = pd.read_csv('/tmp/data.csv')
data = data[['x', 'y']]
clustering = OPTICS(min_samples=5, max_bound=10)
clustering.fit(data)
np.savetxt("/tmp/py.csv",
np.transpose(np.array([clustering.ordering_,
clustering.reachability_])))
would result in:
ordering | reachability | core_distances | labels |
---|---|---|---|
0 | inf | 0.333208651272384 | 0 |
12 | 0.29266360833062 | 0.333843791583615 | -1 |
9 | 0.182346160922537 | 0.251820141958916 | -1 |
3 | 0.178711630110863 | 0.333208651272384 | 0 |
6 | 0.355622421383636 | 0.29266360833062 | -1 |
5 | 1.56320382265559 | 0.251820141958916 | -1 |
11 | 0.178711630110863 | 0.287675880147461 | 0 |
2 | 0.29266360833062 | 0.409445643425412 | -1 |
14 | 0.182346160922537 | 0.195940809782328 | -1 |
8 | 0.178711630110863 | 0.243693355280386 | 0 |
13 | 0.292663608330621 | 0.409445643425412 | -1 |
4 | 0.251820141958916 | 0.182346160922537 | -1 |
1 | 0.333208651272384 | 0.178711630110863 | 0 |
7 | 1.7015610774674 | 0.355622421383636 | -1 |
10 | 0.182346160922537 | 0.244256566906986 | -1 |
The results kinda seem closer now!
Huh, interesting. The core distances all agree. My initial impression is that the R version is wrong here. The starting point has coordinates (−0.0896914546625, 1.76889309154), and the algorithm should jump to the next closest neighbor... R version goes to point 15, with coordinates (0.178222896030858, 0.028963671017735), and a distance of 1.76044. Sklearn version goes to point 13, with coordinates (−0.0392695355504, 1.94034418314), and a much closer distance of 0.17871. Both algorithms agree on the reachability (and core) distance for point 13, but I can't figure out why the R version orders it number 8 in the ordered list instead of number 2. The points should be ordered according to cluster; points 1, 13, 10, 7, and 4 are all identified as a cluster in both R and Sklearn, but I'd expect them to be adjacent in the ordering scheme (they are in sklearn, they are not at all in R...kinda curious how they got the labeling correct since they aren't in contiguous order, I guess something to do with this predecessor label?).
I'm reviewing the source paper to see if there is something I'm missing for the core algorithm here...
Worth comparing to Weka/ELKI?
I really should think about how we can better facilitate cross-language comparisons for new implementations.
I'm not used to weka, this is what I got out of it:
=== Run information ===
Scheme: weka.clusterers.OPTICS -E 10.0 -M 5 -A "weka.core.EuclideanDistance -R first-last" -F -no-gui -db-output .
Relation: data-weka.filters.unsupervised.attribute.Remove-R1
Instances: 15
Attributes: 2
x
y
Test mode: evaluate on training data
=== Clustering model (full training set) ===
OPTICS clustering results
============================================================================================
Clustered DataObjects: 15
Number of attributes: 2
Epsilon: 10.0; minPoints: 5
Write results to file: yes
Distance-type:
Number of generated clusters: 0
Elapsed time: .0
( 0.) -0.089691,1.768893 --> c_dist: 0.142 r_dist: UNDEFINED
(12.) -0.03927,1.940344 --> c_dist: 0.077 r_dist: 0.142
( 3.) -0.113038,2.101283 --> c_dist: 0.142 r_dist: 0.077
( 6.) 0.070795,1.880007 --> c_dist: 0.127 r_dist: 0.077
( 9.) -0.013879,2.000494 --> c_dist: 0.105 r_dist: 0.077
( 5.) 0.013242,0.209082 --> c_dist: 0.111 r_dist: 0.666
(11.) 0.098175,0.047724 --> c_dist: 0.079 r_dist: 0.111
(14.) 0.178223,0.028964 --> c_dist: 0.108 r_dist: 0.079
( 8.) 0.198447,0.195465 --> c_dist: 0.086 r_dist: 0.079
( 2.) 0.158785,0.003581 --> c_dist: 0.111 r_dist: 0.079
(13.) 1.896033,0.07922 --> c_dist: 0.154 r_dist: 0.789
( 1.) 2.018485,0.08786 --> c_dist: 0.142 r_dist: 0.154
(10.) 2.041765,-0.245171 --> c_dist: 0.175 r_dist: 0.142
( 7.) 1.97603,0.158964 --> c_dist: 0.175 r_dist: 0.142
( 4.) 1.991975,0.043227 --> c_dist: 0.125 r_dist: 0.142
Time taken to build model (full training data) : 0.01 seconds
=== Model and evaluation on training set ===
Clustered Instances
Unclustered instances : 15
thanks! but that's not with the same 100 instances is it? it says 15
The latest example I used for R and Python in the results I put here were also with the same 15 points. I guess it shows the difference. I was trying to make the example as small as I can.
Oh, I must have missed some comments. Weka looks quite different in its distances?
what's the status of this?
I took a look at the original code for automatic extraction from an OPTICS ordered list, and the same bug is present. You can decouple the build of the graph from the graph extraction-- so if you use a different OPTICS extraction function to build the graph on the points in this example and pass it to the unmodified auto_extract code from @amyxzhang, the same bug crops up. In other words, I've isolated where the bug is, but I don't really know how to fix it.
Following up from @jnothman request in #11838 , here is the reference paper implementation pseudo code.
Converting reachability plot to dendrogram:
Main algorithm with tree trimming:
Hey all, sorry I just got notice of this issue today! It's been a SUPER long time since I looked at this code but I hope this explanation helps.
Basically the way the algorithm works (as described in the above pseudocode) is that, given a reachability plot, it searches for local maximas to iteratively split the plot, starting from the highest local maxima. A local maxima is a point on the reachability plot where the point to the left is lower and the point to the right is lower and the split that would occur does not create two groups that are now too small (according to given thresholds).
So the bug in the first script by @adrinjalali is because that outlier is at the end of the plot and thus doesn't count as a local maxima given the criteria. Some code could be written to consider potentially rejecting points at the ends of the reachability plot based on some threshold but I don't think it's currently part of the algorithm.
You'll notice that in both the 1st and 2nd script by @adrinjalali, there are middle points in the reachability plots that don't get added to a cluster even though they don't seem like outliers in the scatterplot. @espg is right that this is expected behavior of the algorithm as it stands - those points are splitpoints to divide a cluster in two and don't get added to the new right-hand cluster. Again, this could be revisited using some sort of threshold that looks at the average distance of the cluster to the right of the splitpoint.
There was also a point raised about too many outliers. That depends on the various parameters input towards the extraction algorithm. Also currently, the auto-extractor is only returning leaves of the resulting cluster tree as opposed to all nodes, leading to more outliers, since it's only returning the most densely packed clusters. You could if you wanted to, slice the tree at a particular depth and get a different set of clusters (much as you would in DBSCAN).
In the end, I don't think these issues are showstoppers if there's a deployment going out today.
Thanks @amyxzhang.
If I may, I have one worry regarding labelling those split points as outliers. In practice, with large enough amount of data, people are likely to not investigate why a data point is labelled -1, and simply count them as an outlier.
Here's an undesirable scenario: a data scientist working in a fraud detection department is looking for algorithms to find outliers to raise their risk value, and report them as potential fraud. Since the outliers and split points are labelled the same way, the people whose data happen to be the split points, would potentially be treated unfairly as potential fraud.
That's why I would argue that those split points should either be included in the clusters, or be labelled specifically as a split point (-2 as opposed to -1 for instance).
I know in an ideal world the above scenario wouldn't happen, but unfortunately it's likely that once an algorithm gets into scikit-learn, it's used w/o careful consideration of the details of the algorithm.
That's of course just my one cent contribution to the discussion.
Yes @adrinjalali, I agree that something could be done about the issues, maybe the strategies I mentioned above, as a final pass over the clusters perhaps. I think there would need to be testing to consider edge cases though. It may be trickier than anticipated with noisier data (lots of potential splitpoints, some of which are discarded). Back when I was working on this, I was mainly interested in the spatial boundary delineation of highly dense areas, so the issue did not come up.
I've not yet had the time to understand the details here. But would it be correct to say that the implementation is faithful to the reference paper, but that there is an ambiguity in the algorithm that could be handled in a way that would be friendlier to ?some users.
?
Right, the implementation follows the algorithm, which doesn't give guidance for splitpoints and outlier points at the very end of the reachability plot. I think some small tweaks, with sufficient testing, can resolve these issues to make these examples conform to expectations.
well, fwiw it would be very good to get this changed before the imminent release if we can come up with specified functionality ..
I'm reading the original paper, and I'll get into the details of the implementation (for my own curiosity). Then I'll probably be able to fix some of these issues. But if anybody has the time and is already familiar with the code/algorithm, please don't wait for me. Anyhow, I'll try my best.
It may be possible to deal with the split points by lumping them into the left-hand objects if their core distance is small. Since OPTICS removes points from consideration once they've been processed, and always picks the next closest point, the last point in a cluster tends to have large reachability distances and often gets marked as noise. Since we know this, and read the graph from left to right, we could have these points marked as cluster members. This should work for cases that have no noise; for cases that have noise (i.e., where the split point is a noise point not close to any cluster), I'd expect that the core distance would be closer to the reachability distance. For the inverse case (where the spit point is part of the cluster), the reachability distance should be large, but the core distance should be much smaller. If we screen this based off of a ratio of those two numbers:
The other path I see is doing some sort of post processing of the split points after the routine finishes (i.e., by labeling split points as -2, and then doing nearest neighbor lookups to determine new labels as either cluster members or noise)...but my guess is that this second approach will be more error prone, especially in cases where the split point is noise, but has no other noise points nearby to infer correct labeling.
@espg that sounds reasonable. I've put a version of it in the PR #11857 (threshold hard coded as 1.5
).
Please excuse all the debugging prints for now, the patch is around line 675.
I'll try to test it in different scenarios and also fix the "last point being a NOISE" issue.
Boundary points on spikes in OPTICS plots are a bit tricky to handle. That is why ELKI uses the predecessor (and R copied that from ELKI).
Erich Schubert, Michael Gertz: Improving the Cluster Structure Extracted from OPTICS Plots. LWDA 2018: 318-329
Thanks @kno10. Is it the DBSCAN package you're talking about? As far as I know, R only includes the original paper's method (DBSCAN and Xi), with some fixes. Which package are you talking about? I've used the dbscan
package as reference.
@adrinjalali The dbscan
package states: "The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation". But the same technique can (and should) be used with any other plot-based extraction technique.
The article mentioned above http://ceur-ws.org/Vol-2191/paper37.pdf explains why the "last" point of a valley may or may not belong to the cluster, and how to use the predecessor information to distinguish these two cases. If you always include the last point, you'll get "spike" artifacts.
inf
reachability values, PR #12029 fixes the issue (needs a second review)metric
parameter, almost fixed, it still doesn't support sparse input, issue #12009, #11982, PR #12028 (needs a second review)quick_scan
with isclose
: PR #11929 (waiting to be merged)reachability_
values consistent with a reference implementation (See https://github.com/scikit-learn/scikit-learn/pull/11929#issuecomment-419301414)maxima_ratio
check, or compare with all members of the cluster (See https://github.com/scikit-learn/scikit-learn/pull/11857#discussion_r215698537)Closing this as the sqlnk
algorithm having these issues is removed.
The following example shows
OPTICS
on a small dataset, with one outlier. It seems that the outlier is incorrectly detected.