Closed brainbytes42 closed 3 years ago
DENCLUE is designed for clustering spatial data, i.e. at least 2-dimensional. We have no plan to support one-dimensional data.
That's too bad, as it works very well for the vast majority of (1D-)cases and only fails with this few more "special" cases. So it's almost there...
Side-Note: Embedding the given data into 2D-Space leads to the same results - so it's possible (obviously: with lower probability...) to run into this issue for 2-dimensional data as well. (I verified that for embeddings by setting one dimension to 0, by setting x=y.)
// 1
double[][] data = {{1,1}, {2,2}, {3,3}, {4,4}};
// 2
double[][] data = {{0,342.3619853516575}, {0,342.3624853515066}, {0,345.23949951166287}, {0,345.2464880370535}, {0,345.24698803713545}, {0,345.24698803713545}};
Side-Note 2: It fails also for "real" 2D-Data for x = y^2:
// 1
double[][] data = {{1, 1}, {2, 4}, {3, 9}, {4, 16}};
// 2
double[][] data = {{342.3619853516575}, {342.3624853515066}, {345.23949951166287}, {345.2464880370535}, {345.24698803713545}, {345.24698803713545}};
// x -> (x, x^2)
data = Arrays.stream(data).map(row -> new double[]{row[0], row[0] * row[0]}).toArray(double[][]::new);
Nevertheless, it would be nice to have this restriction documented (as it is for high-dimensional data), which would have saved me some time and effort, now obviously not even for the intended use.
The intrinsic dimension is one no matter how many more 0 columns you add to the data.
It is not a restriction but a common sense for any clustering algorithm. It doesn't hurt to learn basic ML concepts first.
Well, I can assure you, I'm quite familiar with those concepts - if you read (and understand - my English isn't perfect, but not that bad, i hope - so please don't be insulting) what I wrote, you'll notice that I said it is an embedding into the 2D. So the data remains, as it is embedded, 1D, that's not the question. But, unsupervised is not commonly understood as "you need to hope for your data not possibly to be in some weird kind of shape that fails undocumented even if it has clusters"...
Apart from that, as I have shown, it fails for "real" 2D-data as well, but that seems to be of no interest...
Whatever. So long and thanks for all the fish - have a nice day.
I apologize if my comment offended you. What I tried to say is that DENCLUE paper says clearly that it is for spatial data.
Thank you, I'm glad to hear this. Maybe I was a bit harsh then, too.
Last two remarks from my side:
Nevertheless, meanwhile I implemented my own version of a kernel density based clustering for 1-D-data, which might be a little rough and doesn't scale that well, but it seems to be more reliable for my not-that-big data.
Apart from that, I'm glad SMILE exists, as it gives us a nice toolset for the java-world. So: thank you for that. :-)
Clustering analysis is usually a multi-variate technique. DENCLUE is a little different. DENCLUE class implements DENCLUE 2.0 algorithm. Theoretically, DENCLUE can work on 1d data. It is just hill climbing process. What the paper doesn't tell is that data points rarely converge to the same attractors even if they belong to the same cluster. In fact, we often end with many cluttered attractors. To handle this, we add the second phase after the DENCLUE: clustering attractors with another algorithm. We choose DBSCAN for this phase. This combination works very well in most case. But it fails in 1-dimensional sometimes. The errors in your cases are actually from DBSCAN part. There is no evidence that DBSCAN works on 1-dimensional data.
As said, we develop clustering package for multivariate analysis. So we have no intention to work on this edge case.
@haifengl
what can i use if i want to cluster one dimensional data with kernel density estimation. I thought that's the thing you use when you want to cluster one dimensional data.
@brainbytes42 can you show what u did to do the one dimensional case?
@brainbytes42 can you show what u did to do the one dimensional case?
unfortunately, I can't show the code - but basically, I have reimplemented a simple version of Denclue 2.0 around smile's smile.stat.distribution.KernelDensity
, to reinvent only half the wheel... A little bit of hill-climbing and a reasonable epsilon... So basically, 1D-data ist not a limitation of the algorithm, but only in smile's implementation, unfortunately.
@brainbytes42 I was just thinking about using the KernelDensity class and calculating the local maxima. But I don't know a smart way of calculating them. Does the SMILE library have anything that can calculate local maxima of the probability density function?
I don't think so... But luckily, hill climbing isn't that hard to implement for 1D-Data... ;-)
Describe the bug For some data-sets, the DENCLUE-Algorithm fails with an exception:
IllegalArgumentException: Invalid radius: 0.0
IllegalStateException: Attractors contains NaN/infinity. sigma is likely too small.
Expected behavior For the given data, I expect no exception in both cases.
In the first case, I expect the data unclustered (=OUTLIER), as each point has it's own kernel disjoint from all other points with the given bandwidth. See the following image of the kernel density:
In the second case, all points are very close to each other, the kernel density shows one big peak. Thererfore, I expect a cluster, but (again) no exception:
Actual behavior DENCLUE.fit(data, sigma, m, tol, minPts) fails with an Exception:
Code snippet
Input data See code.
Additional context
These examples seem to be some kind of border cases, which won't happen too often - but they do happen for me. It's possible to catch the exceptions - yes - but especially in the second case, the valid cluster is missed, therefore the results are false.
Thank you very much!