DENCLUE-Clustering fails for non-cluster-situation or points very close to each other

brainbytes42 commented 4 years ago

Describe the bug For some data-sets, the DENCLUE-Algorithm fails with an exception:

IllegalArgumentException: Invalid radius: 0.0
IllegalStateException: Attractors contains NaN/infinity. sigma is likely too small.

Expected behavior For the given data, I expect no exception in both cases.

In the first case, I expect the data unclustered (=OUTLIER), as each point has it's own kernel disjoint from all other points with the given bandwidth. See the following image of the kernel density:
In the second case, all points are very close to each other, the kernel density shows one big peak. Thererfore, I expect a cluster, but (again) no exception:

Actual behavior DENCLUE.fit(data, sigma, m, tol, minPts) fails with an Exception:

First case:

java.lang.IllegalArgumentException: Invalid radius: 0.0
at smile.clustering.DBSCAN.fit(DBSCAN.java:158)
at smile.clustering.DBSCAN.fit(DBSCAN.java:131)
at smile.clustering.DENCLUE.fit(DENCLUE.java:154)
at ml.DenclueBugreport2.main(DenclueBugreport2.java:26)

Second case:

java.lang.IllegalStateException: Attractors contains NaN/infinity. sigma is likely too small.
at smile.clustering.DENCLUE.fit(DENCLUE.java:143)
at ml.DenclueBugreport2.main(DenclueBugreport2.java:48)

Code snippet

import java.util.Arrays;
import smile.clustering.DENCLUE;
import smile.stat.distribution.KernelDensity;

public class DenclueBugreport2 {

    public static void main(String[] args) {

        try {
            double[][] data = {{1}, {2}, {3}, {4}};
            double sigma = 0.1;
            int m = data.length;
            double tol = 0.001;
            int minPts = 1;

            // for illustration, print kernel density: all kernels are separate!
            final KernelDensity kernelDensity = new KernelDensity(Arrays.stream(data).mapToDouble(x -> x[0]).toArray(), sigma);
            for (double x = 0; x < 5; x += 0.1) {
                System.out.printf("p(%.2f) = %.10f%n", x, kernelDensity.p(x));
            }

            // fails with exception: java.lang.IllegalArgumentException: Invalid radius: 0.0
            DENCLUE denclue = DENCLUE.fit(data, sigma, m, tol, minPts);

        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("++++++++++++++++++++++++++++++++++++++++");

        try {
            double[][] data = {{342.3619853516575}, {342.3624853515066}, {345.23949951166287}, {345.2464880370535}, {345.24698803713545}, {345.24698803713545}};
            double sigma = 3.7052341353520095;
            int m = data.length;
            double tol = 0.001;
            int minPts = 1;

            // for illustration, print kernel density
            final KernelDensity kernelDensity = new KernelDensity(Arrays.stream(data).mapToDouble(x -> x[0]).toArray(), sigma);
            for (double x = 330; x < 360; x += 0.5) {
                System.out.printf("p(%.2f) = %.10f%n", x, kernelDensity.p(x));
            }

            // fails with exception: java.lang.IllegalArgumentException: Invalid radius: 0.0
            DENCLUE denclue = DENCLUE.fit(data, sigma, m, tol, minPts);
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

}

Input data See code.

Additional context

What Java (OpenJDK, Orack JDK, etc.) are you using and which Java version: Gradle 5.0 / OpenJDK 8.0 (Corretto)
Which Smile version: 2.5.3
What is your build system: Windows

These examples seem to be some kind of border cases, which won't happen too often - but they do happen for me. It's possible to catch the exceptions - yes - but especially in the second case, the valid cluster is missed, therefore the results are false.

Thank you very much!

haifengl commented 4 years ago

DENCLUE is designed for clustering spatial data, i.e. at least 2-dimensional. We have no plan to support one-dimensional data.

brainbytes42 commented 4 years ago

That's too bad, as it works very well for the vast majority of (1D-)cases and only fails with this few more "special" cases. So it's almost there...

Side-Note: Embedding the given data into 2D-Space leads to the same results - so it's possible (obviously: with lower probability...) to run into this issue for 2-dimensional data as well. (I verified that for embeddings by setting one dimension to 0, by setting x=y.)

// 1
double[][] data = {{1,1}, {2,2}, {3,3}, {4,4}};

// 2
double[][] data = {{0,342.3619853516575}, {0,342.3624853515066}, {0,345.23949951166287}, {0,345.2464880370535}, {0,345.24698803713545}, {0,345.24698803713545}};

Side-Note 2: It fails also for "real" 2D-Data for x = y^2:

// 1
double[][] data = {{1, 1}, {2, 4}, {3, 9}, {4, 16}};

// 2
double[][] data = {{342.3619853516575}, {342.3624853515066}, {345.23949951166287}, {345.2464880370535}, {345.24698803713545}, {345.24698803713545}};
// x -> (x, x^2)
data = Arrays.stream(data).map(row -> new double[]{row[0], row[0] * row[0]}).toArray(double[][]::new);

Nevertheless, it would be nice to have this restriction documented (as it is for high-dimensional data), which would have saved me some time and effort, now obviously not even for the intended use.

haifengl commented 4 years ago

The intrinsic dimension is one no matter how many more 0 columns you add to the data.

It is not a restriction but a common sense for any clustering algorithm. It doesn't hurt to learn basic ML concepts first.

brainbytes42 commented 3 years ago

Well, I can assure you, I'm quite familiar with those concepts - if you read (and understand - my English isn't perfect, but not that bad, i hope - so please don't be insulting) what I wrote, you'll notice that I said it is an embedding into the 2D. So the data remains, as it is embedded, 1D, that's not the question. But, unsupervised is not commonly understood as "you need to hope for your data not possibly to be in some weird kind of shape that fails undocumented even if it has clusters"...

Apart from that, as I have shown, it fails for "real" 2D-data as well, but that seems to be of no interest...

Whatever. So long and thanks for all the fish - have a nice day.

haifengl commented 3 years ago

I apologize if my comment offended you. What I tried to say is that DENCLUE paper says clearly that it is for spatial data.

brainbytes42 commented 3 years ago

Thank you, I'm glad to hear this. Maybe I was a bit harsh then, too.

Last two remarks from my side:

I have no access to the original DENCLUE-Paper, but the DENCLUE 2.0-Paper, that is also referenced for your implementation, does not require the data to be at least two-dimensional. In section two, they introduce the Algorithm being suitable for data in d-dimensional space, d being a natural number. They even motivate the concept using a one-dimensional example. (And as you might have guessed, I see no reason, why this would be inappropriate... ;-) )
Maybe it's a bit far off, but for the embedding-discussion an analogy came to my mind, that might clarify the problem: Imagine a long and straight road which would fit a 2D-space. But in our world, it is embedded into our 3D-space. You probably wouldn't hesitate, to drive your car (the algorithm) along that road and expect it to work smoothly. The only thing you would have to do, is hold your steerin wheel straight - you would not say, the road is impossible to drive, as it has no curves or other non-2D-shapes in it. Now, the question would be, is there some technical reason, the car couldn't drive straigt - maybe it explodes, if the steering wheel isn't turned at least once a minute or whatever... I see this restriction neither for the car, nor for a Kernel or a Hill-Climbing.

Nevertheless, meanwhile I implemented my own version of a kernel density based clustering for 1-D-data, which might be a little rough and doesn't scale that well, but it seems to be more reliable for my not-that-big data.

Apart from that, I'm glad SMILE exists, as it gives us a nice toolset for the java-world. So: thank you for that. :-)

haifengl commented 3 years ago

Clustering analysis is usually a multi-variate technique. DENCLUE is a little different. DENCLUE class implements DENCLUE 2.0 algorithm. Theoretically, DENCLUE can work on 1d data. It is just hill climbing process. What the paper doesn't tell is that data points rarely converge to the same attractors even if they belong to the same cluster. In fact, we often end with many cluttered attractors. To handle this, we add the second phase after the DENCLUE: clustering attractors with another algorithm. We choose DBSCAN for this phase. This combination works very well in most case. But it fails in 1-dimensional sometimes. The errors in your cases are actually from DBSCAN part. There is no evidence that DBSCAN works on 1-dimensional data.

As said, we develop clustering package for multivariate analysis. So we have no intention to work on this edge case.

eQX8HNMTU2 commented 3 years ago

@haifengl

what can i use if i want to cluster one dimensional data with kernel density estimation. I thought that's the thing you use when you want to cluster one dimensional data.

eQX8HNMTU2 commented 3 years ago

@brainbytes42 can you show what u did to do the one dimensional case?

brainbytes42 commented 3 years ago

@brainbytes42 can you show what u did to do the one dimensional case?

unfortunately, I can't show the code - but basically, I have reimplemented a simple version of Denclue 2.0 around smile's smile.stat.distribution.KernelDensity, to reinvent only half the wheel... A little bit of hill-climbing and a reasonable epsilon... So basically, 1D-data ist not a limitation of the algorithm, but only in smile's implementation, unfortunately.

eQX8HNMTU2 commented 3 years ago

@brainbytes42 I was just thinking about using the KernelDensity class and calculating the local maxima. But I don't know a smart way of calculating them. Does the SMILE library have anything that can calculate local maxima of the probability density function?

brainbytes42 commented 3 years ago

I don't think so... But luckily, hill climbing isn't that hard to implement for 1D-Data... ;-)

haifengl / smile

DENCLUE-Clustering fails for non-cluster-situation or points very close to each other #614