haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

DENCLUE-Clustering fails for non-cluster-situation or points very close to each other #614

Closed. brainbytes42 closed this issue 3 years ago.

brainbytes42 commented 4 years ago

Describe the bug

For some data sets, the DENCLUE algorithm fails with an exception:

  1. IllegalArgumentException: Invalid radius: 0.0
  2. IllegalStateException: Attractors contains NaN/infinity. sigma is likely too small.

Expected behavior

For the given data, I expect no exception in either case.

  1. In the first case, I expect the data to be unclustered (= OUTLIER), as each point has its own kernel, disjoint from all other points at the given bandwidth. (See the attached kernel-density plot.)

  2. In the second case, all points are very close to each other and the kernel density shows one big peak. Therefore, I expect a cluster, but (again) no exception. (See the attached kernel-density plot.)

Actual behavior

DENCLUE.fit(data, sigma, m, tol, minPts) fails with an exception:

  1. First case:
    java.lang.IllegalArgumentException: Invalid radius: 0.0
    at smile.clustering.DBSCAN.fit(DBSCAN.java:158)
    at smile.clustering.DBSCAN.fit(DBSCAN.java:131)
    at smile.clustering.DENCLUE.fit(DENCLUE.java:154)
    at ml.DenclueBugreport2.main(DenclueBugreport2.java:26)
  2. Second case:
    java.lang.IllegalStateException: Attractors contains NaN/infinity. sigma is likely too small.
    at smile.clustering.DENCLUE.fit(DENCLUE.java:143)
    at ml.DenclueBugreport2.main(DenclueBugreport2.java:48)

Code snippet

import java.util.Arrays;
import smile.clustering.DENCLUE;
import smile.stat.distribution.KernelDensity;

public class DenclueBugreport2 {

    public static void main(String[] args) {

        try {
            double[][] data = {{1}, {2}, {3}, {4}};
            double sigma = 0.1;
            int m = data.length;
            double tol = 0.001;
            int minPts = 1;

            // for illustration, print kernel density: all kernels are separate!
            final KernelDensity kernelDensity = new KernelDensity(Arrays.stream(data).mapToDouble(x -> x[0]).toArray(), sigma);
            for (double x = 0; x < 5; x += 0.1) {
                System.out.printf("p(%.2f) = %.10f%n", x, kernelDensity.p(x));
            }

            // fails with exception: java.lang.IllegalArgumentException: Invalid radius: 0.0
            DENCLUE denclue = DENCLUE.fit(data, sigma, m, tol, minPts);

        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("++++++++++++++++++++++++++++++++++++++++");

        try {
            double[][] data = {{342.3619853516575}, {342.3624853515066}, {345.23949951166287}, {345.2464880370535}, {345.24698803713545}, {345.24698803713545}};
            double sigma = 3.7052341353520095;
            int m = data.length;
            double tol = 0.001;
            int minPts = 1;

            // for illustration, print kernel density
            final KernelDensity kernelDensity = new KernelDensity(Arrays.stream(data).mapToDouble(x -> x[0]).toArray(), sigma);
            for (double x = 330; x < 360; x += 0.5) {
                System.out.printf("p(%.2f) = %.10f%n", x, kernelDensity.p(x));
            }

            // fails with exception: java.lang.IllegalStateException: Attractors contains NaN/infinity. sigma is likely too small.
            DENCLUE denclue = DENCLUE.fit(data, sigma, m, tol, minPts);
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

}

Input data See code.

Additional context

These examples seem to be edge cases that won't occur too often - but they do occur for me. Yes, it's possible to catch the exceptions, but especially in the second case a valid cluster is missed, so the results are wrong.

Thank you very much!

haifengl commented 4 years ago

DENCLUE is designed for clustering spatial data, i.e. at least 2-dimensional. We have no plan to support one-dimensional data.

brainbytes42 commented 4 years ago

That's too bad, as it works very well for the vast majority of (1D) cases and only fails in these few more "special" cases. So it's almost there...

Side note: Embedding the given data into 2D space leads to the same results - so it's possible (obviously with lower probability...) to run into this issue for 2-dimensional data as well. (I verified this for embeddings created by setting one dimension to 0 and by setting x = y.)

// 1
double[][] data = {{1,1}, {2,2}, {3,3}, {4,4}};

// 2
double[][] data = {{0,342.3619853516575}, {0,342.3624853515066}, {0,345.23949951166287}, {0,345.2464880370535}, {0,345.24698803713545}, {0,345.24698803713545}};

Side note 2: It also fails for "real" 2D data with y = x^2:

// 1
double[][] data = {{1, 1}, {2, 4}, {3, 9}, {4, 16}};

// 2
double[][] data = {{342.3619853516575}, {342.3624853515066}, {345.23949951166287}, {345.2464880370535}, {345.24698803713545}, {345.24698803713545}};
// x -> (x, x^2)
data = Arrays.stream(data).map(row -> new double[]{row[0], row[0] * row[0]}).toArray(double[][]::new);

Nevertheless, it would be nice to have this restriction documented (as it is for high-dimensional data); that would have saved me some time and effort, which was apparently spent on something that isn't even the intended use.

haifengl commented 4 years ago

The intrinsic dimension is one no matter how many more 0 columns you add to the data.

It is not a restriction but common sense for any clustering algorithm. It doesn't hurt to learn basic ML concepts first.

brainbytes42 commented 3 years ago

Well, I can assure you that I'm quite familiar with those concepts. If you read (and understand - my English isn't perfect, but not that bad, I hope, so please don't be insulting) what I wrote, you'll notice that I said it is an embedding into 2D. So the data remains 1D as it is embedded; that's not the question. But "unsupervised" is not commonly understood as "you have to hope your data isn't in some weird kind of shape that fails, undocumented, even if it has clusters"...

Apart from that, as I have shown, it fails for "real" 2D-data as well, but that seems to be of no interest...

Whatever. So long and thanks for all the fish - have a nice day.

haifengl commented 3 years ago

I apologize if my comment offended you. What I tried to say is that the DENCLUE paper clearly states that it is for spatial data.

brainbytes42 commented 3 years ago

Thank you, I'm glad to hear this. Maybe I was a bit harsh then, too.

Last two remarks from my side:

Nevertheless, I have meanwhile implemented my own version of kernel-density-based clustering for 1D data. It may be a little rough and doesn't scale that well, but it seems to be more reliable for my not-that-big data.

Apart from that, I'm glad SMILE exists, as it gives us a nice toolset for the Java world. So: thank you for that. :-)

haifengl commented 3 years ago

Clustering analysis is usually a multivariate technique. DENCLUE is a little different. The DENCLUE class implements the DENCLUE 2.0 algorithm. Theoretically, DENCLUE can work on 1D data; it is just a hill-climbing process. What the paper doesn't tell you is that data points rarely converge to the same attractor, even if they belong to the same cluster. In fact, we often end up with many cluttered attractors. To handle this, we add a second phase after DENCLUE: clustering the attractors with another algorithm. We chose DBSCAN for this phase. This combination works very well in most cases, but it sometimes fails on one-dimensional data. The errors in your cases actually come from the DBSCAN part. There is no evidence that DBSCAN works on one-dimensional data.

As said, we develop the clustering package for multivariate analysis, so we have no intention of working on this edge case.

eQX8HNMTU2 commented 3 years ago

@haifengl

What can I use if I want to cluster one-dimensional data with kernel density estimation? I thought that's the thing you use when you want to cluster one-dimensional data.

eQX8HNMTU2 commented 3 years ago

@brainbytes42 can you show what you did for the one-dimensional case?

brainbytes42 commented 3 years ago

@brainbytes42 can you show what you did for the one-dimensional case?

Unfortunately, I can't show the code - but basically, I reimplemented a simple version of DENCLUE 2.0 around SMILE's smile.stat.distribution.KernelDensity, to reinvent only half the wheel... a little bit of hill climbing and a reasonable epsilon... So basically, 1D data is not a limitation of the algorithm, only of SMILE's implementation, unfortunately.
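For illustration, here is a rough, self-contained sketch of the general idea (hypothetical code, not my actual implementation; all names are made up, and the inline Gaussian KDE plus the simple tolerance-based merge stand in for smile.stat.distribution.KernelDensity and SMILE's DBSCAN phase):

```java
import java.util.ArrayList;
import java.util.List;

public class Denclue1dSketch {

    // Gaussian kernel density estimate at x for bandwidth sigma.
    static double density(double[] data, double sigma, double x) {
        double sum = 0.0;
        for (double xi : data) {
            double z = (x - xi) / sigma;
            sum += Math.exp(-0.5 * z * z);
        }
        return sum / (data.length * sigma * Math.sqrt(2 * Math.PI));
    }

    // DENCLUE 2.0 hill-climbing step for the Gaussian kernel: the next
    // position is the kernel-weighted mean of the data around x.
    static double climbStep(double[] data, double sigma, double x) {
        double num = 0.0, den = 0.0;
        for (double xi : data) {
            double z = (x - xi) / sigma;
            double w = Math.exp(-0.5 * z * z);
            num += w * xi;
            den += w;
        }
        return den == 0 ? x : num / den;
    }

    // Climb from x until the update falls below tol (the density attractor).
    static double attractor(double[] data, double sigma, double x, double tol) {
        for (int i = 0; i < 1000; i++) {
            double next = climbStep(data, sigma, x);
            if (Math.abs(next - x) < tol) return next;
            x = next;
        }
        return x;
    }

    // Phase two: merge attractors that lie within mergeTol of each other,
    // assigning each point the label of its (merged) attractor.
    static int[] cluster(double[] data, double sigma, double tol, double mergeTol) {
        List<Double> centers = new ArrayList<>();
        int[] labels = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            double a = attractor(data, sigma, data[i], tol);
            int label = -1;
            for (int c = 0; c < centers.size(); c++) {
                if (Math.abs(centers.get(c) - a) < mergeTol) { label = c; break; }
            }
            if (label < 0) { label = centers.size(); centers.add(a); }
            labels[i] = label;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Second case from the report: sigma is large enough that the
        // density has a single peak, so one cluster is expected.
        double[] data = {342.3619853516575, 342.3624853515066, 345.23949951166287,
                         345.2464880370535, 345.24698803713545, 345.24698803713545};
        int[] labels = cluster(data, 3.7052341353520095, 1e-6, 0.1);
        for (int l : labels) System.out.print(l + " ");
        System.out.println();
    }
}
```

With the first data set from the report ({1}, {2}, {3}, {4}, sigma 0.1) each point ends up as its own attractor; with the second set all points converge to one attractor, i.e. one cluster instead of an exception.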

eQX8HNMTU2 commented 3 years ago

@brainbytes42 I was just thinking about using the KernelDensity class and calculating the local maxima. But I don't know a smart way of calculating them. Does the SMILE library have anything that can calculate local maxima of the probability density function?

brainbytes42 commented 3 years ago

I don't think so... But luckily, hill climbing isn't that hard to implement for 1D data... ;-)
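For what it's worth, a minimal sketch of the grid-scan idea (hypothetical code; the inline Gaussian KDE stands in for smile.stat.distribution.KernelDensity, whose p(x) could be plugged in instead):

```java
import java.util.ArrayList;
import java.util.List;

public class KdeModes {

    // Gaussian kernel density estimate at x for bandwidth sigma.
    static double density(double[] data, double sigma, double x) {
        double sum = 0.0;
        for (double xi : data) {
            double z = (x - xi) / sigma;
            sum += Math.exp(-0.5 * z * z);
        }
        return sum / (data.length * sigma * Math.sqrt(2 * Math.PI));
    }

    // Scan [lo, hi] with the given step and keep every grid point whose
    // density is higher than both of its neighbors (a local maximum).
    static List<Double> localMaxima(double[] data, double sigma,
                                    double lo, double hi, double step) {
        List<Double> modes = new ArrayList<>();
        double prev = density(data, sigma, lo);
        double curr = density(data, sigma, lo + step);
        for (double x = lo + 2 * step; x <= hi; x += step) {
            double next = density(data, sigma, x);
            if (curr > prev && curr >= next) modes.add(x - step);
            prev = curr;
            curr = next;
        }
        return modes;
    }

    public static void main(String[] args) {
        // With sigma = 0.1 every kernel is isolated, so four modes appear,
        // one near each data point.
        double[] data = {1, 2, 3, 4};
        System.out.println(localMaxima(data, 0.1, 0.0, 5.0, 0.01));
    }
}
```

The grid resolution bounds the precision of the modes; a found maximum could then be refined by hill climbing (or a few bisection steps) if needed.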