haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Wrong DENCLUE Cluster-Results #613

Closed brainbytes42 closed 3 years ago

brainbytes42 commented 3 years ago

Describe the bug Using the DENCLUE algorithm to cluster one-dimensional data (i.e. finding peaks in KernelDensity and assigning cluster data) works well in most cases. Sometimes, though, the clustering produces two clusters where there should be only one.

Expected behavior For the given data and config, I expect a single cluster with a single attractor position at the kernel density peak. As shown by the use of KernelDensity, a single peak is expected at 979.25: [kernel density plot]

Actual behavior The returned clustering consists of two clusters, with their attractors close to the real peak, one to the left and one to the right: around 978.2... and 980.6...

[Previously, I used the denclue.y results array for clustering, which triggered this behavior more often. I found that using predict(.) for the same data delivers more stable results, i.e. produces fewer "split clusters". I thought this change could be a usable workaround, but meanwhile I found the case below, which fails even with the workaround in place.]
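A minimal sketch of that comparison (using the denclue model and data array from the snippet below):

// compare the labels stored during fit with the labels returned by predict(.);
// `denclue` and `data` are defined in the full snippet below
for (int i = 0; i < data.length; i++) {
    int fitted = denclue.y[i];                // partition computed during fit
    int predicted = denclue.predict(data[i]); // partition from predicting the same point
    if (fitted != predicted) {
        System.out.printf("point %d: y = %d, predict(.) = %d%n", i, fitted, predicted);
    }
}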

Code snippet

import java.util.Arrays;
import java.util.Comparator;
import smile.clustering.DENCLUE;
import smile.stat.distribution.KernelDensity;

public class DenclueBugreport {

    public static void main(String[] args) {

        // @formatter:off
        double[][] data = {{946.4020195314661}, {967.9050009767525}, {967.983022460714}, {971.5659926757216}, {982.445025390014}, {1004.1194995115511}, {1007.0109956054948}, {1007.7774868165143}, {1039.149496093858}};
        // @formatter:on

        double sigma = 19.211259991771023; // kernel bandwidth (smoothing parameter)
        int m = data.length;               // number of selected samples used in the iteration
        double tol = 0.001;                // tolerance of the hill-climbing procedure
        int minPts = 1;                    // minimum points required to form a cluster

        DENCLUE denclue = DENCLUE.fit(data, sigma, m, tol, minPts);

        final KernelDensity kernelDensity = new KernelDensity(Arrays.stream(data).mapToDouble(x -> x[0]).toArray(), sigma);

        Arrays.stream(data)
              .sorted(Comparator.comparingDouble(x -> x[0]))
              .forEach(x -> System.out.printf("predict( %4.10f ) = %10d  |  KD = %.10f%n",
                      x[0], denclue.predict(x), kernelDensity.p(x[0])));

        System.out.println("++++++++++++++++++++++++++++++++++++++++");

        // attractors around 978.2... and 980.6..., giving 2 clusters...
        // [ for a lower tolerance (0.0001), it's 978.8... and 979.5... ]
        System.out.println("denclue.y = " + Arrays.toString(denclue.y));
        System.out.println("denclue.attractors = " + Arrays.deepToString(denclue.attractors));

        System.out.println("++++++++++++++++++++++++++++++++++++++++");

        // ... but there is only one real maximum, at 979.25! (between both attractor clusters, as seen in the rasterized kernel density estimation:)
        for (double x = 975; x < 982; x+=0.25) {
            System.out.printf("p(%.2f) = %.10f%n", x, kernelDensity.p(x));
        }

    }

}
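To double-check the single-peak expectation, the density maximum can also be located numerically on a fine grid. A minimal sketch to drop into main after the loop above (the bounds 940..1040 are an arbitrary choice that covers the data):

        // scan a fine grid for the kernel density maximum (reuses `kernelDensity` from above)
        double bestX = Double.NaN;
        double bestP = -1;
        for (double x = 940; x <= 1040; x += 0.01) {
            double p = kernelDensity.p(x);
            if (p > bestP) {
                bestP = p;
                bestX = x;
            }
        }
        System.out.printf("density argmax ~ %.2f (p = %.10f)%n", bestX, bestP);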

Input data Given in the code.

Additional context

Am I missing something? In most other cases it works like a charm, but I need that last bit of reliability... Thank you!

haifengl commented 3 years ago

DENCLUE is designed for clustering spatial data, i.e. at least 2-dimensional. We have no plan to support one-dimensional data.
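For reference, a minimal sketch of the intended 2D usage (the sample points and sigma are illustrative only; the fit overload is the same one used in the report):

import java.util.Arrays;
import smile.clustering.DENCLUE;

public class Denclue2D {
    public static void main(String[] args) {
        // illustrative 2D points: two loose groups in the plane (made up for this sketch)
        double[][] points = {
            {1.0, 1.1}, {1.2, 0.9}, {0.9, 1.0},
            {5.0, 5.2}, {5.1, 4.9}, {4.8, 5.0}
        };
        // fit(data, sigma, m, tol, minPts), as in the snippet above
        DENCLUE model = DENCLUE.fit(points, 1.0, points.length, 0.001, 1);
        System.out.println("labels = " + Arrays.toString(model.y));
    }
}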

brainbytes42 commented 3 years ago

That's too bad, as it works very well for the vast majority of (1D) cases and only fails in these few more "special" ones. So it's almost there...

Side-Note: Embedding the given data into 2D space leads to the same results, so it is possible (obviously with lower probability...) to run into this issue for 2-dimensional data as well.

double[][] data = {{946.4020195314661}, {967.9050009767525}, {967.983022460714}, {971.5659926757216}, {982.445025390014}, {1004.1194995115511}, {1007.0109956054948}, {1007.7774868165143}, {1039.149496093858}};
// x -> (x, x)
data = Arrays.stream(data).map(row -> new double[]{row[0], row[0]}).toArray(double[][]::new);

// adapt sigma for the diagonal stretching of the data:
// the embedding x -> (x, x) scales all pairwise distances by sqrt(2),
// so the bandwidth has to grow by sqrt(2) to keep the kernel weights unchanged
double sigma = 19.211259991771023;
sigma = Math.sqrt(2 * sigma * sigma);

Side-Note 2: It also fails for "real" 2D data with x -> (x, random):

double[][] data = {{946.4020195314661}, {967.9050009767525}, {967.983022460714}, {971.5659926757216}, {982.445025390014}, {1004.1194995115511}, {1007.0109956054948}, {1007.7774868165143}, {1039.149496093858}};

data = Arrays.stream(data).map(row -> new double[]{row[0], 10*Math.random()}).toArray(double[][]::new);
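(Math.random() makes each run different; to replay a failing run one could seed the random source, e.g. with java.util.Random and an arbitrary fixed seed, which is my addition and not part of the original run:)

import java.util.Random;

Random rng = new Random(42); // arbitrary fixed seed so a failing configuration can be replayed
data = Arrays.stream(data).map(row -> new double[]{row[0], 10 * rng.nextDouble()}).toArray(double[][]::new);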

Nevertheless, it would be nice to have this restriction documented (as the restriction for high-dimensional data already is). That would have saved me some time and effort, which, as it turns out, was not even spent on an intended use.