Closed brainbytes42 closed 3 years ago
DENCLUE is designed for clustering spatial data, i.e. at least 2-dimensional. We have no plan to support one-dimensional data.
That's too bad, as it works very well for the vast majority of (1D) cases and only fails in these few more "special" cases. So it's almost there...
Side-Note: Embedding the given data into 2D space leads to the same results - so it's possible (though obviously with lower probability...) to run into this issue for 2-dimensional data as well.
double[][] data = {{946.4020195314661}, {967.9050009767525}, {967.983022460714}, {971.5659926757216}, {982.445025390014}, {1004.1194995115511}, {1007.0109956054948}, {1007.7774868165143}, {1039.149496093858}};
// x -> (x, x)
data = Arrays.stream(data).map(row -> new double[]{row[0], row[0]}).toArray(double[][]::new);
// adapt sigma for diagonal stretching of data
double sigma = 19.211259991771023;
sigma = Math.sqrt(2 * sigma * sigma);
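The sigma adjustment above follows from the embedding itself: mapping x -> (x, x) stretches every pairwise Euclidean distance by sqrt(2), so the bandwidth must be scaled by the same factor to keep the density estimate equivalent. A minimal sketch in plain Java (class and method names are illustrative, not part of any library):

```java
public class EmbeddingDistanceCheck {

    // Euclidean distance between the 2D embeddings (a, a) and (b, b)
    static double dist2d(double a, double b) {
        double dx = a - b;
        return Math.sqrt(dx * dx + dx * dx); // = sqrt(2) * |a - b|
    }

    public static void main(String[] args) {
        double a = 946.4020195314661, b = 967.9050009767525;
        double d1 = Math.abs(a - b); // original 1D distance
        double d2 = dist2d(a, b);    // distance after x -> (x, x)
        // the embedding stretches distances by sqrt(2),
        // hence sigma' = sqrt(2 * sigma * sigma) = sqrt(2) * sigma
        System.out.println(d2 / d1);
    }
}
```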
Side-Note 2: It also fails for "real" 2D data, with x -> (x, random):
double[][] data = {{946.4020195314661}, {967.9050009767525}, {967.983022460714}, {971.5659926757216}, {982.445025390014}, {1004.1194995115511}, {1007.0109956054948}, {1007.7774868165143}, {1039.149496093858}};
data = Arrays.stream(data).map(row -> new double[]{row[0], 10*Math.random()}).toArray(double[][]::new);
Nevertheless, it would be nice to have this restriction documented (as it is for high-dimensional data); that would have saved me some time and effort, since I was apparently not even using the algorithm for its intended use case.
Describe the bug Using the DENCLUE algorithm to cluster one-dimensional data (i.e. finding peaks in the KernelDensity and assigning cluster data) works well in most cases. Sometimes, though, the clustering produces two clusters where there should be only one.
Expected behavior For the given data and config, I expect a single cluster with a single attractor position at the kernel-density peak. As shown by the use of KernelDensity, a single peak is expected at about 979.25:
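The expected single peak can be reproduced without the library: the sketch below is not Smile's KernelDensity, but a plain-Java Gaussian kernel density estimate over the data from the snippet above (same assumed sigma), scanned on a grid to locate the mode:

```java
public class KdePeakScan {

    static final double[] DATA = {946.4020195314661, 967.9050009767525, 967.983022460714,
            971.5659926757216, 982.445025390014, 1004.1194995115511, 1007.0109956054948,
            1007.7774868165143, 1039.149496093858};
    static final double SIGMA = 19.211259991771023;

    // Gaussian kernel density estimate at x; the normalization constant is
    // omitted because it does not affect the location of the peak
    static double density(double x) {
        double sum = 0;
        for (double xi : DATA) {
            double d = x - xi;
            sum += Math.exp(-d * d / (2 * SIGMA * SIGMA));
        }
        return sum;
    }

    // scan [from, to] with the given step and return the x of maximal density
    static double peak(double from, double to, double step) {
        double best = from, bestDensity = density(from);
        for (double x = from; x <= to; x += step) {
            double d = density(x);
            if (d > bestDensity) {
                bestDensity = d;
                best = x;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // a single mode near 979.25 is expected for this data and sigma
        System.out.println(peak(940, 1045, 0.01));
    }
}
```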
Actual behavior The returned Clustering consists of two clusters, with their attractors close to the real peak - one to the left, one to the right: around 978.2 and 980.6.
[Previously, I used the denclue.y results array for clustering, which triggered this behaviour more often. I found that using predict(.) on the same data delivers more stable results, i.e. produces fewer "split clusters". I thought this change might be a usable workaround, but meanwhile I found the case below, which fails even with this workaround.]
Code snippet
Input data Given in the code.
Additional context
Am I missing something? In most other cases it works like a charm - but I need this last bit of reliability... Thank you!