haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io
Other
6.02k stars 1.13k forks source link

Integer overflow in hierarchical clustering when number of examples are equal or more than 46,341 #707

Closed sven-h closed 2 years ago

sven-h commented 2 years ago

Describe the bug I run hierarchical clusterings with a lot of examples and noticed that it does not compute it correctly. After further analysis I found out that an integer overflow happens internally. The corresponding line is: int length = n * (n+1) / 2; . If n is 46,341, then n * (n+1) is greater than 2^31-1 (int datatype). Therefore an integer overflow happens. I suggest to use long troughout the whole function (it is also important to set n, length, and k to type long). Then the overflow happens when using more than 3,037,000,499 examples which should be enough.

Expected behavior Not overflow happening.

Actual behavior java.lang.NegativeArraySizeException in case of choosing 46,341. When using 65,536 then the calculation n * (n+1) is again positive and the overflow is undiscovered but the calculations are wrong.

Code snippet

double[][] data = new double[46341][2];

int n = data.length;
int length = n * (n+1) / 2;
System.out.println(length);

//fill with random values
Random rnd = new Random(1234);
for (int i = 0; i < data.length; i++) {
    for (int j = 0; j < data[i].length; j++) {
        data[i][j] = rnd.nextDouble();
    }
}

//this call is also executed when running e.g. SingleLinkage.of(data)
float[] prox = Linkage.proximity(data);
haifengl commented 2 years ago

Thanks. It won't solve the problem by changing to long. Java array size is up to only Integer.MAX_VALUE. The array allocation will fail anyway for large data set.

sven-h commented 2 years ago

Yes, that is true. If the number of examples are above 65535, the array generation is not possible anymore.

sven-h commented 2 years ago

But the java array size problem occurs above 65,535 examples whereas the error computing some intermediate numbers like n * (n+1) / 2; occurs already at 46,341 examples. The problem is also contained in the Linkage class constructor which fails with datasets between 46,341 and 65,534 examples. If the computation is done with long instead of int, then the algorithm can also compute the clustering until 65,535 examples.

haifengl commented 2 years ago

I made some changes so that it works with up to 65535 rows. Please try master branch.