Closed sven-h closed 2 years ago
Thanks. Changing to long won't solve the problem. A Java array can hold at most Integer.MAX_VALUE elements, so the array allocation will fail anyway for large data sets.
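To make the array limit concrete, here is a minimal sketch of a guard that computes the packed length n * (n+1) / 2 in long arithmetic and rejects sizes that do not fit in a Java array index. The class name PackedLength and the method packedLength are hypothetical names for illustration, not part of the library, and this is not the change that was made in the repository.

```java
final class PackedLength {
    /**
     * Returns n * (n + 1) / 2 as an int, computed in long arithmetic.
     * Hypothetical helper: throws instead of silently overflowing when
     * the result would exceed the maximum Java array length.
     */
    static int packedLength(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative: " + n);
        }
        // Widen to long before multiplying so the product cannot wrap around.
        long length = (long) n * (n + 1L) / 2;
        if (length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Too many examples for a single array: n = " + n
                + " needs " + length + " entries");
        }
        return (int) length;
    }

    public static void main(String[] args) {
        System.out.println(packedLength(46_341)); // 1073767311
        System.out.println(packedLength(65_535)); // 2147450880
        try {
            packedLength(65_536);
        } catch (IllegalArgumentException e) {
            // 65,536 examples would need 2,147,516,416 entries
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, sizes up to 65,535 work and anything larger fails with an explicit message rather than a NegativeArraySizeException or a silently wrong length.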
Yes, that is true. If the number of examples is above 65,535, the array can no longer be allocated.
However, the Java array size limit only becomes a problem above 65,535 examples, whereas the error in computing intermediate values such as n * (n+1) / 2 already occurs at 46,341 examples.
The same problem exists in the Linkage class constructor, which fails for datasets with between 46,341 and 65,534 examples.
If the computation is done with long instead of int, the algorithm can also compute the clustering for up to 65,535 examples.
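The thresholds above can be checked with a small standalone program. This is only an illustrative sketch: OverflowDemo, lengthInt, and lengthLong are hypothetical names, and the int version simply mirrors the expression n * (n+1) / 2 rather than the library's actual code.

```java
final class OverflowDemo {
    // Same expression in int arithmetic: the product wraps around
    // once n * (n + 1) exceeds 2^31 - 1.
    static int lengthInt(int n) {
        return n * (n + 1) / 2;
    }

    // Same expression in long arithmetic: no wrap-around for any int n >= 0.
    static long lengthLong(int n) {
        return (long) n * (n + 1L) / 2;
    }

    public static void main(String[] args) {
        int[] sizes = {46_340, 46_341, 65_535, 65_536};
        for (int n : sizes) {
            System.out.printf("n=%d int=%d long=%d%n",
                    n, lengthInt(n), lengthLong(n));
        }
        // Prints:
        // n=46340 int=1073720970 long=1073720970
        // n=46341 int=-1073716337 long=1073767311
        // n=65535 int=-32768 long=2147450880
        // n=65536 int=32768 long=2147516416
    }
}
```

At 46,340 both versions agree. At 46,341 the int result turns negative. At 65,536 the int result is positive again but far too small, while the long result (2,147,516,416) is above Integer.MAX_VALUE and could not be used as an array length anyway.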
I made some changes so that it works with up to 65,535 rows. Please try the master branch.
Describe the bug
I run hierarchical clusterings with a large number of examples and noticed that the result is not computed correctly. After further analysis I found that an integer overflow happens internally. The corresponding line is:
int length = n * (n+1) / 2;
If n is 46,341, then n * (n+1) is greater than 2^31-1 (the maximum of the int datatype), so an integer overflow happens. I suggest using long throughout the whole function (it is also important to set n, length, and k to type long). The overflow then only happens with more than 3,037,000,499 examples, which should be enough.
Expected behavior
No overflow happens.
Actual behavior
A java.lang.NegativeArraySizeException is thrown when n is 46,341. With 65,536 examples, the result of n * (n+1) is positive again, so the overflow goes undetected, but the calculations are wrong.
Code snippet