Open assaf127 opened 3 years ago
I managed to fix this problem for myself. First, in kmcuda.cc, inside kmeans_init_centroids, I changed this:
if (j == 0 || j > samples_size) {
assert(j > 0 && j <= samples_size);
INFO("\ninternal bug in kmeans_init_centroids: j = %" PRIu32 "\n", j);
}
to that:
if (j == 0 || j > samples_size) {
assert(j > 0 && j <= samples_size);
INFO("\ninternal bug in kmeans_init_centroids: j = %" PRIu32 "\n", j);
return kmcudaRuntimeError;
}
This avoids the cudaMemcpy error in the next call to cuda_copy_sample_t. The fix makes the error recoverable: after I get this error message, I can call kmeans_cuda again with a different yinyang_t value without needing to restart the Python application.
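With the error recoverable, the retry can be done from Python. The following is only a minimal sketch: `run_with_fallbacks` is a hypothetical helper (not part of the library), and it assumes the failure surfaces as a Python exception rather than a crash. The clustering call is passed in as a callable so the helper does not depend on the library itself.

```python
def run_with_fallbacks(cluster, yinyang_values=(0.1, 0.01)):
    """Call cluster(yinyang_t=...) with each value in turn until one succeeds.

    `cluster` is any callable accepting a `yinyang_t` keyword argument.
    Assumption: a failed run raises an exception instead of killing the process.
    """
    last_error = None
    for t in yinyang_values:
        try:
            return cluster(yinyang_t=t)
        except Exception as exc:  # assumed failure mode, e.g. the NaN error
            last_error = exc
    if last_error is None:
        raise ValueError("yinyang_values must not be empty")
    # Every value failed: re-raise the most recent error.
    raise last_error


# Possible usage, assuming `samples` holds the input array:
#   from libKMCUDA import kmeans_cuda
#   result = run_with_fallbacks(
#       lambda yinyang_t: kmeans_cuda(samples, 292, yinyang_t=yinyang_t, verbosity=3))
```

The values 0.1 and 0.01 are the two settings mentioned in this thread; any other sequence can be passed instead.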
In addition, I found that the source of the NaN values is the _sqrt function used by the distance_t functions in metric_abstraction.h. Even though it should be theoretically impossible for a negative value to reach _sqrt, and even after taking the absolute value of the input to this function, NaN values still appeared, maybe due to numerical errors somewhere in the calculations or in the CUDA implementation of the square root functions. The NaN probably results from input values too close to zero, so it is possible to just replace the NaN with zero and have the code work, using something like this:
FPATTR static float distance_t_no_nan(const F *__restrict__ v1, const F *__restrict__ v2,
uint64_t v1_size, uint64_t v1_index) {
auto res = distance_t(v1, v2, v1_size, v1_index);
return _eq(res, res) ? res : 0;
}
This will slow the code a bit, so I only use it if the original distance_t returned NaN.
This is working OK for me, and I can open a PR if necessary (I tested it only on my system).
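The check in distance_t_no_nan above relies on NaN being the only floating-point value that compares unequal to itself. As a rough illustration of that idiom outside CUDA, here is a small Python sketch; `nan_to_zero` is a hypothetical name used only for this example and does not exist in the library.

```python
import math


def nan_to_zero(res: float) -> float:
    """Return res unchanged, unless it is NaN, in which case return 0.0."""
    # NaN != NaN under IEEE 754, so `res == res` is False only for NaN.
    return res if res == res else 0.0


nan = float("nan")
print(nan_to_zero(nan))   # NaN is replaced by zero
print(nan_to_zero(1.5))   # ordinary values pass through unchanged
print(math.isnan(nan))    # math.isnan expresses the same test explicitly
```

Note that this only masks the NaN after the fact; it does not explain where it comes from.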
I get the following output (using verbosity=3) when running kmeans_cuda from Python on a certain input (attached here):
There are 14641 vectors of dimension 64, and I am trying to get 292 clusters. I'm using the default yinyang_t=0.1. If I reduce it to yinyang_t=0.01, the function succeeds, with only a single dist_sum is NaN error for step 1. This would have been fine if I could wrap the function call with try-except, but unfortunately after the first failure there is probably some memory error, and running the code again with yinyang_t=0.01 results in:
And I need to restart Python again.
I'm using Ubuntu 20.04 and an RTX 2080 Ti, and compiled the library using CUDA_ARCH=75. The errors can be reproduced using the attached file and the following code:
I tried to look at the code and figure out where the NaNs come from (my data has no NaNs in it), but couldn't find the source of the problem. I also didn't find a way to handle this problem in a recoverable way.
kmeans_input.zip