Unrecoverable cudaMemcpy error

I get the following output (with verbosity=3) when running kmeans_cuda from Python on the attached input:

performing kmeans++...
kmeans++: dump 292 64 0x564e90a8e000
kmeans++: dev #0: 0x7fd5f5000000 0x7fd5f51ef600 0x7fd5f51fd5c0
step 1[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN
step 2[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN
step 3[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN
step 4[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN
step 5[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN

internal bug in kmeans_init_centroids: j = 0
step 6[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)
cudaMemcpyAsync( host_dists + offset, (*dists)[devi].get(), length * sizeof(float), cudaMemcpyDeviceToHost)
....../kmcuda/src/kmeans.cu:810 -> an illegal memory access was encountered

kmeans_cuda_plus_plus failed
kmeans_init_centroids() failed for yinyang groups: an illegal memory access was encountered
kmeans_cuda_yy failed: no error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: cudaMemcpy failed

There are 14641 vectors of dimension 64, and I'm trying to get 292 clusters with the default yinyang_t=0.1. If I reduce it to yinyang_t=0.01, the function succeeds, with only a single "dist_sum is NaN" error at step 1. That would be fine if I could wrap the call in try/except and retry, but unfortunately the first failure seems to leave some memory error behind, and calling the function again in the same process with yinyang_t=0.01 results in:

...../kmcuda/src/kmcuda.cc:151 -> an illegal memory access was encountered

After that I have to restart Python.
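One possible workaround, without touching the library, is to run each attempt in its own Python interpreter, so that a broken CUDA state dies with the child process. The following is only a sketch under assumptions: `CHILD_SCRIPT`, `run_isolated`, `kmeans_with_fallback` and the file paths are hypothetical names, the input is assumed to be the pickled keyword arguments from the reproduction script, and the return value of kmeans_cuda is assumed to be picklable.

```python
import subprocess
import sys

# Script executed by the child interpreter (assumed layout, see above):
# argv[1] = pickled kwargs, argv[2] = output path, argv[3] = yinyang_t.
CHILD_SCRIPT = """
import pickle, sys
from libKMCUDA import kmeans_cuda
in_path, out_path, yinyang_t = sys.argv[1], sys.argv[2], float(sys.argv[3])
with open(in_path, 'rb') as f:
    params = pickle.load(f)
params['yinyang_t'] = yinyang_t
result = kmeans_cuda(**params)
with open(out_path, 'wb') as f:
    pickle.dump(result, f)
"""


def run_isolated(script, *args):
    """Run `script` in a fresh interpreter and return its exit code."""
    cmd = [sys.executable, "-c", script, *map(str, args)]
    return subprocess.run(cmd).returncode


def kmeans_with_fallback(in_path, out_path, thresholds=(0.1, 0.01)):
    """Try each yinyang_t in turn; return the first one whose child exits with 0."""
    for t in thresholds:
        if run_isolated(CHILD_SCRIPT, in_path, out_path, t) == 0:
            return t
    raise RuntimeError("kmeans_cuda failed for every yinyang_t tried")
```

With this, something like `kmeans_with_fallback('kmeans_input.pickle', 'kmeans_output.pickle')` would leave the parent process usable even when the first attempt hits the illegal memory access, at the cost of reloading the data and reinitializing CUDA for every attempt.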

I'm using Ubuntu 20.04 with an RTX 2080 Ti, and I compiled the library with CUDA_ARCH=75. The errors can be reproduced with the attached file and the following code:

from libKMCUDA import kmeans_cuda
import pickle
with open('kmeans_input.pickle', 'rb') as f:
    params = pickle.load(f)
kmeans_cuda(**params)

I tried to look at the code and figure out where the NaNs come from (my data has no NaNs in it), but couldn't find the source of the problem. I also couldn't find a way to recover from this failure. Attachment: kmeans_input.zip

I managed to fix this problem on my side. First, in kmeans_init_centroids in kmcuda.cc, I changed this:

if (j == 0 || j > samples_size) {
    assert(j > 0 && j <= samples_size);
    INFO("\ninternal bug in kmeans_init_centroids: j = %" PRIu32 "\n", j);
}

to this:

if (j == 0 || j > samples_size) {
    assert(j > 0 && j <= samples_size);
    INFO("\ninternal bug in kmeans_init_centroids: j = %" PRIu32 "\n", j);
    return kmcudaRuntimeError;
}

This avoids the cudaMemcpy error in the subsequent call to cuda_copy_sample_t and makes the error recoverable: after getting this error message I can call kmeans_cuda again with a different yinyang_t value, without restarting the Python application.
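With the early return in place, the retry can live on the Python side. Here is a minimal sketch, assuming the failure surfaces as a RuntimeError (as in the traceback above) and that the process stays usable afterwards. `retry_with_thresholds` is a hypothetical helper, not part of libKMCUDA.

```python
def retry_with_thresholds(fn, params, thresholds=(0.1, 0.01)):
    """Call fn(**params) with each yinyang_t in turn until one succeeds."""
    last_error = None
    for t in thresholds:
        try:
            # Override (or add) yinyang_t without mutating the caller's dict.
            return fn(**{**params, "yinyang_t": t})
        except RuntimeError as exc:
            last_error = exc
    raise RuntimeError("all yinyang_t values failed") from last_error
```

With the reproduction script above, this would be called as `result = retry_with_thresholds(kmeans_cuda, params)`.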

In addition, I found that the NaN values come from the _sqrt function used in the distance_t functions in metric_abstraction.h. In theory a negative value can never reach _sqrt, yet NaNs still appeared even after I took the absolute value of its input, maybe because of numerical errors somewhere in the calculations or in the CUDA implementation of the square root. The NaNs probably result from input values very close to zero, so it is possible to simply replace them with zeros and have the code work, using something like this:

FPATTR static float distance_t_no_nan(const F *__restrict__ v1, const F *__restrict__ v2,
                                      uint64_t v1_size, uint64_t v1_index) {
  auto res = distance_t(v1, v2, v1_size, v1_index);
  // NaN is the only value that does not compare equal to itself.
  return _eq(res, res) ? res : 0;
}

This slows the code down a bit, so I only use it where the original distance_t returned NaN. It works for me, and I can open a PR if necessary (I have only tested it on my system).
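To see why a single NaN distance could be enough to produce both messages in the log, here is a simplified model of distance-weighted k-means++ sampling. It is only an illustration and is not the kmcuda code: `pick_index` and `nan_to_zero` are hypothetical names, and the 1-based index with 0 meaning "nothing found" is an assumption chosen to mirror the `j = 0` message.

```python
def pick_index(dists, u):
    """Return the 1-based index of the first point whose cumulative
    distance reaches u * sum(dists), or 0 if no point qualifies."""
    total = sum(dists)   # one NaN in dists makes the whole sum NaN
    target = u * total
    acc = 0.0
    for j, d in enumerate(dists, start=1):
        acc += d
        if acc >= target:  # any comparison involving NaN is False
            return j
    return 0


def nan_to_zero(d):
    """Replace NaN with 0.0 using the same self-comparison trick."""
    return d if d == d else 0.0
```

In this model, `pick_index([1.0, float('nan'), 3.0], 0.5)` returns 0 because every comparison against a NaN target is false, while mapping the distances through `nan_to_zero` first gives a valid index again.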

src-d/kmcuda issue #112