src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

Random NaN issue #24

Open francisr opened 7 years ago

francisr commented 7 years ago

I get a "cudaMemcpy failed" error that pops up randomly when my init is k-means++. Full log:

INFO:quantise:Processing slice 51/128
arguments: 1 0x7ffed93d5414 0.010 0.10 0 124557 4 8192 0 0 0 2 0x2621470 0x460b2a0 0x462b2b0 (nil)
reassignments threshold: 1245
yinyang groups: 819
reusing passed_yy for centroids_yy
GPU #0 memory: used 531759104 bytes (4.2%), free 12268404736 bytes, total 12800163840 bytes
GPU #0 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(3893, 1), (8, 32)>>> 124557, 4, xyswap
performing kmeans++... done
running Lloyd until reassignments drop below 13701
plans: [(0, 124557)]
planc: [(0, 8192)]
iteration 1: 124557 reassignments
iteration 2: 20600 reassignments
iteration 3: 11307 reassignments
transposing the samples...
transpose <<<(256, 1), (8, 32)>>> 8192, 4, xyswap
performing kmeans++...
step 1 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 2 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 3 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 4 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 5 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 6 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 7 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 8 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 9 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 10 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 11 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 12 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 13 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 14 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 15 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 16 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 17 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 18 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 19 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 20 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 21 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 22 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 23 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 24 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 25 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 26 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 27 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 28 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 29 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 30 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 31 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 32 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 33 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 34 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 35 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 36 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 37 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 38 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 39 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 40 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 41 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 42 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 43 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 44 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 45 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 46 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 47 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 48 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 49 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 50 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 51 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 52 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 53 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 54 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 55 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 56 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 57 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 58 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 59 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 60 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 61 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 62 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 63 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 64 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 65 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 66 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 67 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 68 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 69 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 70 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 71 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 72 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 73 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 74 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 75 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 76 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 77 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 78 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 79 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 80 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 81 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 82 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 83 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 84 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 85 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 86 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 87 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 88 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 89 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 90 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 91 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 92 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 93 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 94 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 95 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 96 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 97 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 98 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 99 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 100 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 101 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 102 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 103 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 104 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 105 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 106 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 107 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 108 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 109 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 110 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 111 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 112 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 113 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 114 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 115 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 116 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 117 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 118 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 119 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 120 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 121 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 122 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 123 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 124 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 125 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 126 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 127 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 128 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 129 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 130 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 131 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 132 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 133 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 134 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 135 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 136 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 137 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 138 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 139 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 140 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 141 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 142 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 143 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 144 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 145 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 146 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 147 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 148 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 149 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 150 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 151 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 152 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 153 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 154 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 155 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 156 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 157 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 158 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 159 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 160 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 161 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 162 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 163 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 164 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 165 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 166 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 167 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 168 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 169 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 170 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 171 internal bug inside kmeans_init_centroids: dist_sum is NaN
step 172 internal bug inside kmeans_init_centroids: dist_sum is NaN

internal bug in kmeans_init_centroids: j = 0
step 173
cudaMemcpyAsync( host_dists + offset, (dists)[devi].get(), length * sizeof(float), cudaMemcpyDeviceToHost)
/tmp/kmcuda/src/kmeans.cu:814 -> an illegal memory access was encountered
kmeans_cuda_plus_plus failed
kmeans_init_centroids() failed for yinyang groups: an illegal memory access was encountered
kmeans_cuda_yy failed: no error
Traceback (most recent call last):
  File "./tflm/quantise.py", line 162, in <module>
    main()
  File "./tflm/quantise.py", line 103, in main
    codebook, labels = kmeans_cuda(M, K, seed=0, verbosity=2, init="k-means++")
RuntimeError: cudaMemcpy failed
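
As a reading aid for the log: several of its numbers are consistent with simple products of the call parameters. The sketch below is inferred from the printed values only. It assumes the positional fields on the "arguments:" line include tolerance (0.010), yinyang_t (0.10), the sample count (124557) and the cluster count (8192); that mapping is an assumption, not something taken from the kmcuda source.

```python
# Values read off the "arguments:" line of the log (field meanings are assumed).
samples, clusters = 124557, 8192
tolerance, yinyang_t = 0.010, 0.10

# Candidate formulas that reproduce the numbers printed in the log.
reassignment_threshold = int(samples * tolerance)         # log: "reassignments threshold: 1245"
yinyang_groups = int(clusters * yinyang_t)                # log: "yinyang groups: 819"
lloyd_threshold = int(samples * (tolerance + yinyang_t))  # log: "drop below 13701"

print(reassignment_threshold, yinyang_groups, lloyd_threshold)
```

If that reading is right, the second "performing kmeans++..." pass, which transposes 8192 rows, would be the one that picks the 819 group centers from the 8192 centroids, and it is this pass that fails at step 173.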

francisr commented 6 years ago

Any update?

vmarkovtsev commented 6 years ago

Sorry, I completely missed this one :( Could you please provide the data on which I can reproduce the problem - apparently the code works for me.

As a workaround, try setting yinyang_t to 0.
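
A minimal sketch of the workaround, assuming the same call as in the traceback with yinyang_t passed as a keyword argument. The helper name cluster_without_yinyang is hypothetical; it takes the clustering function as a parameter so that the snippet itself does not need a GPU.

```python
def cluster_without_yinyang(kmeans_fn, samples, clusters, **kwargs):
    """Call a kmeans_cuda-style function with yinyang_t forced to 0.

    kmeans_fn is expected to accept (samples, clusters, **kwargs), like the
    kmeans_cuda call shown in the traceback. Any yinyang_t supplied by the
    caller is overridden.
    """
    kwargs["yinyang_t"] = 0
    return kmeans_fn(samples, clusters, **kwargs)
```

With the library installed, usage would look roughly like this (the import path is an assumption about how the Python package is exposed):

from libKMCUDA import kmeans_cuda
codebook, labels = cluster_without_yinyang(kmeans_cuda, M, K, seed=0, verbosity=2, init="k-means++")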

francisr commented 6 years ago

It's hard to provide an example because this error doesn't always occur. I'll see what I can get.
Would setting yinyang_t to 0 incur any accuracy penalty?

vmarkovtsev commented 6 years ago

No accuracy penalty, 1.5x speed decrease in the worst case.

nsmartinez commented 6 years ago

Hi! I managed to (kind of) reproduce the issue. The attached files contain the minimal steps required to produce the error. However, it only occurs around 20% of the time. I am using a machine with two GTX TITAN X GPUs, and the program crashes on both of them. This is with version 6.2.1.
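
Since the crash is intermittent, one way to put a number on it is to run the reproduction many times and count the failures. The sketch below is a generic harness, not part of the attached files. It starts every run in a fresh process, on the assumption that a CUDA illegal memory access can leave the process unusable for later runs, and it treats any non-zero exit code (such as an uncaught RuntimeError) as a failure.

```python
import subprocess


def failure_rate(command, runs):
    """Run `command` `runs` times in fresh processes.

    Returns the fraction of runs that exited with a non-zero status.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures / runs
```

For example, failure_rate([sys.executable, "repro.py"], 50) would estimate the rate over 50 runs, where repro.py is a placeholder name for the reproduction script.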

bug.zip

vmarkovtsev commented 6 years ago

OK, I will reproduce the bug this week. Thanks for posting.

lawrence1999 commented 4 years ago

> Sorry, I completely missed this one :( Could you please provide the data on which I can reproduce the problem - apparently the code works for me.
>
> As a workaround, try setting yinyang_t to 0.

Hello, what does the parameter 'yinyang_t' mean?

thmyl commented 12 months ago

Hello! I ran into this bug today as well. Any update here? Thanks a lot!