src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

Is there a limit on max columns (features) that kmcuda can manage? #110

Open sbushmanov opened 4 years ago

sbushmanov commented 4 years ago

kmcuda runs well with up to 12,000 features:

import numpy as np
from libKMCUDA import kmeans_cuda
from time import time

X = np.random.rand(10, 12000).astype(dtype=np.float32)

start = time()
centers_, labels_ = kmeans_cuda(X, 10)
print(time() - start)

0.19472670555114746

With anywhere from 13,000 to 60,000 features, it never finishes.

With 70,000 or more features, it raises an error immediately:

import numpy as np
from libKMCUDA import kmeans_cuda
from time import time

X = np.random.rand(10, 70000).astype(dtype=np.float32)

start = time()
centers_, labels_ = kmeans_cuda(X, 10)
print(time() - start)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-8e783410a8e6> in <module>
     10 
     11 start = time()
---> 12 centers_, labels_ = kmeans_cuda(X, 10)
     13 print(time() - start)

ValueError: "samples": more than 70000 features is not supported

So my question is:

Is there a limit on the number of columns (features) that kmcuda can handle, or am I missing something?

I'm running Ubuntu 18.04 with a conda Python 3.7 environment, CUDA 10.2, and libKMCUDA 6.2.3 installed via pip.
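
For reference, one possible workaround while the limit is unclear is to reduce the number of columns before clustering. The sketch below uses a plain Gaussian random projection in NumPy. The helper name `project_features` and the target size of 256 columns are my own assumptions for illustration and are not part of kmcuda. Whether the projected data clusters similarly to the original depends on the data, so treat this as a rough sketch rather than a fix.

```python
import numpy as np


def project_features(X, n_components, seed=0):
    """Reduce the columns of X with a Gaussian random projection.

    Hypothetical helper for illustration; not part of kmcuda.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Random matrix with i.i.d. N(0, 1) entries, generated directly as float32.
    R = rng.standard_normal(size=(n_features, n_components), dtype=np.float32)
    # Scale by 1/sqrt(n_components) so squared distances are
    # preserved in expectation.
    R /= np.sqrt(n_components)
    return (X @ R).astype(np.float32)


# Same shape as the failing example: 10 samples, 70,000 features.
X = np.random.rand(10, 70000).astype(np.float32)

# 256 is an arbitrary target (assumption), well below the 12,000 columns
# that ran without problems.
X_small = project_features(X, n_components=256)
print(X_small.shape)
```

The reduced array `X_small` could then be passed to `kmeans_cuda(X_small, 10)` in place of `X`. Note that the returned centers would live in the projected space, not in the original 70,000-dimensional one.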