Buffered input for very large dataset.

src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA



Buffered input for very large dataset. #7

Closed sharthZ23 closed 7 years ago

sharthZ23 commented 7 years ago

How about adding buffered input for large datasets? For example, clustering ~500M samples in 64 dimensions into ~150M clusters.

vmarkovtsev commented 7 years ago

Yes, it is planned (a.k.a. minibatch). Is the data sparse or dense? I planned to add support for sparse features first.
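
To make "buffered input" and "minibatch" concrete, here is a minimal sketch in plain Python. It is not kmcuda code and does not reflect how kmcuda would implement this. It assumes the dataset is a raw file of row-major float32 values, that the first chunk holds at least `k` rows, and it uses the common per-centroid running-mean update for minibatch K-means. The names `read_chunks`, `nearest` and `minibatch_kmeans` are hypothetical.

```python
import random
from array import array


def read_chunks(path, dim, chunk_rows):
    """Yield lists of float32 rows from a raw binary file, chunk_rows at a time."""
    row_bytes = dim * 4  # assumption: 4-byte float32 values, row-major layout
    with open(path, "rb") as f:
        while True:
            raw = f.read(row_bytes * chunk_rows)
            if not raw:
                break
            flat = array("f")
            # Drop a trailing partial row, if the file is truncated.
            flat.frombytes(raw[: len(raw) - len(raw) % row_bytes])
            yield [flat[i:i + dim] for i in range(0, len(flat), dim)]


def nearest(centroids, row):
    """Return the index of the centroid closest to row (squared L2 distance)."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centroids):
        d = sum((a - b) ** 2 for a, b in zip(row, c))
        if d < best_d:
            best, best_d = j, d
    return best


def minibatch_kmeans(chunks, k, seed=0):
    """One pass of minibatch K-means over an iterable of chunks."""
    rng = random.Random(seed)
    centroids, counts = None, [0] * k
    for chunk in chunks:
        if centroids is None:
            # Assumption: initialise from k random rows of the first chunk.
            centroids = [list(r) for r in rng.sample(chunk, k)]
        # Assign the whole chunk first, then update the centroids.
        assignments = [nearest(centroids, row) for row in chunk]
        for row, j in zip(chunk, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centroid learning rate
            c = centroids[j]
            for t in range(len(c)):
                c[t] += eta * (row[t] - c[t])
    return centroids
```

Only one chunk is held in memory at a time, so peak memory is governed by `chunk_rows` and the centroid table rather than by the dataset size. At the scale discussed here the nearest-centroid search would have to run on the GPU; the pure-Python loop only illustrates the data flow.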

sharthZ23 commented 7 years ago

I have dense data.

vmarkovtsev commented 7 years ago

This is for my future reference: 500M × 64 × 4 bytes = 128 GB
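
The arithmetic above can be checked with a short sketch. It assumes dense float32 storage (4 bytes per value, which is what the "×4" suggests) and decimal gigabytes; `dense_bytes` is a hypothetical helper. The centroid figure is not stated in the thread and is derived from the ~150M clusters mentioned in the original request.

```python
def dense_bytes(rows, dim, itemsize=4):
    """Bytes needed for a dense rows x dim matrix; itemsize=4 assumes float32."""
    return rows * dim * itemsize


samples = dense_bytes(500_000_000, 64)    # the dataset from the request
centroids = dense_bytes(150_000_000, 64)  # the cluster centers alone

print(samples / 1e9, "GB for samples")      # 128.0 GB for samples
print(centroids / 1e9, "GB for centroids")  # 38.4 GB for centroids
```

Even with the samples streamed from disk, the centroids alone would take about 38.4 GB under these assumptions, so buffering the input does not by itself make the problem fit on a single GPU.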

vmarkovtsev commented 7 years ago

Facebook Research has recently released https://github.com/facebookresearch/faiss, which is a very fast K-nn CUDA implementation. It includes K-means and supports datasets that do not fit into memory (it splits them and processes them chunk by chunk). Faiss is focused on very large scale and is awesome, so I will not address this issue.