Currently the KMeans|| initialization algorithm runs on the CPU (https://github.com/h2oai/h2o4gpu/blob/master/src/gpu/kmeans/kmeans_h2o4gpu.cu#L367), which is a major bottleneck when the data is large (for example the Homesite Kaggle dataset).
It would be beneficial to write a GPU version of it and use it when the data is large enough.
Things to take into account:
1) Test whether it is still worth running on the CPU in certain cases, or whether we should always run on the GPU whenever the rest of the algorithm will run on the GPU as well.
2) If possible, pass the data to the GPU only once and reuse it for both kmeans|| and the rest of the algorithm, so we don't move the data back and forth all the time.
3) kmeans|| can be very memory hungry, especially when the number of clusters is large: it computes distances for roughly p * k candidate centers, where k is the number of clusters specified by the user and p is an oversampling factor larger than 1. This might be one of the reasons to keep the calculations on the CPU in some cases.
4) Do benchmarks when done.
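To make the memory concern in point 3 concrete, here is a minimal CPU sketch of the k-means|| sampling phase. It is not the h2o4gpu implementation, just an illustration under assumed parameter names (`oversampling` plays the role of the `p` factor above): each round samples about `oversampling * k` extra candidates, and the per-round distance computation needs an `n x len(centers)` buffer, which is the part that blows up in memory when `k` is large.

```python
import numpy as np

def kmeans_parallel_candidates(X, k, oversampling=2.0, rounds=5, seed=0):
    """Toy sketch of k-means|| candidate sampling (not h2o4gpu's code).

    Each round samples points with probability proportional to their
    squared distance from the current candidate set, capped at 1, so the
    candidate set grows by roughly oversampling * k points per round.
    The final step (reclustering candidates down to k centers) is omitted.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Start from one uniformly chosen point.
    centers = X[rng.integers(n)][None, :]
    for _ in range(rounds):
        # Squared distance of every point to its nearest candidate.
        # This n x len(centers) distance matrix is the memory-hungry part:
        # its width grows every round as candidates accumulate.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        total = d2.sum()
        if total == 0.0:  # all points already coincide with a candidate
            break
        probs = np.minimum(1.0, oversampling * k * d2 / total)
        picked = rng.random(n) < probs
        centers = np.vstack([centers, X[picked]])
    return centers
```

With `rounds=5` and `oversampling=2`, the sketch typically ends with on the order of `10 * k` candidates, so the distance buffer is about `n * 10 * k` floats; that back-of-the-envelope number is what decides whether the phase fits in GPU memory or should stay on the CPU.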