mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/

Why is `predict_KMeans()` slower than a plain R function? #46

Closed (kadyb closed 1 year ago)

kadyb commented 1 year ago

Thank you for the package! I'm currently looking for an efficient implementation of k-means in R and ran some benchmarks. I wrote my own function to predict k-means cluster assignments (it may be incorrect), but it seems to be faster than ClusterR::predict_KMeans(). What is the reason?

library("ClusterR")
set.seed(1)

## dataset
n = 1e5
x = cbind(x1 = rnorm(n, sd = 0.3),
          x2 = rnorm(n, mean = 1, sd = 0.3),
          x3 = rnorm(n, mean = 10, sd = 4),
          x4 = rnorm(n, sd = 1))

#### train ####
system.time({ mdl1 = stats::kmeans(x, 1000, iter.max = 10) }) #> 8.71
system.time({ mdl2 = ClusterR::KMeans_rcpp(x, 1000, max_iters = 10) }) #> 75.78
system.time({ mdl3 = ClusterR::MiniBatchKmeans(x, 1000, batch_size = 100, max_iters = 10) }) #> 5.44
system.time({ mdl4 = ClusterR::KMeans_arma(x, 1000, n_iter = 10) }) #> 0.53

#### predict ####
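## my function: assign each row of newdata to its nearest centroid
predict.kmeans = function(mdl, newdata) {
  vec = integer(nrow(newdata))
  mdl = t(mdl)  # centroids as columns, so a row of newdata recycles down each column
  for (i in seq_len(nrow(newdata))) {
    # squared Euclidean distance from row i to every centroid
    vec[i] = which.min(colSums((mdl - newdata[i, ])^2))
  }
  return(vec)
}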

system.time({ predict.kmeans(mdl1$centers, x) }) #> 2.44
system.time({ ClusterR::predict_KMeans(x, mdl2$centroids, threads = 1) }) #> 6.89
system.time({ ClusterR::predict_MBatchKMeans(x, mdl3$centroids) }) #> 7.34
system.time({ ClusterR::predict_KMeans(x, mdl4, threads = 1) }) #> 6.86

Session info
```
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ClusterR_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10      fansi_1.0.4      gmp_0.6-10       dplyr_1.1.0      utf8_1.2.2
 [6] grid_4.2.2       R6_2.5.1         lifecycle_1.0.3  gtable_0.3.1     magrittr_2.0.3
[11] scales_1.2.1     pillar_1.8.1     ggplot2_3.4.0    rlang_1.0.6      cli_3.6.0
[16] rstudioapi_0.14  generics_0.1.3   vctrs_0.5.2      tools_4.2.2      glue_1.6.2
[21] munsell_0.5.0    compiler_4.2.2   pkgconfig_2.0.3  colorspace_2.1-0 tidyselect_1.2.0
[26] tibble_3.1.8
```

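Since the hand-written predict function above "may be incorrect", one way to cross-check it is against a vectorized variant built on the identity |x - c|^2 = |x|^2 - 2 x.c + |c|^2. The following is a minimal base-R sketch, not ClusterR's implementation; `nearest_centroid` is a hypothetical helper name. It builds an n x k matrix, which for the benchmark above (1e5 rows, 1000 centroids) would need roughly 800 MB, so in practice the rows would have to be processed in chunks.

```r
# Hypothetical helper (not part of ClusterR): assign each row of `newdata`
# to its nearest centroid with matrix algebra instead of a row-by-row loop.
nearest_centroid = function(centers, newdata) {
  # |x - c|^2 = |x|^2 - 2 x.c + |c|^2; |x|^2 is constant within a row,
  # so dropping it does not change which centroid is closest
  cross = newdata %*% t(centers)       # n x k matrix of dot products
  c2 = rowSums(centers^2)              # length-k vector of |c|^2
  d = sweep(-2 * cross, 2, c2, "+")    # n x k distances, up to a per-row constant
  max.col(-d, ties.method = "first")   # column index of the smallest entry per row
}

# Small cross-check against the loop-based approach
set.seed(1)
x_small = matrix(rnorm(200 * 4), ncol = 4)
centers_small = x_small[sample(200, 5), ]
loop_result = apply(x_small, 1, function(r) {
  which.min(colSums((t(centers_small) - r)^2))
})
vec_result = nearest_centroid(centers_small, x_small)
print(all(loop_result == vec_result))
```

If the two results agree on small inputs, that is some evidence (not proof) that the loop version computes the intended assignment.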
mlampros commented 1 year ago

@kadyb I'm sorry for not replying earlier. The following is what I see when I run the code:

[image: benchmarking]

In my opinion, the elapsed time also depends on the system configuration and hardware (because the code is compiled, i.e. an RcppArmadillo function). Based on the attached image, the elapsed times on my system suggest that ClusterR::KMeans_rcpp() needs improvement (which I intend to work on at some point in the future), but not ClusterR::predict_KMeans().
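Because a single `system.time()` call is noisy and machine-dependent, comparisons across systems can be made a little more robust by repeating each measurement and reporting the median. A minimal base-R sketch; `time_median` is a hypothetical helper, not part of ClusterR, and the small dataset is only for illustration:

```r
# Hypothetical helper: run `fun` several times and return the median elapsed time
time_median = function(fun, times = 5L) {
  elapsed = vapply(
    seq_len(times),
    function(i) system.time(fun())[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

set.seed(1)
x_small = matrix(rnorm(2000 * 4), ncol = 4)
median_elapsed = time_median(function() stats::kmeans(x_small, 5, iter.max = 10))
print(median_elapsed)
```

The same wrapper could be applied to `ClusterR::predict_KMeans()` and to the hand-written function to compare them on equal terms.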

kadyb commented 1 year ago

Thanks! Interesting, I checked on other hardware with Ubuntu and indeed the times look different. I don't know if that matters, but are you using the standard BLAS library or another (e.g. OpenBLAS, Intel MKL)?
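For anyone comparing machines, the BLAS/LAPACK libraries that R is linked against can be inspected from within R. A small sketch using base functions available in recent R versions (3.4.0 or later, as far as I know); the reported paths differ per platform and may be empty on some builds:

```r
# Report which BLAS / LAPACK the running R session uses
blas_info = list(
  blas = unname(extSoftVersion()["BLAS"]),  # path to the BLAS library (may be "")
  lapack = La_library(),                    # path to the LAPACK library
  lapack_version = La_version()             # LAPACK version string
)
print(blas_info)
```

`sessionInfo()` prints the same BLAS and LAPACK paths alongside the package versions.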

On my other PC I see stats::kmeans() and ClusterR::KMeans_rcpp() are comparable:

system.time({ mdl1 = stats::kmeans(x, 1000, iter.max = 10) })
#> user  system elapsed 
#> 8.508   0.000   8.513 
system.time({ mdl2 = ClusterR::KMeans_rcpp(x, 1000, max_iters = 10) })
#> user  system elapsed 
#> 6.657   2.177   8.833 
mlampros commented 1 year ago

Because I mainly use RcppArmadillo in the ClusterR package, users might observe speed-ups if the Armadillo C++ library is optimized for their configuration, as described in the speed section of the library's Frequently Asked Questions. The Linking section also includes details about the optimizations you mentioned, such as using OpenBLAS, LAPACK, BLAS etc. To tell the truth, besides using OpenMP in my RcppArmadillo code, I haven't experimented with anything else on my Linux Mint computer.
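As a rough way to see whether OpenMP makes a difference on a given machine, the `threads` argument of `predict_KMeans()` (used with `threads = 1` in the benchmark above) can be varied. This is only a sketch on a small, made-up dataset: whether multiple threads actually help depends on how the package was compiled and on the platform, and the block is skipped if ClusterR is not installed.

```r
# Physical core count; detectCores() can return NA on some platforms
cores = parallel::detectCores(logical = FALSE)
n_threads = if (is.na(cores)) 1L else max(1L, cores - 1L)

if (requireNamespace("ClusterR", quietly = TRUE)) {
  set.seed(1)
  x_small = matrix(rnorm(5000 * 4), ncol = 4)
  mdl = ClusterR::KMeans_rcpp(x_small, 10, max_iters = 10)

  # Same prediction with one thread and with several threads
  t_single = system.time(
    ClusterR::predict_KMeans(x_small, mdl$centroids, threads = 1)
  )
  t_multi = system.time(
    ClusterR::predict_KMeans(x_small, mdl$centroids, threads = n_threads)
  )
  print(t_single[["elapsed"]])
  print(t_multi[["elapsed"]])
}
```

On small inputs the thread overhead can outweigh any gain, so a larger dataset is needed before drawing conclusions.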

mlampros commented 1 year ago

I'll close the issue for now; feel free to re-open it if the code does not work as expected.