tidymodels / butcher

Reduce the size of model objects saved to disk
https://butcher.tidymodels.org/

Add butcher methods for `ClusterR::KMeans_rcpp()` #236

Closed juliasilge closed 1 year ago

juliasilge commented 1 year ago

This PR adds butcher support for a basic k-means algorithm. I was going to do `stats::kmeans()`, but it doesn't have a `predict()` method! 😱
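For context on the missing `predict()` method: a fitted `stats::kmeans()` object does store its centroids in `$centers`, so new rows can be assigned by hand to the nearest centroid. The snippet below is a minimal sketch of that idea, not part of this PR; `predict_kmeans()` is a hypothetical helper name, and the `iris` data is only a stand-in for illustration.

```r
# Hypothetical helper: assign each new row to its nearest fitted centroid.
# `fit` only needs a `centers` matrix (one row per cluster).
predict_kmeans <- function(fit, new_data) {
  new_data <- as.matrix(new_data)
  centers <- fit$centers
  # Squared Euclidean distance from every row to every centroid
  d2 <- vapply(
    seq_len(nrow(centers)),
    function(k) rowSums(sweep(new_data, 2, centers[k, ])^2),
    numeric(nrow(new_data))
  )
  # vapply() returns a plain vector when new_data has one row; restore the shape
  d2 <- matrix(d2, nrow = nrow(new_data))
  # Index of the smallest distance in each row
  max.col(-d2, ties.method = "first")
}

# Usage with base R only (stand-in data, assumed for illustration)
x <- scale(iris[, 1:4])
fit <- kmeans(x[1:100, ], centers = 2, nstart = 5)
predict_kmeans(fit, x[101:105, ])
```

This only covers plain nearest-centroid assignment; it does not replicate anything else a real `predict()` method might do, such as input checks.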

```r
library(tidyverse)
library(ClusterR)
#> Loading required package: gtools
library(butcher)
data(dietary_survey_IBS)
df <- dietary_survey_IBS %>%
  select(-class) %>%
  slice_sample(n = 5e3, replace = TRUE) %>%
  scale()
km <- KMeans_rcpp(df[1:4e3, ], clusters = 2, num_init = 5, max_iters = 100, initializer = 'kmeans++')
weigh(km)
#> # A tibble: 8 × 2
#>   object                      size
#>   <chr>                      <dbl>
#> 1 clusters                0.0320  
#> 2 call                    0.00169 
#> 3 centroids               0.000888
#> 4 WCSS_per_cluster        0.000232
#> 5 obs_per_cluster         0.000232
#> 6 total_SSE               0.000056
#> 7 best_initialization     0.000056
#> 8 between.SS_DIV_total.SS 0.000056

out <- butcher(km, verbose = TRUE)
#> ✔ Memory released: "33.18 kB"
#> ✖ Disabled: `print()` and `summary()`
weigh(out)
#> # A tibble: 8 × 2
#>   object                      size
#>   <chr>                      <dbl>
#> 1 centroids               0.000888
#> 2 WCSS_per_cluster        0.000232
#> 3 obs_per_cluster         0.000232
#> 4 call                    0.000112
#> 5 total_SSE               0.000056
#> 6 best_initialization     0.000056
#> 7 between.SS_DIV_total.SS 0.000056
#> 8 clusters                0.000048

predict(km, df[4500:4510, ])
#>  [1] 2 1 1 2 2 2 1 1 1 2 2
predict(out, df[4500:4510, ])
#>  [1] 2 1 1 2 2 2 1 1 1 2 2
```

Created on 2022-11-22 with reprex v2.0.2
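The two `weigh()` tables show that nearly all of the savings come from the `clusters` and `call` components, while `centroids` is untouched and predictions are unchanged. A rough sketch of that kind of trimming on a toy list is below. It is an illustration under assumptions, not butcher's actual implementation: `strip_kmeans()` is a hypothetical helper, and `fit` is a hand-built stand-in rather than a real `KMeans_rcpp()` result.

```r
# Toy stand-in for a fitted k-means object (not a real ClusterR fit)
fit <- list(
  call = quote(KMeans_rcpp(data = df, clusters = 2)),
  clusters = sample(1:2, 4000, replace = TRUE),
  centroids = matrix(rnorm(84), nrow = 2)
)

# Hypothetical helper: drop what prediction does not need
strip_kmeans <- function(x) {
  x$call <- call("dummy_call")  # keep the slot, drop the captured arguments
  x$clusters <- numeric(0)      # training-set assignments, one per training row
  x                             # centroids are left as-is
}

small <- strip_kmeans(fit)
object.size(fit)
object.size(small)
```

Because the centroids survive, nearest-centroid prediction still works on the trimmed object, which is consistent with the identical `predict()` output above. Anything that relies on the stored call or the training assignments would no longer work, in line with the `print()` and `summary()` warning.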

I picked this one because you have it as an engine in tidyclust @EmilHvitfeldt.

juliasilge commented 1 year ago

Related to #235, to make sure we have some idea of what we are doing with unsupervised clustering algorithms in general.
