[New Engine]: Add support for "tglkmeans".

coforfe commented 2 years ago

Hi Emil,

Congrats again on this very good new initiative.

It would be great to add support for "tglkmeans", because:

It offers a much faster implementation of k-means.
Its output is fully equivalent to that of kmeans().

Thanks again, Carlos.

EmilHvitfeldt commented 2 years ago

Hello @coforfe 👋

Thanks for the interest! We will take this under advisement.

EmilHvitfeldt commented 1 year ago

Thanks again for the suggestion! After looking around, I have decided not to include {tglkmeans} as an engine in {tidyclust}. In none of my trials was it faster than stats::kmeans().

If you disagree with my findings, please let me know by opening another issue and referencing this one. Thanks!

Small data set - 250 observations 2 columns

library(tidymodels)
library(tidyclust)

set.seed(1234)
data <- rbind(
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)
colnames(data) <- c("x", "y")

bench::mark(
  check = FALSE,
  kmeans(data, 5),
  tglkmeans::TGL_kmeans(data, 5, parallel = FALSE)
)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 kmeans(data, 5)                     76.6µs 91.3µs   10444.   269.33KB    13.3 
#> 2 tglkmeans::TGL_kmeans(data, 5, par…   10ms 10.5ms      91.4    6.17MB     6.53

Medium data set - 2930 observations 275 columns

rec <- recipe(~., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_predictors())

ames_num <- prep(rec) |> 
  bake(new_data = NULL)

ames_num
#> # A tibble: 2,930 × 275
#>    Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
#>           <dbl>    <dbl>      <dbl>          <dbl>        <dbl>        <dbl>
#>  1       2.49     2.74       -0.375         -1.16        0.0610       -0.975
#>  2       0.667    0.187      -0.342         -1.12       -0.566         0.816
#>  3       0.697    0.523      -0.442         -1.26        0.0386       -1.42 
#>  4       1.06     0.128      -0.111         -0.780      -0.566        -1.42 
#>  5       0.488    0.467       0.848          0.658      -0.566        -0.527
#>  6       0.608   -0.0216      0.881          0.658      -0.454        -0.527
#>  7      -0.497   -0.663       0.980          0.802      -0.566        -0.527
#>  8      -0.437   -0.653       0.683          0.371      -0.566        -1.42 
#>  9      -0.557   -0.604       0.782          0.562      -0.566        -0.527
#> 10       0.0702  -0.336       0.914          0.706      -0.566         1.26 
#> # ℹ 2,920 more rows
#> # ℹ 269 more variables: BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
#> #   Total_Bsmt_SF <dbl>, First_Flr_SF <dbl>, Second_Flr_SF <dbl>,
#> #   Gr_Liv_Area <dbl>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
#> #   Full_Bath <dbl>, Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
#> #   TotRms_AbvGrd <dbl>, Fireplaces <dbl>, Garage_Cars <dbl>,
#> #   Garage_Area <dbl>, Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>, …

bench::mark(
  check = FALSE,
  kmeans(ames_num, 4, iter.max = 10),
  suppressWarnings(tglkmeans::TGL_kmeans(ames_num, 4, max_iter = 10, parallel = FALSE))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 kmeans(ames_num, 4, iter.max = 1…  38.4ms  56.6ms     14.8     24.8MB     14.8
#> 2 suppressWarnings(tglkmeans::TGL_… 111.5ms 114.6ms      7.31      55MB     11.7

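Large data set - intended 1000000 observations 275 columns

# Note: without replace = TRUE, slice_sample() silently caps n at
# nrow(ames_num), so this call returns 2,930 rows, not 1,000,000.
ames_num_big <- ames_num |>
  slice_sample(n = 1000000)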

bench::mark(
  check = FALSE,
  kmeans(ames_num_big, 4, iter.max = 10),
  suppressWarnings(tglkmeans::TGL_kmeans(ames_num_big, 4, max_iter = 10, parallel = FALSE))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 kmeans(ames_num_big, 4, iter.max…  43.8ms  52.1ms     19.1     24.8MB     13.4
#> 2 suppressWarnings(tglkmeans::TGL_… 122.2ms 133.1ms      5.79      55MB     13.5

Created on 2023-08-23 with reprex v2.0.2
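
The timings and the 24.8MB allocation in the large benchmark are nearly identical to those of the medium run. That is consistent with how dplyr's slice_sample() behaves when n exceeds the number of rows and replace is left at its default of FALSE: the result is silently truncated to the rows available. The following is a minimal sketch on a toy tibble showing the difference. The names toy, capped, and resampled are illustrative and do not come from the reprex.

```r
library(dplyr)

# Toy data: 10 rows, standing in for the 2,930-row ames_num.
toy <- tibble(x = 1:10)

# Default replace = FALSE: n is silently capped at nrow(toy).
capped <- slice_sample(toy, n = 100)

# replace = TRUE: rows are drawn with replacement, so n can exceed nrow(toy).
resampled <- slice_sample(toy, n = 100, replace = TRUE)

# Compare the resulting row counts.
c(capped = nrow(capped), resampled = nrow(resampled))
```

Here capped keeps only the 10 original rows while resampled has 100. If the same applies to the reprex, the large comparison would need to be rerun with replace = TRUE before it says anything about million-row inputs. No such timings are reported in this thread.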

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
