Closed coforfe closed 1 year ago
Hello @coforfe 👋
Thanks for the interest! We will take this under advisement.
Thanks again for the suggestion! After looking around, I have decided not to include {tglkmeans} as an engine in {tidyclust}. In terms of speed, it did not beat stats::kmeans() in any of my trials.
If you disagree with my findings, please let me know by opening another issue and referencing this one. Thanks!
library(tidymodels)
library(tidyclust)
set.seed(1234)
data <- rbind(
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)
colnames(data) <- c("x", "y")
bench::mark(
  check = FALSE,
  kmeans(data, 5),
  tglkmeans::TGL_kmeans(data, 5, parallel = FALSE)
)
#> # A tibble: 2 × 6
#>   expression                              min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 kmeans(data, 5)                      76.6µs 91.3µs   10444.   269.33KB    13.3
#> 2 tglkmeans::TGL_kmeans(data, 5, par…    10ms 10.5ms      91.4    6.17MB     6.53
rec <- recipe(~., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_predictors())
ames_num <- prep(rec) |>
  bake(new_data = NULL)
ames_num
#> # A tibble: 2,930 × 275
#>    Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
#>           <dbl>    <dbl>      <dbl>          <dbl>        <dbl>        <dbl>
#>  1       2.49     2.74       -0.375         -1.16        0.0610       -0.975
#>  2       0.667    0.187      -0.342         -1.12       -0.566         0.816
#>  3       0.697    0.523      -0.442         -1.26        0.0386       -1.42
#>  4       1.06     0.128      -0.111         -0.780      -0.566        -1.42
#>  5       0.488    0.467       0.848          0.658      -0.566        -0.527
#>  6       0.608   -0.0216      0.881          0.658      -0.454        -0.527
#>  7      -0.497   -0.663       0.980          0.802      -0.566        -0.527
#>  8      -0.437   -0.653       0.683          0.371      -0.566        -1.42
#>  9      -0.557   -0.604       0.782          0.562      -0.566        -0.527
#> 10       0.0702  -0.336       0.914          0.706      -0.566         1.26
#> # ℹ 2,920 more rows
#> # ℹ 269 more variables: BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
#> #   Total_Bsmt_SF <dbl>, First_Flr_SF <dbl>, Second_Flr_SF <dbl>,
#> #   Gr_Liv_Area <dbl>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
#> #   Full_Bath <dbl>, Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
#> #   TotRms_AbvGrd <dbl>, Fireplaces <dbl>, Garage_Cars <dbl>,
#> #   Garage_Area <dbl>, Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>, …
bench::mark(
  check = FALSE,
  kmeans(ames_num, 4, iter.max = 10),
  suppressWarnings(
    tglkmeans::TGL_kmeans(ames_num, 4, max_iter = 10, parallel = FALSE)
  )
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 kmeans(ames_num, 4, iter.max = 1…  38.4ms  56.6ms     14.8     24.8MB     14.8
#> 2 suppressWarnings(tglkmeans::TGL_… 111.5ms 114.6ms      7.31      55MB     11.7
ames_num_big <- ames_num |>
  slice_sample(n = 1000000)
bench::mark(
  check = FALSE,
  kmeans(ames_num_big, 4, iter.max = 10),
  suppressWarnings(
    tglkmeans::TGL_kmeans(ames_num_big, 4, max_iter = 10, parallel = FALSE)
  )
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 kmeans(ames_num_big, 4, iter.max…  43.8ms  52.1ms     19.1     24.8MB     13.4
#> 2 suppressWarnings(tglkmeans::TGL_… 122.2ms 133.1ms      5.79      55MB     13.5
Created on 2023-08-23 with reprex v2.0.2
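One caveat about the last benchmark that may be worth double-checking: as far as I know, `slice_sample()` samples without replacement by default and silently caps `n` at the number of rows available. If so, `ames_num_big` would still have 2,930 rows, which would be consistent with the identical `mem_alloc` values (24.8MB and 55MB) in the second and third benchmarks. The sketch below illustrates the behaviour on a toy tibble, assuming dplyr >= 1.0.0; the object names `small`, `capped`, and `upsampled` are made up for illustration and are not part of the reprex above.

```r
library(dplyr)

# Toy data with only 5 rows (hypothetical, for illustration only)
small <- tibble(x = 1:5)

# Default (replace = FALSE): n is capped at the number of available rows
capped <- slice_sample(small, n = 20)
nrow(capped)

# With replace = TRUE, the requested number of rows is actually returned
upsampled <- slice_sample(small, n = 20, replace = TRUE)
nrow(upsampled)
```

If a genuinely large input is wanted, adding `replace = TRUE` to the `slice_sample()` call would be one way to get it. The timings would need to be re-run after that change, so the numbers reported above should not be read as results for a 1,000,000-row data set.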
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Hi Emil,
Congrats again on this very good new initiative.
It would be very nice to include support for "tglkmeans" that:
k-means
.kmeans()
has.
Thanks again, Carlos.