When I try to use tune_cluster() with an apparent() split (because kmeans isn't often used with splits, so apparent() seems to make the most sense to me), the result has a lot of NAs. After a lot of work I eventually traced it down to something really weird: the result seems to depend on the name of the split (!?).
You can reproduce this in the docker image ubcdsci/r-dsci-100-grading:cafad0999c16.
Reprex:
library(tidyverse)
library(tidymodels)
library(tidyclust)
# start by reducing the size of mtcars just to make things cleaner (this is not important for the bug)
mt <- mtcars |> rep_sample_n(size = 10, replace = TRUE, reps = 1) |> ungroup() |> select(mpg, disp)
# specification and recipe
kmeans_spec <- k_means(num_clusters = tune()) |>
set_engine("stats")
kmeans_recipe <- recipe(~ ., data=mt) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
# tuning 1-4 clusters
ks <- tibble(num_clusters = 1:4)
# Now we create two rsets. One using apparent, one manually. They're identical except for the split name.
# RSET 1: manually created single split that just does tuning on the whole data set.
# The split can be named anything you want EXCEPT "Apparent". I named it "banana".
# Note: if you name this "Apparent", you'll see a buggy result just like if you used apparent().
indices <- list(list(analysis = 1:nrow(mt), assessment = 1:nrow(mt)))
splits <- lapply(indices, make_splits, data = mt)
split_good <- manual_rset(splits, c("banana"))
# RSET 2: using apparent.
split_bad <- apparent(mt)
# if you inspect split_good and split_bad, they're identical aside from the split name.
# Now we tune the number of clusters with each rset
results_good <- workflow() |>
add_recipe(kmeans_recipe) |>
add_model(kmeans_spec) |>
tune_cluster(resamples = split_good, grid = ks) |>
collect_metrics()
results_bad <- workflow() |>
add_recipe(kmeans_recipe) |>
add_model(kmeans_spec) |>
tune_cluster(resamples = split_bad, grid = ks) |>
collect_metrics()
An minor update: if we downgrade tune to version 1.1.2, apparent seems to work again. So perhaps there was a change in tune that broke tune_cluster() from 1.1.2 -> 1.2.0?
When I try to use
tune_cluster()
with anapparent()
split (because kmeans isn't often used with splits, soapparent()
seems to make the most sense to me), the result has a lot ofNA
s. After a lot of work I eventually traced it down to something really weird: the result seems to depend on the name of the split (!?).You can reproduce this in the docker image
ubcdsci/r-dsci-100-grading:cafad0999c16
.Reprex:
The outputs look like: