tidymodels / tidyclust

A tidy unified interface to clustering models
https://tidyclust.tidymodels.org/

Result of `tune_cluster()` depends on the name of the split? #193

Open trevorcampbell opened 2 weeks ago

trevorcampbell commented 2 weeks ago

When I use tune_cluster() with an apparent() split (k-means isn't often used with resampling splits, so apparent() seems to make the most sense to me), the result contains a lot of NAs. After a lot of work I eventually traced it to something really weird: the result seems to depend on the name of the split (!?).

You can reproduce this in the Docker image ubcdsci/r-dsci-100-grading:cafad0999c16.

Reprex:

library(tidyverse)
library(tidymodels)
library(tidyclust)

# start by reducing the size of mtcars just to make things cleaner (this is not important for the bug)
mt <- mtcars |> rep_sample_n(size = 10, replace = TRUE, reps = 1) |> ungroup() |> select(mpg, disp)

# specification and recipe
kmeans_spec <- k_means(num_clusters = tune()) |>
    set_engine("stats")

kmeans_recipe <- recipe(~ ., data=mt) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# tuning 1-4 clusters
ks <- tibble(num_clusters = 1:4)

# Now we create two rsets. One using apparent, one manually. They're identical except for the split name.

# RSET 1: manually created single split that just does tuning on the whole data set. 
# The split can be named anything you want EXCEPT "Apparent". I named it "banana".
# Note: if you name this "Apparent", you'll see a buggy result just like if you used apparent().
indices <- list(list(analysis = 1:nrow(mt), assessment = 1:nrow(mt)))
splits <- lapply(indices, make_splits, data = mt)
split_good <- manual_rset(splits, c("banana"))

# RSET 2: using apparent. 
split_bad <- apparent(mt)

# if you inspect split_good and split_bad, they're identical aside from the split name.

# Now we tune the number of clusters with each rset
results_good <- workflow() |>
    add_recipe(kmeans_recipe) |>
    add_model(kmeans_spec) |>
    tune_cluster(resamples = split_good, grid = ks) |>
    collect_metrics()

results_bad <- workflow() |>
    add_recipe(kmeans_recipe) |>
    add_model(kmeans_spec) |>
    tune_cluster(resamples = split_bad, grid = ks) |>
    collect_metrics()

The outputs look like:

[image: outputs of results_good and results_bad]

trevorcampbell commented 2 weeks ago

A minor update: if we downgrade tune to version 1.1.2, apparent() seems to work again. So perhaps a change in tune between 1.1.2 and 1.2.0 broke tune_cluster()?
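
For anyone checking which side of that version boundary their installation falls on, here is a minimal sketch. The helper name `tune_newer_than_1_1_2()` is hypothetical, written only for this illustration, and the downgrade line in the comment assumes the remotes package is installed. It only compares version strings; it does not confirm that the bug is present.

```r
# Hypothetical helper (not part of tune or tidyclust): returns TRUE when the
# given tune version is newer than 1.1.2, the version reported above as
# working with apparent().
tune_newer_than_1_1_2 <- function(
    installed = as.character(utils::packageVersion("tune"))) {
  utils::compareVersion(installed, "1.1.2") > 0
}

# Only query the installed version if tune is actually available.
if (requireNamespace("tune", quietly = TRUE)) {
  if (tune_newer_than_1_1_2()) {
    message("Installed tune is newer than 1.1.2; the apparent() issue may apply.")
  } else {
    message("Installed tune is 1.1.2 or older.")
  }
}

# To try the downgrade described above (assumes the remotes package):
# remotes::install_version("tune", version = "1.1.2")
```

Passing a version string explicitly, for example `tune_newer_than_1_1_2("1.2.0")`, makes it possible to check the comparison without tune being installed.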