mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/

CENTROIDS parameter does nothing #54

Closed Nikola4213 closed 3 months ago

Nikola4213 commented 3 months ago

The KMeans_rcpp function does not behave correctly when given the CENTROIDS parameter. It ignores the supplied centroids matrix and falls back to the "kmeans++" initializer. I read the C++ code, and there should be a flag that is set to true when a centroid matrix is detected, telling the program to skip all the other initializers; for some reason, however, R matrices are not detected correctly. Am I doing something wrong, or is this behavior expected?

Example code and output:

library(ClusterR)

data <- matrix(c( 1.0, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 7.0, 3.5, 5.0, 4.5, 5.0, 3.5, 4.5 ), nrow = 7, ncol = 2, byrow = TRUE)

initial_centroids1 <- matrix(c(0.1, 5, 5, 0.1), nrow = 2, ncol = 2, byrow = TRUE)
initial_centroids2 <- matrix(c(2.5, 3, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

km_ic1 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids1, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_ic2 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids2, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_kmpp <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "kmeans++", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_oi <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "optimal_init", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

OUTPUT:

km_ic1 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids1, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0

===================== end of initialization 1 =====================

km_ic2 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids2, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0

===================== end of initialization 1 =====================

km_kmpp <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "kmeans++", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0

===================== end of initialization 1 =====================

km_oi <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "optimal_init", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

iteration: 1 --> total WCSS: 10 --> squared norm: 0.694622
iteration: 2 --> total WCSS: 8.525 --> squared norm: 0

===================== end of initialization 1 =====================
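A quick way to see why identical traces for the two CENTROIDS runs look suspicious is to compute, in base R, the cost of assigning every point to its nearest starting centroid. The sketch below uses the same toy data and starting centroids; `assignment_cost` is a hypothetical helper written for this illustration, and its value is not claimed to match whatever ClusterR prints as "total WCSS".

```r
# Toy data and the two starting centroid sets from the example above
data <- matrix(c(1.0, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 7.0,
                 3.5, 5.0, 4.5, 5.0, 3.5, 4.5),
               nrow = 7, ncol = 2, byrow = TRUE)
initial_centroids1 <- matrix(c(0.1, 5, 5, 0.1), nrow = 2, ncol = 2, byrow = TRUE)
initial_centroids2 <- matrix(c(2.5, 3, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

# Hypothetical helper (not part of ClusterR): sum of squared Euclidean
# distances from each row of x to its nearest centroid
assignment_cost <- function(x, centroids) {
  # one column of squared distances per centroid
  d2 <- apply(centroids, 1, function(ctr) rowSums(sweep(x, 2, ctr)^2))
  sum(apply(d2, 1, min))
}

cost1 <- assignment_cost(data, initial_centroids1)  # 107.92
cost2 <- assignment_cost(data, initial_centroids2)  # 26.25
```

The two starting points give very different costs, so a run that really started from them would not be expected to report the same numbers from the first iteration onward.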

mlampros commented 3 months ago

@Nikola4213

As you probably know, once a user supplies CENTROIDS the initializers are skipped. Getting the same squared norm values in your first 3 (out of 4) cases might be due to your toy dataset, which has only 7 rows, and to the centroid values you specified. Did you try one of the real datasets included in the ClusterR package?

Nikola4213 commented 3 months ago

I first noticed the behavior using real data. I was experimenting with running KMeans_arma first to get a centroids matrix and then plugging it into KMeans_rcpp. That also produces an error: although KMeans_arma returns a matrix of centroids, it is for some reason not compatible with the requirements of the CENTROIDS parameter in KMeans_rcpp. The returned matrix has attr(,"class") [1] "k-means clustering", which might be the issue. Regardless, I can manually reconstruct a compatible centroids matrix. Doing so does not change the behavior: KMeans_rcpp still defaults to kmeans++ when a CENTROIDS input is present. Here is an example using the soybean dataset:

library(ClusterR)

data_CR <- soybean[,1:35]

centroids_arma <- KMeans_arma(data = data_CR, clusters = 3, n_iter = 100, seed_mode = "random_spread", seed = 1)

km_rcpp_with_centroids <- KMeans_rcpp(data = data_CR, clusters = 3, num_init = 1, max_iters = 100, CENTROIDS = centroids_arma, verbose = TRUE)

Error in KMeans_rcpp(data = data_CR, clusters = 3, num_init = 1, max_iters = 100, : CENTROIDS should be a matrix with number of rows equal to the number of clusters and number of columns equal to the number of columns of the data

centroids_matrix <- matrix(centroids_arma, nrow = 3, ncol = 35)

km_rcpp_with_centroids <- KMeans_rcpp(data = data_CR, clusters = 3, num_init = 1, max_iters = 100, CENTROIDS = centroids_matrix, verbose = TRUE)

iteration: 1 --> total WCSS: 7716 --> squared norm: 5.84451
iteration: 2 --> total WCSS: 4131.14 --> squared norm: 1.65376
iteration: 3 --> total WCSS: 3856.45 --> squared norm: 1.02322
iteration: 4 --> total WCSS: 3787.8 --> squared norm: 0.259271
iteration: 5 --> total WCSS: 3782.46 --> squared norm: 0.140202
iteration: 6 --> total WCSS: 3781.19 --> squared norm: 0

===================== end of initialization 1 =====================

km_rcpp_with_centroids <- KMeans_rcpp(data= data_CR, clusters = 3, num_init = 1, max_iters = 100, initializer = "kmeans++", verbose = TRUE)

iteration: 1 --> total WCSS: 7716 --> squared norm: 5.84451
iteration: 2 --> total WCSS: 4131.14 --> squared norm: 1.65376
iteration: 3 --> total WCSS: 3856.45 --> squared norm: 1.02322
iteration: 4 --> total WCSS: 3787.8 --> squared norm: 0.259271
iteration: 5 --> total WCSS: 3782.46 --> squared norm: 0.140202
iteration: 6 --> total WCSS: 3781.19 --> squared norm: 0

===================== end of initialization 1 =====================
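The class-attribute observation in this comment can be reproduced in base R without ClusterR. The sketch below assumes (the thread does not confirm this) that the CENTROIDS validation looks at the object's class rather than only at its dimensions; the matrix here is a small stand-in, not real KMeans_arma output.

```r
# Stand-in for a centroid matrix that carries a custom class attribute
centroids <- matrix(seq_len(6) / 2, nrow = 3, ncol = 2)
class(centroids) <- "k-means clustering"

# The object still has a two-dimensional dim attribute ...
still_2d <- is.matrix(centroids)                  # TRUE
# ... but a class-based check no longer sees "matrix"
seen_as_matrix <- inherits(centroids, "matrix")   # FALSE

# Two ways to recover a plain matrix: drop the class, or rebuild it
plain   <- unclass(centroids)
rebuilt <- matrix(centroids, nrow = 3, ncol = 2)

plain_ok <- inherits(plain, "matrix")             # TRUE
```

Rebuilding with `matrix()` keeps the values in place because both the input and the result are filled column by column, which is why the manual reconstruction used in this comment yields a usable centroid matrix.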

mlampros commented 3 months ago

@Nikola4213

Thank you for making me aware of this issue.

This was actually an omission on my side. I use R_NilValue as the default value for CENTROIDS in the .cpp files. Although I used it correctly in KMeans_arma and MiniBatchKmeans, I mistakenly passed R_NilValue rather than CENTROIDS as input to the KMeans_rcpp function. As a result, whenever CENTROIDS was supplied, the function received the NULL value and ran the default initializer instead.
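The kind of mistake described above can be sketched in plain R. All function names below are hypothetical stand-ins written for illustration and are not the package's actual code: an inner routine decides how to start, and a wrapper forwards either a hard-coded NULL (the bug) or the user's argument (the fix).

```r
# Hypothetical inner routine: reports which starting strategy it would use
pick_start <- function(data, clusters, centroids = NULL, initializer = "kmeans++") {
  if (is.null(centroids)) {
    paste("initializer:", initializer)
  } else {
    "user-supplied centroids"
  }
}

# Buggy wrapper: accepts CENTROIDS but forwards NULL, so it is silently ignored
buggy_wrapper <- function(data, clusters, CENTROIDS = NULL) {
  pick_start(data, clusters, centroids = NULL)
}

# Fixed wrapper: forwards the argument the user actually supplied
fixed_wrapper <- function(data, clusters, CENTROIDS = NULL) {
  pick_start(data, clusters, centroids = CENTROIDS)
}

x  <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
c0 <- x[1:2, , drop = FALSE]

buggy_result <- buggy_wrapper(x, clusters = 2, CENTROIDS = c0)  # "initializer: kmeans++"
fixed_result <- fixed_wrapper(x, clusters = 2, CENTROIDS = c0)  # "user-supplied centroids"
```

Because the buggy wrapper raises no error, the only visible symptom is that results match the default initializer, which is what the traces earlier in this thread show.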

This commit fixes the issue, and the following code snippet serves as verification:


library(ClusterR)

data(soybean)
X = soybean[, -ncol(soybean)]
y = soybean[, ncol(soybean)]

clusters = length(unique(y))
# table(y)

dat = center_scale(X)

# computation of centroids
km = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', verbose = TRUE, seed = 1)
str(km)

# the output centroids
centroids = km$centroids
str(centroids)

# we use the computed centroids as input to reproduce the output
km_centr = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, CENTROIDS = centroids, verbose = TRUE)
str(km_centr)

# we receive identical clusters
identical(x = km$clusters, y = km_centr$clusters)
# TRUE

# Dimensions of the centroids
dim(km_centr$centroids)

# Dimensions of the data
dim(dat)

# The centroids have in the first dimension the number of clusters and in the second the number of columns of the data
# for instance we can sample the rows of the data to create sample centroids
set.seed(seed = 1)
samp_rows = sample(x = 1:nrow(dat), size = clusters, replace = FALSE)
# samp_rows

# the sample (rows) centroids
sample_CENTROIDS = dat[samp_rows, , drop = FALSE]
dim(sample_CENTROIDS)

# we make sure that we don't receive the same results as previously
km_centr_sample = KMeans_rcpp(dat, clusters = clusters, CENTROIDS = sample_CENTROIDS, verbose = TRUE)
str(km_centr_sample)

# we receive different clusters compared to the initial run with initializer = 'kmeans++'
identical(x = km$clusters, y = km_centr_sample$clusters)
# FALSE

You can install the latest version using


remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')

I'll submit the updated ClusterR package tomorrow morning to CRAN.

Feel free to close the issue if the code now works as expected.

Nikola4213 commented 3 months ago

Thank you for the fix and for your time @mlampros.