Closed hsbadr closed 1 year ago
I think a single predict() function that includes the four functions that you mention is feasible as each kmeans function of the ClusterR package returns a class (and if not then a class can be added).
A 'fuzzy' parameter for 'predict_KMeans()' requires the adjustment of the corresponding Rcpp function.
A 'threads' parameter does not exist in all functions. For instance, 'KMeans_rcpp()' is currently not parallelized, and 'KMeans_arma()' can be parallelized internally if the user has configured the operating system for OpenMP (this function is based on RcppArmadillo code).
Is the batch-size that you mention related to MiniBatchKmeans() or to all ClusterR kmeans functions?
Pull requests for the mentioned features are welcome.
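As a rough illustration of the "single predict()" idea discussed above, here is a minimal base-R sketch of S3 dispatch. The class names 'toy_kmeans' and 'toy_minibatch' and the helpers 'new_toy_model()' and 'nearest_centroid()' are hypothetical stand-ins invented for this example; they are not ClusterR classes or functions.

```r
# Toy constructor standing in for a fitted model (hypothetical, not a ClusterR object).
new_toy_model <- function(centroids, class_name) {
  structure(list(centroids = centroids), class = class_name)
}

# Shared helper: index of the nearest centroid (squared Euclidean distance) per row.
nearest_centroid <- function(newdata, centroids) {
  newdata <- as.matrix(newdata)
  apply(newdata, 1, function(row) {
    # columns of t(centroids) are the centroids; subtract the row from each column
    which.min(colSums((t(centroids) - row)^2))
  })
}

# One predict() method per class; both delegate to the same helper,
# so the caller only ever writes predict(object, newdata).
predict.toy_kmeans <- function(object, newdata, ...) {
  nearest_centroid(newdata, object$centroids)
}
predict.toy_minibatch <- function(object, newdata, ...) {
  nearest_centroid(newdata, object$centroids)
}

centroids <- rbind(c(0, 0), c(10, 10))
pts <- rbind(c(1, 1), c(9, 8), c(-1, 0))

km <- new_toy_model(centroids, "toy_kmeans")
mb <- new_toy_model(centroids, "toy_minibatch")

hard_km <- predict(km, pts)
hard_mb <- predict(mb, pts)
```

Because the generic picks the method from the object's class, adding support for another model type only needs another 'predict.<class>' method.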
> - I think a single predict() function that includes the four functions that you mention is feasible as each kmeans function of the ClusterR package returns a class (and if not then a class can be added).
> - A 'fuzzy' parameter for the 'predict_KMeans()' requires the adjustment of the corresponding Rcpp function
Does predict_MBatchKMeans() support the objects generated from KMeans_rcpp() and MiniBatchKmeans()? If so, why wouldn't it supersede predict_KMeans() and become the basis for the single predict() function?
> - Is the batch-size that you mention related to MiniBatchKmeans() or to all ClusterR kmeans functions?
I mean if we start from a wrapper for all k-Means functions, it can use batch_size argument to call the appropriate function.
> Does predict_MBatchKMeans() support the objects generated from KMeans_rcpp() and MiniBatchKmeans()? If so, why wouldn't it supersede predict_KMeans() and become the basis for the single predict() function?
Recently a contributor has added new functionality to the ClusterR which is related to the predict function (you can see here, here for instance and more in the NEWS.md file). Is this what you meant in your comment?
> I mean if we start from a wrapper for all k-Means functions, it can use batch_size argument to call the appropriate function.
It seems that the latest changes do not include MiniBatchKmeans. The predict() function was added for kmeans, gmm and medoids only.
> Recently a contributor has added new functionality to the ClusterR which is related to the predict function (you can see here, here for instance and more in the NEWS.md file). Is this what you meant in your comment?
# Support `fuzzy` for probability predictions
predict.KMeansCluster <- function(object, newdata, fuzzy = FALSE, threads = 1, ...) {
  if (fuzzy) {
    predict_MBatchKMeans(newdata, CENTROIDS = object$centroids, fuzzy = fuzzy)
  } else {
    predict_KMeans(newdata, CENTROIDS = object$centroids, threads = threads)
  }
}
> I mean if we start from a wrapper for all k-Means functions, it can use batch_size argument to call the appropriate function.
# k-Means wrapper
KMeans <- function(data, clusters,
                   batch_size = 1e+07,
                   num_init = 1,
                   max_iters = 100,
                   early_stop_iter = 10,
                   init_fraction = 1.0,
                   initializer = 'kmeans++',
                   tol = 1e-4,
                   tol_optimal_init = 0.3,
                   seed = 1,
                   threads = 1,
                   CENTROIDS = NULL,
                   fuzzy = FALSE,
                   verbose = FALSE, ...) {
  if (batch_size < nrow(data)) {
    MiniBatchKmeans(data, clusters,
                    batch_size = batch_size,
                    num_init = num_init,
                    max_iters = max_iters,
                    early_stop_iter = early_stop_iter,
                    init_fraction = init_fraction,
                    initializer = initializer,
                    tol = tol,
                    tol_optimal_init = tol_optimal_init,
                    seed = seed,
                    CENTROIDS = CENTROIDS,
                    verbose = verbose)
  } else {
    KMeans_rcpp(data, clusters,
                num_init = num_init,
                max_iters = max_iters,
                initializer = initializer,
                tol = tol,
                tol_optimal_init = tol_optimal_init,
                seed = seed,
                CENTROIDS = CENTROIDS,
                fuzzy = fuzzy,
                verbose = verbose)
  }
}
The 'fuzzy' parameter currently exists only in the 'KMeans_rcpp()' function (it was an experimental feature that I added when I created the function back in 2017), and the 'batch_size' parameter is currently used in the 'MiniBatchKmeans()' function (this function was ported from the initial C code to RcppArmadillo).
PRs are welcome for the additional features and functions that you mention.
> PR's are welcome for the additional features and functions that you mention.
Actually, there's no need to add fuzzy feature in the clustering functions; it increases the size of the object for big data. But, it's helpful in the prediction function. It seems that predict_MBatchKMeans() works fine for all KMeansCluster objects. So, changing the following lines would work:
https://github.com/mlampros/ClusterR/blob/2ef5eb4f4a8eb7cf469098dbf3878a7f552f3929/R/clustering_functions.R#L600-L602
as follows:
predict.KMeansCluster <- function(object, newdata, fuzzy = FALSE, threads = 1, ...) {
  if (fuzzy) {
    predict_MBatchKMeans(newdata, CENTROIDS = object$centroids, fuzzy = fuzzy)
  } else {
    predict_KMeans(newdata, CENTROIDS = object$centroids, threads = threads)
  }
}
If you agree, I'll create a PR.
From what I see, the current 'predict.KMeansCluster' function includes the 'threads' parameter,
predict.KMeansCluster <- function(object, newdata, threads = 1, ...)
Adding the 'fuzzy' parameter as you suggest requires the adjustment of the corresponding Rcpp function. The 'predict.MedoidsCluster' function includes a 'fuzzy' parameter because it already returns the (fuzzy) probabilities,
predict.MedoidsCluster <- function(object, newdata, fuzzy = FALSE, threads = 1, ...)
Now I see that in the previous PR the 'predict_MBatchKMeans()' function was not included in the 'predict()' function. I'll have time later today and tomorrow; I'll make the modifications and notify you once I push the changes.
@hsbadr I just updated the code, now the following work,
require(ClusterR)
data(dietary_survey_IBS)
dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
dat = center_scale(dat)
# kmeans
km = KMeans_rcpp(dat, clusters = 4, num_init = 5, max_iters = 100, initializer = 'kmeans++')
str(km)
preds = predict(object = km, newdata = dat, fuzzy = FALSE, threads = 1)
str(preds)
# num [1:400] 3 3 3 1 4 4 3 4 3 4 ...
preds_fuzzy = predict(object = km, newdata = dat, fuzzy = TRUE, threads = 1)
str(preds_fuzzy)
# num [1:400, 1:4] 0.246 0.273 0.263 0.321 0.281 ...
table(preds, apply(preds_fuzzy, 1, which.max) - 1)
# preds   0   1   2   3
#     1  28   0   0   0
#     2   0 200   0   0
#     3   0   0  63   0
#     4   0   0   0 109
# Mini-Batch-Kmeans
mbkm = MiniBatchKmeans(dat, clusters = 4, batch_size = 20, num_init = 5, early_stop_iter = 10)
str(mbkm)
preds_mbkm = predict(object = mbkm, newdata = dat, fuzzy = FALSE)
str(preds_mbkm)
# num [1:400] 3 3 3 3 3 3 3 3 3 3 ...
preds_fuzzy_mbkm = predict(object = mbkm, newdata = dat, fuzzy = TRUE)
str(preds_fuzzy_mbkm)
# num [1:400, 1:4] 0.234 0.152 0.232 0.198 0.217 ...
table(preds_mbkm, apply(preds_fuzzy_mbkm, 1, which.max) - 1)
# preds_mbkm   0   1   2   3
#          1   8   0   0   0
#          2   0 193   0   0
#          3   0   0 197   0
#          4   0   0   0   2
In the next couple of days I'll add a few test-cases. You can install the updated version from Github using,
remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')
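The cross-tabulations in the example above check that the hard cluster labels agree with the most probable column of the fuzzy output. The same check can be sketched on toy data in base R. The soft-assignment rule used here (weights proportional to exp(-squared distance)) is an assumption chosen only for illustration and is not necessarily how ClusterR computes its fuzzy probabilities; 'sq_dist()' and 'soft_assign()' are hypothetical helpers.

```r
# Squared Euclidean distance from every row of x to every centroid (rows of centroids).
sq_dist <- function(x, centroids) {
  x <- as.matrix(x)
  d <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(x, 2, centroids[k, ])^2)
  })
  matrix(d, nrow = nrow(x))
}

# Assumed soft assignment: weights proportional to exp(-distance); each row sums to 1.
soft_assign <- function(x, centroids) {
  d <- sq_dist(x, centroids)
  w <- exp(-(d - apply(d, 1, min)))  # subtract the row minimum for numerical stability
  w / rowSums(w)
}

centroids <- rbind(c(0, 0), c(5, 5))
x <- rbind(c(0.5, 0.2), c(4.8, 5.1), c(0.1, -0.3), c(5.5, 4.9))

hard <- apply(sq_dist(x, centroids), 1, which.min)  # nearest centroid per row
soft <- soft_assign(x, centroids)                   # one probability column per centroid

# which.max() returns 1-based column indices, so they compare directly with 'hard'.
agree <- all(hard == apply(soft, 1, which.max))
```

Under this assumed rule the largest probability always belongs to the nearest centroid, so 'agree' holds whenever there are no exact ties in distance.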
> I just updated the code, now the following work,
Looks good. Thanks @mlampros!
The only thing is that you've changed the behavior when fuzzy = TRUE. Originally, it was returning a structure with the list of both clusters and probabilities; something like
return(
  structure(
    list(
      clusters = as.vector(res$clusters + 1),
      fuzzy_clusters = res$fuzzy_probs
    ),
    class = "k-means clustering"
  )
)
Now, it only returns clusters or fuzzy_probs/fuzzy_clusters:
https://github.com/mlampros/ClusterR/blob/f1d461f6229c91331c9a6fcdd7f8f29a8d2713ea/R/clustering_functions.R#L601-L606
In short, predict_KMeans() and predict_MBatchKMeans() have different return values when fuzzy = TRUE.
> In short, predict_KMeans() and predict_MBatchKMeans() have different return values when fuzzy = TRUE
I didn't change the output object of either predict_KMeans() or predict_MBatchKMeans(), because that would cause test errors. It's true that these functions do not return the same object. In any case, you can now use the 'predict()' function directly, which I think serves this purpose.
I could match the output objects that these two functions return but this will be a breaking change and requires a deprecation warning for a specific number of versions. I'll do that in the next days and I'll also include the test cases.
I updated the ClusterR package by adding tests for the unified predict function (predict_KMeans, predict_MBatchKMeans). I also added a deprecation warning to 'predict_MBatchKMeans'. The following code snippet shows the output format that will become the default starting from version 1.4.0:
require(ClusterR)
data(dietary_survey_IBS)
dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
dat = center_scale(dat)
# Mini-Batch-Kmeans
mbkm = MiniBatchKmeans(dat, clusters = 4, batch_size = 20, num_init = 5, early_stop_iter = 10)
str(mbkm)
# current output format (which shows a deprecation warning)
pred_mbkm = predict_MBatchKMeans(data = dat, CENTROIDS = mbkm$centroids, fuzzy = TRUE, updated_output = FALSE)
# Warning message:
# `predict_MBatchKMeans()` was deprecated in ClusterR 1.3.0.
# ℹ Beginning from version 1.4.0, if the fuzzy parameter is TRUE the function 'predict_MBatchKMeans' will return only the probabilities, whereas currently it also returns the hard clusters
# This warning is displayed once every 8 hours.
# Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
str(pred_mbkm)
# List of 2
# $ clusters : num [1:400] 3 3 3 3 3 3 3 3 3 3 ...
# $ fuzzy_clusters: num [1:400, 1:4] 0.234 0.152 0.232 0.198 0.217 ...
# - attr(*, "class")= chr "k-means clustering"
# new output format (beginning from version 1.4.0, the 'updated_output' parameter will be removed and this output format will become the default)
pred_mbkm = predict_MBatchKMeans(data = dat, CENTROIDS = mbkm$centroids, fuzzy = TRUE, updated_output = TRUE)
str(pred_mbkm)
# num [1:400, 1:4] 0.234 0.152 0.232 0.198 0.217 ...
I'll go ahead and submit the new version to CRAN. I'll close the issue for now; feel free to re-open it in case the code does not work as expected.
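For code that has to work across the transition, the two output formats shown above (a list with 'clusters' and 'fuzzy_clusters', or a bare probability matrix) can be normalized with a small helper. This is a minimal sketch; 'as_fuzzy_matrix()' is a hypothetical name, not part of ClusterR, and the objects below are hand-built mock-ups that only mimic the str() output shown above.

```r
# Hypothetical helper: return the probability matrix from either output format.
as_fuzzy_matrix <- function(pred) {
  if (is.matrix(pred)) {
    return(pred)                 # new format: the probability matrix itself
  }
  if (is.list(pred) && !is.null(pred$fuzzy_clusters)) {
    return(pred$fuzzy_clusters)  # old format: list with 'clusters' and 'fuzzy_clusters'
  }
  stop("unrecognized prediction format")
}

# Mock-ups of the two formats (values are made up for the example).
probs <- matrix(c(0.7, 0.2, 0.3, 0.8), nrow = 2)
old_style <- structure(
  list(clusters = c(1, 2), fuzzy_clusters = probs),
  class = "k-means clustering"
)
new_style <- probs

same <- identical(as_fuzzy_matrix(old_style), as_fuzzy_matrix(new_style))
```

Downstream code that calls the helper does not need to know which version of the function produced the prediction.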
It would be nice to have a unified interface for k-Means functions (predict_KMeans() and predict_MBatchKMeans() as well as KMeans_rcpp() and MiniBatchKmeans()):
- fuzzy in predict_KMeans() for probability predictions
- threads to specify the number of threads for parallel processing in all functions.
- batch_size or object class. if (batch_size < nrow(data)), if (is.null(batch_size)), or if (missing(batch_size)).
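The three conditions mentioned above behave differently: missing() is TRUE only when the argument was not supplied, is.null() catches an explicit NULL, and the comparison with nrow(data) needs a numeric value. A minimal base-R sketch that combines them follows; 'choose_backend()' and its return labels are hypothetical and only illustrate the selection logic, not any ClusterR function.

```r
# Hypothetical selector: decide between full k-means and mini-batch k-means.
choose_backend <- function(n_rows, batch_size = NULL) {
  # '||' short-circuits, so the numeric comparison never sees a NULL batch_size
  if (missing(batch_size) || is.null(batch_size) || batch_size >= n_rows) {
    "full"
  } else {
    "minibatch"
  }
}

no_arg    <- choose_backend(100)                     # batch_size not supplied
null_arg  <- choose_backend(100, batch_size = NULL)  # explicit NULL
small_arg <- choose_backend(100, batch_size = 20)    # smaller than the data
large_arg <- choose_backend(100, batch_size = 1e7)   # larger than the data
```

With a default of NULL, the missing() check is redundant with is.null(); it is kept here only to show that either test can guard the comparison.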