mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/
84 stars 29 forks source link

Add predict methods #24

Closed vspinu closed 2 years ago

vspinu commented 3 years ago

I love this package because it's fills the bill for all clustering methods in one place. But the interface and UI is rather non-Rish.

As a starter I am adding predict methods for cluster objects. For this I

If these changes are ok with you I will add tests, update vignette etc. Also, time permitting, will add print, plot and summary methods to make the UI of the package more user friendly for the seasoned R users.

With this change one can do:

     gmm <- GMM(dat, 2, "maha_dist", "random_subset", 10, 10)
     predict(gmm, dat)

instead of the current

     gmm <- GMM(dat, 2, "maha_dist", "random_subset", 10, 10)
     predict_GMM(dat, gmm$centroids, gmm$covariance_matrices, gmm$weights)
mlampros commented 3 years ago

@vspinu thanks for the contribution. give me some time to review the changes

mlampros commented 3 years ago

@vspinu thanks again for the additions and the corrections (regarding the classes). I'll be glad if you add also tests and update the vignette. It will be also nice to have the print, plot and summary methods. I'd rather prefer to keep the old predict functions than adding deprecation messages. I think keeping the old functions is also necessary for the users that already use these functions in their workflows or packages. Once this PR is completed please mention your necessary personal information (vorname, lastname, e-mail) so that I can add you as a contributor to the package DESCRIPTION file. thanks again.

vspinu commented 3 years ago

I'll be glad if you add also tests and update the vignette. It will be also nice to have the print, plot and summary methods.

Ok. I will spend some time on this.

I'd rather prefer to keep the old predict functions than adding deprecation messages.

Of course. I didn't mean to deprecate those.

github-actions[bot] commented 3 years ago

This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs.

mlampros commented 3 years ago

I've added a message so that the PR is not closed from the github-actions-bot till the changes are applied (tests, etc.)

vspinu commented 3 years ago

Sorry for stalling this. Things started piling on my weekends suddenly, will try to resume this week.

github-actions[bot] commented 3 years ago

This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs.

vspinu commented 3 years ago

I have re-added the comment, added tests and adjusted the vignette to use predict() method. Also added print() methods. The objects are now printed succinctly as follows:

> gmm
GMM Cluster
 Call: GMM(data = dat, gaussian_comps = 3) 
 Data cols: 42 
 Centroids: 3 
> km
KMeans Cluster
 Call: KMeans_rcpp(data = dat, clusters = clust, num_init = 5, max_iters = 100,      initializer = "optimal_init") 
 Data cols: 42 
 Centroids: 2 
 BSS/SS: 0.2800772 
 SS: 1312201 = 944683.6 (WSS) + 367517.6 (BSS)
> cm
Medoids Cluster
 Call: Clara_Medoids(data = dat, clusters = 2, samples = 5, sample_size = 0.2,      swap_phase = TRUE, fuzzy = T) 
 Data cols: 42 
 Medoids: 2 
 Cluster stats:
                                  2          3
   index                        278         34
   number_obs                   226        174
   max_dissimilarity      91.879999 114.495518
   average_dissimilarity  36.482054  67.212241
   isolation               1.396992   1.740851

You can modify those as you see fit.

One problem is Kmeans_arma which returns only centroids instead of the complete structure as returned by KMeans_rcpp. So the returned object is not classed and I have added a FIXME comment there.

Regarding the UI, for consistency and simplicity, I would propose to actually have only 3 top level functions: KMeans, Medoids and GMM instead of the current (KMeans_rcpp, Kmeans_arma, Cluster_Medoids, Clara_Medoids and GMM). KMeans would have a method argument "rcpp" or "arma", Medoids() would have method argument "pam" or "clara". WDYT?

mlampros commented 3 years ago

First of all thanks for all your implementations. Let me see the changes in the next days.

Regarding the UI, I think it's a good idea but we have to add deprecation messages for the previous functions so that the users can get accustomed to the new functions.

github-actions[bot] commented 2 years ago

This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs.

mlampros commented 2 years ago

I've added a message so that this PR is not closed.

mlampros commented 2 years ago

@vspinu,

can you also update the NEWS.md file by adding your changes under version 1.2.6?

vspinu commented 2 years ago

Done! I have also bumped the dev version.

github-actions[bot] commented 2 years ago

This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs.

mlampros commented 2 years ago

thank you for all the changes, I'll need a few days to do the following:

mlampros commented 2 years ago

@vspinu, specific tests fail. are the failures related to your changes? It seems that the functions of the failed tests require a matrix.

vspinu commented 2 years ago

Yes. It has to do with the fact that interally C++ is using positional references instead of names. So each time one modifies the fields of the result internal code would break.

I have fixed that by addressing with names. I had to rename a few internal slots for that. This also makes the internal names consistent with the user level output. I don't think you will object that, but please have a look at the last commit.

mlampros commented 2 years ago

thank you @vspinu I'll have a look today in the afternoon

mlampros commented 2 years ago

thank you very much @vspinu for all the changes. Give me a few days to update the version on CRAN. I'll add also your details in the DESCRIPTION file, is it ok by you?

person("Vitalie", "Spinu", role = "ctb")

Let me know. thanks.

vspinu commented 2 years ago

I'll add also your details in the DESCRIPTION file, is it ok by you?

Sure. Thanks! I still think my contribution is a bit too slim for a contributor tag, but ok. I will try to get some more stuff in as I go with using the package.

mlampros commented 2 years ago

@vspinu,

I intended today to submit the new version of the ClusterR package to CRAN and I ran an R CMD build after I cloned the Github repository. However, on Linux I receive the following error:


> system("R CMD build ClusterR")
* checking for file ‘ClusterR/DESCRIPTION’ ... OK
* preparing ‘ClusterR’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building ‘the_clusterR_package.Rmd’ using rmarkdown
Quitting from lines 189-198 (the_clusterR_package.Rmd) 
Error: processing vignette 'the_clusterR_package.Rmd' failed with diagnostics:
no applicable method for 'predict' applied to an object of class "c('KMeansCluster', 'k-means clustering')"
--- failed re-building ‘the_clusterR_package.Rmd’

SUMMARY: processing the following file failed:
  ‘the_clusterR_package.Rmd’

Error: Vignette re-building failed.
Execution halted

Can you reproduce this error locally with the cloned repository? It seems that it's related to this pull request.

mlampros commented 2 years ago

I explained in another PR on how I build and check an R package that includes Rcpp code. Therefore, if you want to reproduce the error can you follow these steps?

vspinu commented 2 years ago

I am pretty sure everything is ok and it's some strange local issue. Those methods are registered in the NAMESPACE file. Make sure you don't introduce any diffs to NAMESPACE with your process.

R CMD build ClusterR and checks do work ok for me. It also works on travis, right?

I have also triggered a winbuild (don't be surprised to see the email with a result).

mlampros commented 2 years ago

thanks @vspinu, the winbuild looks ok (see the attached). the github action of this PR worked too. I'm wondering why I get this error. I'll have a look again and notify you.

00check_winbuilder.log

mlampros commented 2 years ago

you were right about the differences in the NAMESPACE file but I do not understand why do I get these diffs? I'm not experienced with S3 Methods. What I do is I make a clean build, check and install after I remove the .Rd files of the 'man' folder and create the documentation using devtools::document(). Is there something wrong with this procedure? For instance before the clean build, check I have

S3method(predict,GMMCluster)
S3method(predict,KMeansCluster)
S3method(predict,MedoidsCluster)
S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)

in the NAMESPACE and after I use devtools::document() I have

S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)

do we have to use Collate in the DESCRIPTION file now that the ClusterR package makes use of S3 methods?

vspinu commented 2 years ago

I also just use devtools::document. Can it be an old version of devtools or roxygen2?

mlampros commented 2 years ago

I use the latest versions of 'devtools' (2.4.2) and 'roxygen2' (7.1.2) on Linux with R version 4.1.1 (2021-08-10)

Which Operating System do you use? Is it possible that you can run the `devtools::document()' function on a Linux machine? It seems to me that it has to do with how we document the S3 methods (a similar SO issue)

When I run devtools::document() I receive,


S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)
export(print.GMMCluster)
export(predict.GMMCluster)
export(predict.KMeansCluster)
export(predict.MedoidsCluster)

whereas currently in this repo we have,


S3method(predict,GMMCluster)
S3method(predict,KMeansCluster)
S3method(predict,MedoidsCluster)
S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)

The Run Rscript -e 'tic::script()' that the Github Action of this repo uses just only 'builds' and 'checks' the package and does not create the documentation or updates the NAMESPACE.

I know that 'devtools::document' is not a requirement but if I'm not able to re-create the .Rd files then I might have issues in the future in case that I want to update any function of the package

vspinu commented 2 years ago

When I run devtools::document() I receive,

But that's exactly what is needed. It fixes your original issue, right?

The export(print.GMMCluster) is optional and is probably there because you have added #' @export print.GMMCluster. I don't think it's necessary to export those functions, but it's surely up to you.

I am on linux.

mlampros commented 2 years ago

But that's exactly what is needed. It fixes your original issue, right?

I'm still receiving the error,

> system("R CMD build ClusterR")
* checking for file ‘ClusterR/DESCRIPTION’ ... OK
* preparing ‘ClusterR’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building ‘the_clusterR_package.Rmd’ using rmarkdown
Quitting from lines 189-198 (the_clusterR_package.Rmd) 
Error: processing vignette 'the_clusterR_package.Rmd' failed with diagnostics:
no applicable method for 'predict' applied to an object of class "c('KMeansCluster', 'k-means clustering')"
--- failed re-building ‘the_clusterR_package.Rmd’

SUMMARY: processing the following file failed:
  ‘the_clusterR_package.Rmd’

Error: Vignette re-building failed.
Execution halted

and my NAMESPACE file includes the following,

# Generated by roxygen2: do not edit by hand

S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)
export(AP_affinity_propagation)
export(AP_preferenceRange)
export(Clara_Medoids)
export(Cluster_Medoids)
export(GMM)
export(KMeans_arma)
export(KMeans_rcpp)
export(MiniBatchKmeans)
export(Optimal_Clusters_GMM)
export(Optimal_Clusters_KMeans)
export(Optimal_Clusters_Medoids)
export(Silhouette_Dissimilarity_Plot)
export(center_scale)
export(distance_matrix)
export(external_validation)
export(plot_2d)
export(predict.GMMCluster)
export(predict.KMeansCluster)
export(predict.MedoidsCluster)
export(predict_GMM)
export(predict_KMeans)
export(predict_MBatchKMeans)
export(predict_Medoids)
import(gtools)
importFrom(Rcpp,evalCpp)
importFrom(ggplot2,aes)
importFrom(ggplot2,element_blank)
importFrom(ggplot2,geom_point)
importFrom(ggplot2,ggplot)
importFrom(ggplot2,scale_shape_manual)
importFrom(ggplot2,scale_size_manual)
importFrom(ggplot2,theme)
importFrom(gmp,as.bigz)
importFrom(gmp,asNumeric)
importFrom(gmp,chooseZ)
importFrom(grDevices,dev.cur)
importFrom(grDevices,dev.off)
importFrom(graphics,abline)
importFrom(graphics,axis)
importFrom(graphics,barplot)
importFrom(graphics,legend)
importFrom(graphics,lines)
importFrom(graphics,mtext)
importFrom(graphics,par)
importFrom(graphics,plot)
importFrom(graphics,text)
importFrom(graphics,title)
importFrom(stats,na.omit)
importFrom(utils,globalVariables)
importFrom(utils,setTxtProgressBar)
importFrom(utils,txtProgressBar)
useDynLib(ClusterR, .registration = TRUE)

I don't know why and you can't reproduce it although on linux

vspinu commented 2 years ago

That's a bit crazy. CMD build is not touching the namespace. So what is changing your namespace then? Didn't you say that devtools::document generates the NAMESPACE with predict exports in this comment?

I have a hunch. Could you add predict to the list of imported functions from stats here.

Do you have stats package in your search path when you run devtools::document? For me stats is always loaded by default.

mlampros commented 2 years ago

devtools::document generates the NAMESPACE with predict exports in this comment?

In this comment I modified you function to make the R CMD build work without success. Don't take this comment into account.

I have a hunch. Could you add predict to the list of imported functions from stats here.

I tried this as well but I receive the same error.

For me stats is always loaded by default.

Althouth the stats package is a System Library package I always deal with its functions as if it were a User Library package. This because I receive NOTEs when I perform R CMD check --as-cran Therefore, if functions of the stats package exist in my code then I always include it in the imports of the DESCRIPTION file.

Although I had the chance to go through the S3 and S4 classes in the past, I preferred in general the R6 classes due to its simplicity and familiarity with Python classes. One of the reasons that I'm now not in position to recognize the source of the problem that arises from the predict error during the R CMD build of the vignettes.

mlampros commented 2 years ago

I submitted the updated version to CRAN (1.2.6). This was actually a problem with the R version I used. After updating to 4.1.1 the issue was resolved.

vspinu commented 2 years ago

Perfectos! Happy to hear it solved by itself :)