mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/
84 stars 29 forks

Minimum number of clusters argument #15

Closed GarryGelade closed 5 years ago

GarryGelade commented 5 years ago

I think it might be useful to have a minclusters argument for the OptimalClusters type functions.

Instead of processing all numbers of clusters between 1 and maxclusters, the function would just examine the numbers between minclusters and maxclusters.

This would enable users to explore selected segments of the clustering space without having to process large numbers of cluster solutions. This would be very useful for large clustering problems which can be quite time-consuming.

Another useful option would be the ability to specify a vector of cluster numbers such as (1,10,20,30,40,50), which would allow quick exploration of the cluster space.

Thanks Garry

mlampros commented 5 years ago

hello @GarryGelade and I'm sorry for the late reply,

you are right, this would actually be a nice feature. I took a look at the relevant code snippets and I have to modify

The options will be

If you are not in a hurry, it might take a couple of days as I'm currently working on other stuff too. In any case I'll notify you once I upload the updated version to GitHub. thanks.

GarryGelade commented 5 years ago

Dear Lampros

Great! Thanks so much.

Regards

Garry

From: Lampros Mouselimis notifications@github.com Sent: 24 March 2019 12:38 To: mlampros/ClusterR ClusterR@noreply.github.com Cc: GarryGelade garry@business-analytic.co.uk; Mention mention@noreply.github.com Subject: Re: [mlampros/ClusterR] Minimum number of clusters argument (#15)

mlampros commented 5 years ago

@GarryGelade,

I attempted to modify the 'Optimal_Clusters_KMeans' function yesterday. It is possible; however, the plotting of non-contiguous sequences might break things, so I'll have to re-implement it from scratch and I currently do not have the time. Is plotting of sequences (except for single values) a requirement for you, or have you thought solely of a vector consisting of the results based on the evaluation metric?

GarryGelade commented 5 years ago

Dear Lampros

I am not so interested in the plot, as I can reproduce that myself. A vector of evaluation metric scores would be fine.

Regards

Garry

mlampros commented 5 years ago

@GarryGelade,

I've completed the first function, 'Optimal_Clusters_KMeans' (it applies to both KMeans_rcpp and MiniBatchKmeans). Please read the NEWS.md file about the limitations. If this is one of the functions that you intended to use, please give it a try and let me know. I've also added test cases for the applicable 'criteria'. You can download the updated version (1.1.9) using


devtools::install_github('mlampros/ClusterR')

I'll continue tomorrow with the other two functions ('Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids').

GarryGelade commented 5 years ago

Dear Lampros

Unfortunately I got an installation error:

devtools::install_github('mlampros/ClusterR')
Downloading GitHub repo mlampros/ClusterR@master
from URL https://api.github.com/repos/mlampros/ClusterR/zipball/master
Installing ClusterR

Installing 1 package: ggplot2
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/ggplot2_3.1.0.zip'
Content type 'application/zip' length 3623184 bytes (3.5 MB)
downloaded 3.5 MB
package ‘ggplot2’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
    C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages

Installing 1 package: gmp
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/gmp_0.5-13.5.zip'
Content type 'application/zip' length 1109717 bytes (1.1 MB)
downloaded 1.1 MB
package ‘gmp’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
    C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages

Installing 1 package: gtools
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/gtools_3.8.1.zip'
Content type 'application/zip' length 325812 bytes (318 KB)
downloaded 318 KB
package ‘gtools’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
    C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages

Installing 1 package: Rcpp
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/Rcpp_1.0.1.zip'
Content type 'application/zip' length 4509616 bytes (4.3 MB)
downloaded 4.3 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
The downloaded binary packages are in
    C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages

Installing 1 package: RcppArmadillo
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
also installing the dependency ‘Rcpp’
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/Rcpp_1.0.1.zip'
Content type 'application/zip' length 4509616 bytes (4.3 MB)
downloaded 4.3 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/RcppArmadillo_0.9.300.2.0.zip'
Content type 'application/zip' length 2252589 bytes (2.1 MB)
downloaded 2.1 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
package ‘RcppArmadillo’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
    C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages

"C:/PROGRA~1/R/R-35~1.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL \
"C:/Users/garry/AppData/Local/Temp/RtmpaikF3W/devtools1e50491d5d36/mlampros-ClusterR-59c0cab" --library="C:/rPackages" --install-tests
ERROR: dependency 'Rcpp' is not available for package 'ClusterR'
In R CMD INSTALL
Installation failed: Command failed (1)

Any thoughts?

Garry

mlampros commented 5 years ago

@GarryGelade,

can you try with 'dependencies = FALSE'? The problem appears during the removal of the old version of 'Rcpp'. So if you have 'Rcpp' installed and its version is >= 0.12.5, use


devtools::install_github('mlampros/ClusterR', dependencies = FALSE)
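
The version condition above can be checked before installing. The following is only a minimal sketch: `has_min_version` is a hypothetical helper (it is not part of devtools or ClusterR), and it deliberately avoids loading the package, on the assumption that a loaded 'Rcpp' is what prevents the old installation from being removed on Windows.

```r
# Hypothetical helper (not part of devtools or ClusterR): returns TRUE only
# when `pkg` is installed and its version is at least `min_version`.
# system.file() and packageVersion() only read files on disk, so the
# package itself is not loaded into the session.
has_min_version <- function(pkg, min_version) {
  if (!nzchar(system.file(package = pkg))) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= min_version
}

# Skip dependency re-installation only when a recent enough Rcpp is present.
# The install call is commented out so the snippet has no side effects.
if (has_min_version("Rcpp", "0.12.5")) {
  # devtools::install_github('mlampros/ClusterR', dependencies = FALSE)
  message("Rcpp >= 0.12.5 found; dependencies = FALSE should be fine")
} else {
  message("Rcpp missing or too old; install or update it in a fresh R session first")
}
```

Running this in a freshly started R session, before any package is attached, is the safest way to try it.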

GarryGelade commented 5 years ago

Thanks. It was actually a session problem. When I restarted R with a clean session, the installation worked.

GarryGelade commented 5 years ago

Dear Lampros

Looks like it works!

library(ClusterR)
library(ggplot2)
library(magrittr)  # provides %>%

maxclus <- c(1, 10, 20, 30, 40, 50, 60, 70)
optK <- Optimal_Clusters_KMeans(train.data.std, maxclus, criterion = "BIC",
                                fK_threshold = 0.85, num_init = 1, max_iters = 100,
                                initializer = "kmeans++", tol = 1e-04, plot_clusters = TRUE,
                                verbose = TRUE, tol_optimal_init = 0.3, seed = 1)

# one BIC score per requested number of clusters
bic <- cbind(optK, maxclus) %>% as.data.frame()
names(bic) <- c("BIC", "nclus")

ggplot(bic, aes(y = BIC, x = nclus)) + geom_line() + geom_point() +
  theme_bw() + geom_vline(xintercept = 16, linetype = "dotted")

Regards, Garry

mlampros commented 5 years ago

@GarryGelade,

I uploaded the updated versions of the other two functions too ('Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids'). I'll keep this issue open for a few days before I upload the updated version to CRAN, so let me know if they do not work as expected.

GarryGelade commented 5 years ago

GMM seems to be working OK, but I get an error with medoids:

maxclus <- 2
opt <- Optimal_Clusters_Medoids(train.data.std, maxclus, distance_metric = "euclidean",
                                criterion = "dissimilarity", clara_samples = 0,
                                clara_sample_size = 0, minkowski_p = 1, swap_phase = TRUE,
                                threads = 1, verbose = FALSE, plot_clusters = FALSE, seed = 1)

Error in OptClust(data, pass_vector, distance_metric, FALSE, clara_samples, :
std::bad_alloc

mlampros commented 5 years ago

@GarryGelade thanks for making me aware of this error,

I tried to reproduce this error with the 3 data sets included in the ClusterR package

but I can't reproduce your error with 'max_clusters = 2', which means this error might have to do with your data set. Would you mind sharing a reproducible example using your 'train.data.std' (if possible) so that I can fix the error and add a test case for this purpose? thanks.

GarryGelade commented 5 years ago

Dear Lampros

I tried to send you my data, but it is 4Mb, and it seems to have been rejected by the server. I will try to Zip it.

This is the mail system at host outmx-028.london.gridhost.co.uk.

I'm sorry to have to inform you that your message could not be delivered to one or more recipients. It's attached below.

For further assistance, please send mail to postmaster.

If you do so, please include this problem report. You can delete your own text from the attached returned message.

               The mail system

reply@reply.github.com:

message size 5701448 exceeds size limit 5120000 of server

in-2.smtp.github.com[192.30.253.171]

mlampros commented 5 years ago

hi @GarryGelade,

if you also receive the error with a subset of your initial data, then you can send me the subset.

GarryGelade commented 5 years ago

Dear Lampros

The size of the data makes a difference.

When I use a dataset of 50000 examples, my computer completely freezes.

I will send you the data in 2 parts.

Train.data.std1.RDS = rows 1:50000

Train.data.std2.RDS = rows 50001:100000

Regards

GarryGelade commented 5 years ago

First half of data

GarryGelade commented 5 years ago

Second half of data

GarryGelade commented 5 years ago

NB if I only use 500 rows, the function performs OK, so the problem is something to do with large datasets.

mlampros commented 5 years ago

hi @GarryGelade,

what do you mean by 'First half of data' and 'Second half of data'? I don't see any observations. Are you attempting to upload the data to a specific account? thanks.

mlampros commented 5 years ago

@GarryGelade just an additional note,

it is highly probable that the std::bad_alloc error that you receive is related to the size of your data and your computer's RAM. You receive this error in the Optimal_Clusters_Medoids() function, which takes your data and computes a distance matrix. That means that if your data consists of 100,000 observations, the Optimal_Clusters_Medoids() function will first attempt to build a distance matrix of size 100,000 x 100,000. There are some threads on the web which can give you a hint on how much memory your data set will require, such as this one. If this is the case, I would suggest that you use the Clara Medoids function when you compute the optimal clusters, which performs clustering based on samples of the input data set. You can find more information about the clara_samples and clara_sample_size parameters in the package documentation,


Optimal_Clusters_Medoids(data, 
                         max_clusters, 
                         distance_metric,
                         criterion = "dissimilarity", 
                         clara_samples = 0,
                         clara_sample_size = 0, 
                         minkowski_p = 1, 
                         swap_phase = TRUE,
                         threads = 1, 
                         verbose = FALSE, 
                         plot_clusters = TRUE,
                         seed = 1)
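
As a rough illustration of the memory argument above, the size of such a distance matrix can be estimated with plain arithmetic. This is a back-of-the-envelope sketch, not ClusterR code: `dist_matrix_gb` is a hypothetical helper, and it assumes a full dense n x n matrix of 8-byte doubles with no additional copies, which may differ from what the package actually allocates.

```r
# Hypothetical helper: approximate size, in gigabytes (1e9 bytes), of a dense
# n_obs x n_obs matrix. Assumes 8 bytes per value (double precision) and
# ignores any temporary copies made during the computation.
dist_matrix_gb <- function(n_obs, bytes_per_value = 8) {
  n_obs^2 * bytes_per_value / 1e9
}

dist_matrix_gb(500)     # 0.002 GB, i.e. about 2 MB
dist_matrix_gb(50000)   # 20 GB
dist_matrix_gb(100000)  # 80 GB
```

Under these assumptions, 500 rows fit easily in memory, while 50,000 or 100,000 rows would exceed the RAM of most personal computers, which is consistent with the behaviour reported in this thread. Setting clara_samples and clara_sample_size to non-zero values (see the package documentation for the valid ranges) makes the function work on samples of the data instead.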

stale[bot] commented 5 years ago

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.