snoweye / EMCluster

EM Algorithm for Model-Based Clustering of Finite Mixture Gaussian Distribution
Mozilla Public License 2.0

RStudio session aborts when all samples have a label #7

Martingales closed this issue 4 years ago

Martingales commented 4 years ago

Hi everyone,

I use init.EM for semi-supervised clustering and found that running it aborts the R session when all samples have a label.

Background: I want to use init.EM because I suspect that each of my existing clusters contains subclusters, but I also want to allow samples from other clusters to join a subcluster. In essence, I want my existing labels to act only as prior information, not to determine the final clustering. That is why I would like K to be larger than the number of existing clusters.

The error should be reproducible with the dev's own example:

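library(EMCluster)

# Example data shipped with EMCluster; every sample in da1$class has a label
x <- da1$da
lab <- da1$class
# Request more components than there are existing classes
k <- 12

# This call aborts the R session
ret.Rnd <- init.EM(x, nclass = k, lab = lab, method = "Rnd.EM", EMC = .EMC.Rnd)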

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Users/drews01/miniconda3/lib/R/lib/libRblas.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] EMCluster_0.2-12 Matrix_1.2-18    MASS_7.3-51.5   

loaded via a namespace (and not attached):
[1] compiler_3.6.1  tools_3.6.1     yaml_2.2.1      grid_3.6.1      lattice_0.20-40

snoweye commented 4 years ago

The feature you want is not implemented. However, the alternative here may serve a similar purpose.

Martingales commented 4 years ago

If I understand the code correctly, you re-assign half the samples in each class to 0 (i.e. no class), and init.EM can then deal with K larger than the current number of clusters?

How are the cluster assignments handled internally? Are they fixed assignments that are kept until the end, do they only define a starting clustering, or do they act as a prior throughout?
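
The re-labelling described in this comment can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name mask_half_labels and the toy label vector are hypothetical and are not taken from the linked alternative, and the commented-out init.EM call simply reuses the arguments from the original report without having been verified for K larger than the number of labelled classes.

```r
# Hypothetical helper: within each class, set a random half of the labels
# to 0, the value used in this thread to mean "no class" (unlabelled).
mask_half_labels <- function(lab, seed = 1) {
  set.seed(seed)
  masked <- lab
  for (cls in unique(lab[lab > 0])) {
    idx <- which(lab == cls)
    n.drop <- floor(length(idx) / 2)
    # Index via sample.int() to avoid the sample() pitfall when idx has length 1
    drop <- idx[sample.int(length(idx), n.drop)]
    masked[drop] <- 0L
  }
  masked
}

# Toy labels standing in for da1$class (assumption: integer labels 1..3)
lab <- rep(1:3, times = c(6, 4, 5))
lab.half <- mask_half_labels(lab)

# Counts per label; 0 is the unlabelled group
print(table(lab.half))

# With EMCluster loaded, the masked labels would replace `lab` in the call
# from the report (not verified here):
# ret.Rnd <- init.EM(x, nclass = k, lab = lab.half, method = "Rnd.EM", EMC = .EMC.Rnd)
```

Samples that keep their label still carry the original class, while the masked half is left for the EM algorithm to assign.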

maitra commented 4 years ago

What you are trying to do is not allowed under the model that mixture-model-based clustering is built on. In other words, our model assumptions are not general enough. However, the problem is interesting, though not completely clear, and we will have to think about it. Would you be interested in communicating outside GitHub so that we can have a conversation? I am not sure what your suggestions will entail, so I am also cautious about how much effort will be required.

Martingales commented 4 years ago

That sounds great! As I can't see your email, could you contact me via my public GitHub profile?

I know that implementing generalised models isn't easy. At this stage it was more important for me to know whether it was a bug or a genuine limitation.

snoweye commented 4 years ago

Martingales commented 4 years ago

Thank you for the clarification!