RStudio session aborts when all samples have a label

Martingales commented 4 years ago

Hi everyone,

I use init.EM for semi-supervised clustering and I found that running init.EM breaks R under the following circumstances:

Every sample has a label larger than 0, so there are no unlabelled samples
K is larger than the number of input clusters

Background: The reason I want to use init.EM is that I suspect I have subclusters in each of my existing clusters. But I also want to allow samples from other clusters to join a subcluster. In essence, I want to use my existing labels as prior information only, rather than have them determine the final clustering. That's why I would like to have K larger than the number of existing clusters.

Error should be reproducible with the dev's own example:

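library(EMCluster)

# Example data shipped with EMCluster; every sample has a label
x <- da1$da
lab <- da1$class
# Request more clusters than there are labelled classes
k <- 12

# This call aborts the R session (EMCluster 0.2-12)
ret.Rnd <- init.EM(x, nclass = k, lab = lab, method = "Rnd.EM", EMC = .EMC.Rnd)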

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Users/drews01/miniconda3/lib/R/lib/libRblas.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] EMCluster_0.2-12 Matrix_1.2-18    MASS_7.3-51.5   

loaded via a namespace (and not attached):
[1] compiler_3.6.1  tools_3.6.1     yaml_2.2.1      grid_3.6.1      lattice_0.20-40

snoweye commented 4 years ago

The feature you want is not implemented. However, the alternative here may work for a similar purpose.

Martingales commented 4 years ago

If I understand the code correctly, you re-assign half the samples in each class to 0 (i.e. no class), and init.EM is then able to deal with K larger than the current number of clusters?

How are the cluster assignments handled internally? Are they fixed assignments that are kept until the end, do they only define a starting clustering, or do they act as a prior throughout?
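The relabelling described in this comment can be sketched as follows. This is only a minimal sketch, assuming labels are coded 1..K with 0 meaning "no label"; the helper `mask_labels` and its `frac` argument are hypothetical (not part of EMCluster), and it is not necessarily what the linked alternative does. The `init.EM` call is guarded so the snippet also runs where the package is not installed.

```r
# Hypothetical helper (not part of EMCluster): set a fraction of the labels
# in each class to 0, which is taken here to mean "unlabelled".
mask_labels <- function(lab, frac = 0.5) {
  for (cl in setdiff(unique(lab), 0)) {
    idx <- which(lab == cl)
    n_mask <- floor(length(idx) * frac)
    # Index into idx rather than calling sample(idx, ...) directly, because
    # sample() treats a length-one numeric vector x as 1:x.
    lab[idx[sample.int(length(idx), n_mask)]] <- 0
  }
  lab
}

if (requireNamespace("EMCluster", quietly = TRUE)) {
  library(EMCluster)
  x <- da1$da
  lab_semi <- mask_labels(da1$class)
  # Some samples are now unlabelled, so nclass may exceed the number of
  # labelled classes.
  ret <- init.EM(x, nclass = 12, lab = lab_semi,
                 method = "Rnd.EM", EMC = .EMC.Rnd)
}
```

The masked samples are chosen at random, so call `set.seed()` beforehand if the result needs to be reproducible.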

maitra commented 4 years ago

What you are trying to do is not allowed under the model on which mixture-model-based clustering is built. In other words, our model assumptions are not general enough. However, the problem is interesting, though not completely clear, and we will have to think about it. Are you interested in communicating outside GitHub so that we can have a conversation? I am not sure what your suggestions will entail, so I will also be cautious about how much effort will be required.

Martingales commented 4 years ago

That sounds great! As I can't see your email, could you contact me via my public GitHub profile?

I know that implementing generalised models isn't easy. At this stage it was more important for me to know whether it was a bug or a genuine limitation.

snoweye commented 4 years ago

If you intend to label all samples and also use a larger k, or to force the given assignments to be kept, then neither is implemented.
If you intend to label only some samples and use a larger k, then that is allowed.
A check has been added to raise an error.
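For users on an older version, the two cases above can be told apart before calling init.EM. The function below is only a hypothetical user-side pre-check under the assumption that 0 marks an unlabelled sample; `check_semi_supervised` is not an EMCluster function and is not the check that was added to the package.

```r
# Hypothetical pre-check (not the check added to EMCluster): stop before
# calling init.EM() with the combination reported in this issue.
check_semi_supervised <- function(lab, nclass) {
  # Number of distinct labelled classes; 0 is assumed to mean "unlabelled".
  n_labelled_classes <- length(setdiff(unique(lab), 0))
  all_labelled <- all(lab > 0)
  if (all_labelled && nclass > n_labelled_classes) {
    stop("All samples are labelled but nclass (", nclass,
         ") exceeds the number of labelled classes (", n_labelled_classes,
         "); leave some labels as 0 or reduce nclass.")
  }
  invisible(TRUE)
}
```

Calling `check_semi_supervised(lab, k)` just before `init.EM(x, nclass = k, lab = lab, ...)` would then stop with an R error for the unsupported combination.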

Martingales commented 4 years ago

Thank you for the clarification!

snoweye / EMCluster

RStudio session aborts when all samples have a label #7