mlr-org / mlr3cluster

Cluster analysis for mlr3
https://mlr3cluster.mlr-org.com
GNU Lesser General Public License v3.0

Suggest adding two clustering algorithms: consensus clustering and non-negative matrix factorization. #39

Open zhangkaicr opened 1 year ago

zhangkaicr commented 1 year ago

I am a bioinformatics PhD and I really appreciate your mlr3cluster package, which provides many unsupervised clustering algorithms. However, two of the most commonly used algorithms in bioinformatics analysis, consensus clustering and non-negative matrix factorization, are not included. Both are widely used in the medical and biological fields, and adding them would greatly broaden the package's scope of application. I also hope to cite this package in my upcoming doctoral thesis. Thank you very much.

zhangkaicr commented 1 year ago

In R, the ConsensusClusterPlus package implements consensus clustering, and the NMF package implements non-negative matrix factorization.

damirpolat commented 1 year ago

Thank you for opening the issue and suggesting additional features! Non-negative matrix factorization is already implemented in mlr3pipelines as a PipeOp. Details are here.
If you want to use it to get cluster assignments as a PredictionClust object, you can try the following:

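library(NMF)
library(mlr3)
library(mlr3cluster)
library(mlr3pipelines)

# Example clustering task shipped with mlr3cluster
task = tsk("ruspini")

# NMF PipeOp from mlr3pipelines (requires the NMF package)
nmf = po("nmf")
nmf$train(list(task))

# Wrap the assignments predicted from the fitted state in a PredictionClust
p = PredictionClust$new(task = task, partition = as.integer(predict(nmf$state)))

damirpolat commented 1 year ago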

Regarding ConsensusClusterPlus, there are a couple of unusual things about it:

  1. The input format has features in rows and observations in columns. This is the opposite of the rest of mlr3.
  2. The output is a list with one clustering result per attempted number of clusters. You can summarize it to get a consensus assignment for each item, but only separately for every attempted number of clusters. So I’m not sure which attempt (or all of them?) should be shown to users in the PredictionClust object.
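
To make the two points concrete, here is a minimal sketch of how ConsensusClusterPlus output could be mapped to a PredictionClust. It is not a proposed implementation. It assumes the task's feature matrix is simply transposed before the call, and that the user fixes one number of clusters up front. The value k = 4, the resampling settings and the Euclidean distance are illustrative assumptions only.

```r
library(mlr3)
library(mlr3cluster)
library(ConsensusClusterPlus)

task = tsk("ruspini")

# Point 1: mlr3 keeps observations in rows, ConsensusClusterPlus expects
# them in columns, so transpose the feature matrix.
d = t(as.matrix(task$data()))
colnames(d) = as.character(task$row_ids)

max_k = 6  # assumption: largest number of clusters to try
k = 4      # assumption: the single attempt to report, chosen by the user

pdf(NULL)  # send the diagnostic plots to a null device
res = ConsensusClusterPlus(
  d, maxK = max_k, reps = 50, pItem = 0.8, pFeature = 1,
  clusterAlg = "hc", distance = "euclidean", seed = 1
)
invisible(dev.off())

# Point 2: the result holds one entry per attempted number of clusters;
# consensusClass is the consensus assignment for that attempt.
partition = as.integer(res[[k]]$consensusClass)
pred = PredictionClust$new(task = task, partition = partition)
```

Under this sketch the open question becomes whether k should be a hyperparameter of the learner, or whether the other attempts in `res` should also be exposed in some way.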

Do you have any experience with this package? If we can’t address these questions, can you recommend any other packages that implement the same functionality?