Coming back to one of my original questions: would you feel OK (as an alternative to what is discussed in the papers above, which I haven't read yet) with doing an SFS with a suitable performance measure for clustering? If we allowed that, would that already help?
I think if we could pick the appropriate performance measure, this would definitely help. It would have been my first try anyway. Unfortunately my experience with clustering is very limited, so I'd appreciate any help in picking the measure.
Well, you asked why we currently disallow normal SFS with a measure for clustering: because we are really not sure whether this makes sense. I know quite a lot about supervised feature selection, but not so much about its unsupervised form.
I would really like somebody who knows more about this to weigh in, so we can offer something in mlr that is an accepted approach in this scenario, and not something we came up with in an ad-hoc fashion...
In principle there's nothing stopping us from optimising e.g. the Dunn Index (and I think this should already be possible for tuning the parameters of the learner?). Since there's no ground truth in clustering and hence you can't really do something completely wrong, I don't have anything against supporting this.
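For concreteness, here is a minimal sketch of what that would look like (assuming mlr's cluster.kmeans learner and the dunn measure; the pieces exist, but the combination is exactly what is currently blocked, so the call may need adjusting):

library(mlr)
# SFS on a cluster task, optimising the Dunn index
task = makeClusterTask(data = iris[, -5])
lrn = makeLearner("cluster.kmeans", centers = 3)
ctrl = makeFeatSelControlSequential(method = "sfs")
res = selectFeatures(lrn, task, resampling = makeResampleDesc("CV", iters = 3),
                     measures = dunn, control = ctrl)
res$x  # the selected feature set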
I've had a brief look at the sparcl package and it doesn't seem to implement a way of assigning new data points to clusters. This would need to be implemented for integration with mlr.
I think this sounds very reasonable. The Dunn index should work fine for selecting the features.
For the sparcl package: as far as I know, at least the k-means algorithm is able to assign new data points to clusters. The hierarchical clustering is not meant for this functionality. I am not sure whether the sparcl package returns a "normal" k-means object (I know it does for the hierarchical clustering). If that were the case, it should be easy to implement prediction in the package.
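For illustration, prediction could also be bolted on from the outside (a sketch; assign_to_cluster is a made-up helper, not part of sparcl, which simply assigns new points to the nearest centroid under the learned feature weights):

library(sparcl)
x = as.matrix(iris[, -5])
# sparcl returns feature weights (ws) and cluster assignments (Cs),
# but no centers and no predict method
fit = KMeansSparseCluster(x, K = 3, wbounds = 1.5)[[1]]

# hypothetical helper: nearest centroid under the weighted squared
# Euclidean distance that sparse k-means optimises
assign_to_cluster = function(newx, x, fit) {
  centers = apply(x, 2, function(col) tapply(col, fit$Cs, mean))  # K x p centroids
  d = apply(centers, 1, function(ctr) colSums(fit$ws * (t(newx) - ctr)^2))
  max.col(-d)  # index of nearest centroid per row of newx
}
assign_to_cluster(x[1:5, ], x, fit)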
Another thing I just read about: you can actually use random forests in unsupervised mode to do clustering. Would it be possible to include this functionality? Then the random forest importance could be used as a filter, as in the supervised learning case...
I am just throwing out ideas here. If you think they are garbage, just tell me.
What exactly is "use random forests in unsupervised mode"?
I read about it here.
The idea is explained in the second answer. I also read about it in several other posts, but there were just too many of them to keep track of them all. However, this seems to be a fairly common strategy, so I would assume this functionality is already implemented somewhere in R, preferably of course in the randomForest package you are using in mlr.
The unsupervised mode creates a set of synthetic data by a univariate bootstrap of the features (which breaks any dependence structure between the features), creates a label ("synthetic"/"real"), and then predicts this label using a random forest. Then you can do clustering using some sort of decomposition of the proximity matrix (the 1:n entries, which correspond to the real data), which gives the proportion of times the i-th and j-th observations in the real data co-occurred in the same terminal node. I guess you can get a permutation importance from this as well.
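To make the construction concrete, a hand-rolled sketch (not using any package's built-in unsupervised mode):

library(randomForest)
x = iris[, -5]
# univariate bootstrap of each feature: marginals preserved, dependence destroyed
synth = as.data.frame(lapply(x, sample, replace = TRUE))
y = factor(rep(c("real", "synthetic"), each = nrow(x)))
fit = randomForest(rbind(x, synth), y, proximity = TRUE, importance = TRUE)
prox = fit$proximity[seq_len(nrow(x)), seq_len(nrow(x))]  # real-vs-real entries only
imp = importance(fit, type = 1)  # permutation importance, as mentioned above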
@zmjones Do you know if any R package already implements this? Sounds like it would be a non-trivial amount of work to implement ourselves.
randomForest does it, e.g.:
library(randomForest)
data(iris)
# unsupervised mode: omit the response y; randomForest then internally
# classifies the real data against synthetic data
fit = randomForest(iris[, -ncol(iris)], proximity = TRUE)
fit$proximity  # n x n: proportion of trees in which i and j share a terminal node
....
Followed by some decomposition of the resultant matrix.
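For example (just a sketch; picking the decomposition is exactly the open question):

# continuing from fit$proximity above: treat 1 - proximity as a dissimilarity
d = as.dist(1 - fit$proximity)
hc = hclust(d, method = "average")   # hierarchical clustering on the RF dissimilarity
cl = cutree(hc, k = 3)
# or embed first and cluster the coordinates
coords = cmdscale(d, k = 2)          # classical MDS
cl2 = kmeans(coords, centers = 3)$cluster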
I have an implementation of it that works with the other packages I am working on now, but it will probably be a while before that ends up on CRAN.
It is described (poorly, IMO) in this paper. As far as I am aware, nothing else has been written about the method in particular.
Would it be feasible to port this to mlr, or is your package going to expose this in some way we can use from mlr?
Yeah, when I have it on CRAN I will integrate it. We can use the canonical implementation in randomForest without my stuff, though. I guess the trick with using it for clustering is going to be choosing a good method for decomposition/clustering of the proximity matrix. Then we can just call your new classification-via-clustering function.
Wait, wouldn't this work the other way round? I.e. clustering via classification.
Well, the point of the unsupervised random forest is to get an RF measure of similarity between observations using only the features, which is then usually decomposed and used for clustering. I am not sure what you mean by clustering via classification. You mean learning the random forest classifier using the target feature, then computing the proximity matrix and decomposing that for clustering? That wouldn't really be unsupervised.
Well, I'm just not sure what you mean by the last sentence in your previous comment. I don't see how classification via clustering would be used in this context.
"That wouldn't really be unsupervised"? I am confused about what you are confused about :)
"Then we can just call your new classification via clustering function."
I am probably off my rocker. I don't know why you would want to do classification this way, sorry.
What I meant was that if you can do clustering with the RF in this way (by applying a decomposition method to the unsupervised similarity matrix), then you could plug this into the classif via clustering function. Does that make more sense?
So then the end goal would be to do classification? Sorry I'm slightly lost.
Yes, you could do classification with the RF clustering algorithm, either by applying something like KNN directly to the proximity matrix or by decomposing it with something else and then plugging it into your function. Like I said though, I am off my rocker. I don't think that would be ideal: you would just do classification using the RF, which I suspect would be superior in all cases.
OK ... after the general confusion last week there seems not to have been any further development on this matter. I have not yet tried the unsupervised random forest either, but have pursued a somewhat different path: using sparse PCA as a step before the clustering to reduce dimensionality.
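In case it is useful to anybody, that path looks roughly like this (a sketch assuming elasticnet::spca; any sparse PCA implementation should do):

library(elasticnet)
x = scale(as.matrix(iris[, -5]))
# sparse PCA: at most 2 non-zero loadings per component (sparse = "varnum")
sp = spca(x, K = 2, para = c(2, 2), type = "predictor", sparse = "varnum")
scores = x %*% sp$loadings  # project onto the sparse components
cl = kmeans(scores, centers = 3)$cluster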
I was wondering if you had any further ideas?
Hi mlr-Experts,
I am attaching an e-mail conversation I had with Bernd at the bottom so that we can get a little more input on the matter.
Here are the core points in English:
If you have any ideas or input on the matter, it would be very helpful.
Thanks, Sebastian
Here is the email conversation between Bernd and me (sorry, it's all in German):
On 22.10.2015 10:16, Sebastian Wandernoth wrote: