Featureselection/reduction for Clustering

johndoe1982 commented 9 years ago

Hi mlr-Experts,

I am attaching an E-mail conversation I had with Bernd at the bottom so that we can get a little more input to the matter.

Here the core points in English:

I am new to Clustering algorithms and am currently faced with a task to group together data sets using unsupervised learning algorithms for the first time
My data sets consist of a large amount of features with relatively small amount of cases
The assumption is that only a small amount of features will actually be helpful to the clustering
So I need to do a feature selection (similar to for example an "sfs" in case of a supervised analysis)
I realized that this functionality is not (yet) implemented in mlr
It seems not to be trivial to pick a suitable performance measure
researching a bit on the net I found this paper (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2930825/pdf/nihms201124.pdf) which deals with this issue by a method called sparse clustering
It is implemented in R in the package "sparcl". I am currently trying it our a bit to see, if it is something worth including in mlr

If you have any ideas/input to the matter, it would be very helpful

Thanks Sebastian

Here the email conversation between Bernd and me (Sorry, it's all German):

On 22.10.2015 10:16, Sebastian Wandernoth wrote:

Hallo Bernd,

Das "Du" geht vollkommen in Ordnung. Ich wollte es nur nicht gleich in der ersten E-Mail voraussetzen.

Wir können die Diskussion auch gerne auf Github verschieben. Ich kann ein neues "issue" aufmachen und die E-Mail Konversation anhängen.

-----Ursprüngliche Nachricht----- Von: Bernd Bischl [mailto:bernd_bischl@gmx.net] Gesendet: Mittwoch, 21. Oktober 2015 18:02 An: Sebastian Wandernoth Betreff: Re: Featureselketion bei Clusteringalgorithmen im mlr-Paket

Hi,

Vorbemerkung1: Statt in einer Email wäre es wirklich besser, wenn wir das auf Github diskutieren würden. mlr wird aktuell von >= 6 Personen entwickelt, so bekommen alle mit worum es hier geht. Und es gibt eine höhere W'keit für bessere Antworten ;-)

Vorbemerkung2: Ich schreib jetzt einfach mal "Du" und biete das im Gegenzug auch an. Passt das?

On 21.10.2015 15:37, Sebastian Wandernoth wrote:

Hallo Herr Bischl,

Ich nutze das mlr-Paket um explorativ verschiedene Datensätze in Verbindung zu setzen. Ich beziehe mich hierbei auf die Version 2.4 , welches die neueste release Version des mlr-Paketes sein sollte.

Ok. Ja, 2.5 hätte schon released sein sollen, aber ich bin aktuell krank / verletzt. Aber ich arbeite dran. (Hat aber mit dem Thema auch nicht wirklich viel zu tun)

Das Problem ist, dass meine Datensätze zu Beginn sehr viele Features bei vergleichsweise kleinen Fallzahlen haben. Die meisten dieser Features halten keine oder nur sehr wenig Information für die jeweilige Fragestellung.

Ok, kenne ich. Normalerweise führe ich Klassifizierungsaufgaben mit Hilfe überwachter Algorithmen durch, habe also Meta-Information zu der Gruppenzugehörigkeit während des Trainingsprozesses. Hierfür reduziere ich meine Features zunächst mit Hilfe eines einfachen random.forest.importance Filters und führe dann eine sequentielle Vorwärtsselektion auf dem reduzierten Set durch. Dies funktioniert eigentlich sehr gut.

Klar + hört sich sinnvoll an.

In einer neuen Fragestellung möchte ich allerdings die Datensätze unüberwacht mit Hilfe von Cluster-Algorithmen gruppieren. Ich habe gesehen, dass es zur Zeit weder einen Filter noch eine sequentielle Vowärtsselektion für Cluster-Algorithmen im mlr-Paket gibt. Gibt es hierfür einen tieferen Grund? Es gibt doch Performance Measures für Cluster-Algorithmen. Daher sollte es doch prinzipiell möglich sein diese für eine Featureselektion zu nutzen. Ich weiß, dass Sie die Cluster-Algorithmen erst relativ spät mit in das Paket aufgenommen haben. Ist diese Funktionalität einfach noch nicht integriert oder gibt es andere Gründe, die dagegen sprechen eine Featureselektion für ein Clustering durchzuführen? Gibt es eine Alternative?

Eigentlich gibt es keinen echten Grund. Und es wäre sehr einfach umzusetzen. Die meisten Verfahren in mlr würden sogar "out-of-the-box" dann funktionieren.

Ich habe ein bißchen "Bauchschmerzen" bei dem Performance-Maß. Wenn man dort eins für eine spezielle Anwendung festlegen kann, klar, dann funktioniert alles wie immer. Wenn nicht, geht es halt nicht (enfach). Meine persönliche Meinung: Man müsste vielleicht bei Cluster-Szenarien viel mehr "Multikriteriell" denken. Das kann mlr bis zu einem gewissen Grad auch schon. Was ich meine ist: Vielleicht ist es schwierig, ein Maß zu finden was man optimieren will. Aber vielleicht kann man 2-3 definieren, die einen sinnvollen Trade-Off definieren.

Aber lassen wir das mal und diskutieren mal konkret zu Deiner direkten Frage:

Welche Maße würdest Du denn gerne in der Feat-Sel optimieren bei deiner Anwednung? [Sebastian Wandernoth] Ich bin noch relativ neu beim Thema Clustering. Die Maße, mit denen ich bisher zu tun hatte waren Davies-Bouldin, Dunn und Silhouettenkoeffizient. Ich kann aber auch noch nicht abschätzen welcher dieser Indizes am besten für so eine Featureselektion geeignet wäre. Wahrscheinlich hast Du Recht und man bräuchte eine Kombination aus 2 oder 3 Maßen.

Kennst Du FilterVerfahren (in R) die dann noch verwendbar wären?

Kennst Du Paper, in denen so etwas ähnliches, wie das was wir hier diskutieren, evaluiert / beschrieben worden ist? [Sebastian Wandernoth] Ich bin bei meiner Recherche gestern auf dieses Paper gestoßen:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2930825/pdf/nihms201124.pd f

Hier wird Sparse-Clustering beschireben und hierfür gibt es auch ein Paket in R (sparcl). Ich probier es gerade einmal aus. Meine erste Idee war eine Hauptkomponentenanalyse vor dem Clustering durchzuführen um die Featureanzahl zu reduzieren, aber diese Idee wird gleich am Anfang von Section2 dieses Papers abgetan mit dem mir einleuchtenden Argument, dass die Hauptkomponenten, die die größte Varianz zwischen den Datensätzen beschreiben, nicht unbedingt die sein müssen, die am besten zum Clustern geeignet sind. Außerdem ist es schwierig die Hauptkomponenten wieder auf die ursprünglichen Features zu beziehen, da sie alle Funktionen von allen Features sind.

Ich spiele im Moment ein bisschen mit dem sparcl-Paket herum und kann Dir dann hoffentlich bald sagen, ob es funktioniert und eventuell wert wäre in mlr aufgenommen zu werden.

Ein weiteres Paper, auf das ich gestoßen bin, beschäftigt sich mit Dimensionsreduktion mit Hilfe von neuronalen Netzen

http://webdocs.cs.ualberta.ca/~nray1/CMPUT615/Zip/Deep_NN.pdf

Ich habe mich hier noch nicht komplett reingelesen, aber ich denke was Interpretierbarkeit der resultierenden Features angeht, wird das mit NN nicht unbedingt einfacher.

Viele Grüße, Sebastian

Grüße

Bernd

berndbischl commented 9 years ago

Coming back to one of my orig questions: Would you feel OK (as an alternative to what is discussed above in the papers, that I havent read yet) to do an sfs with a certain perf measure to clustering? If we allow that would that already help?

johndoe1982 commented 9 years ago

I think if we could pick the appropriate performance measure, this would definitely help. It would have been my first try anyway. Unfortunately my experience with clustering is very limited. So I'd appreciate any help in picking the measure.

berndbischl commented 9 years ago

Well you asked for the reason why we currently disallowed normal sfs with a measure for clustering. Because we are really not sure whether this makes sense. I know quite a lot about supervised feature selection, but not so much about in its unsupervised form.

I would really like for somebody to "weigh in" who knows more about this, so we can offer something in mlr which is an accepted approach in this scenario, and not something we come up with in an adhoc fashion....

larskotthoff commented 9 years ago

In principle there's nothing stopping us from optimising e.g. the Dunn Index (and I think this should already be possible for tuning the parameters of the learner?). Since there's no ground truth in clustering and hence you can't really do something completely wrong, I don't have anything against supporting this.

I've had a brief look at the sparcl package and it doesn't seem to implement a way of assigning new data points to clusters. This would need to be implemented for integration with mlr.

johndoe1982 commented 9 years ago

I think, this sounds very reasonable. The Dunn Index should work fine for selecting the features.

For the sparcl package: As far as I know at least the KMeans-algorithm is able to give predictions to assign new data points to clusters. The hierarchical clustering is not meant for this functionality. I am not sure if the sparcl package gives back a "normal" KMeans-object (I know it does for the hierarchical clustering). If that was the case it should be easy to implement the prediction into the package.

johndoe1982 commented 9 years ago

another thing I just read about is that you can actually use random.forests in unsupervised mode to do clustering. would it be possible to include this functionality? then the random.forest.importance could be used as a filter as in the supervised learning case...

I am just throwing out ideas here. If you think they are garbage, just tell me.

berndbischl commented 9 years ago

what exactly is "use random.forests in unsupervised mode" ?

johndoe1982 commented 9 years ago

I read about it here

http://stats.stackexchange.com/questions/39024/how-to-reduce-the-number-of-variables-in-cluster-analysis?rq=1

In the second answer the idea is explained. I also read about it in several other posts, but there were just too many of them to keep track of all. However, this seems to be a fairly common strategy, so I would assume that this functionality should be implemented "somewhere" in R already, preferably of course in the random.forest package you are using in mlr.