Clustering vs Subgroup discovery?

amueller commented 7 years ago

I just saw there's a task "subgroup discovery", but it's not explained on the website: https://www.openml.org/search?type=task_type

Is that clustering or something else? Can we add clustering? That seems like a very natural task to add.

berndbischl commented 7 years ago

> Can we add clustering? That seems like a very natural task to add.

+1

joaquinvanschoren commented 7 years ago

Agreed. Let's discuss during the hackathon.

On Thu, 23 Feb 2017 at 22:10, Bernd Bischl wrote:

> > Can we add clustering? That seems like a very natural task to add.
>
> +1


amueller commented 7 years ago

As I won't be there: I think doing supervised evaluation with ARI and NMI on classification would be cool, but maybe also offering unsupervised evaluation, maybe using some consensus scores?

I would get rid of the cross-validation for this task if that's possible.
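
A minimal sketch of the supervised evaluation suggested above: the adjusted Rand index (ARI), written out in plain Python so it runs without extra packages. The function name and the toy labels are illustrative assumptions, not anything OpenML provides; scikit-learn's `adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between class labels and cluster labels."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # zero or one sample: treat as trivially perfect

    # Contingency counts and their marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of sample pairs that are grouped together.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in rows.values())
    together_pred = sum(comb(c, 2) for c in cols.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial and identical
    return (together_both - expected) / (maximum - expected)


# Cluster ids are arbitrary, so a relabelled but identical partition scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the score only depends on which samples are grouped together, it needs no mapping between cluster ids and class names.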

berndbischl commented 7 years ago

> I think doing supervised evaluation with ARI and NMI on classification would be cool, but maybe also offering unsupervised evaluation, maybe using some consensus scores?
>
> I would get rid of the cross-validation for this task if that's possible.

R (and mlr) offer quite a few standard indices for cluster evaluation. Note that resampling is not completely useless for clustering: one also considers the stability of these indices under resampling.

I do think that we potentially need to change and adapt a few things here.

joaquinvanschoren commented 7 years ago

We had this discussion a while ago during one of the first Hackathons.

There are 2 types of evaluation measures:

External measures: These require the 'true' cluster labels. To evaluate, one typically does something similar to bootstrapping: take 50 bootstraps and evaluate the produced clusters.
Internal measures: If you don't have the true labels, you typically use some internal measure of how 'clean' the clusters are. There is a whole bunch of them:
http://stats.stackexchange.com/questions/21807/evaluation-measure-of-clustering-without-having-truth-labels
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
http://jwijffels.github.io/RMOA/MOA_2014_04/doc/pdf/Tutorial3.pdf

I think that we will need 2 types of tasks, because one will have the ground truth and the other one not. The internal measures can always be computed. We should also always record the number of clusters generated, as that is something you often want to study, plus there are some measures for finding the optimal number of clusters (e.g. silhouette), and we probably want to support that, too.
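
To make the internal-measure idea concrete, here is a rough sketch of the mean silhouette score in plain Python, reported together with the number of clusters for two candidate clusterings. The function, the toy points and the candidate labelings are assumptions for illustration only; scikit-learn ships an equivalent `silhouette_score`.

```python
from collections import defaultdict
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widest = max(a, b)
        scores.append((b - a) / widest if widest > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 1-D data with two obvious groups (made up for this example).
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
candidates = {
    "two clusters": [0, 0, 1, 1],
    "three clusters": [0, 0, 1, 2],
}
for name, labels in candidates.items():
    score = silhouette_score(points, labels)
    print(name, "| clusters:", len(set(labels)), "| silhouette:", round(score, 3))
```

No ground truth is used anywhere, so a score like this could be recorded for every submitted clustering alongside the cluster count.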

We currently only have a task for the supervised case. This was started a few years ago, and I had a student who was going to implement the evaluation in the backend, but he quit fairly early so nothing much is done here. What we do have is a format for uploading clusterings, and some other basic stuff.

Maybe we can sit together and rethink the evaluation engine a bit. I have been thinking for a long time now that we should move to a more flexible system of bots (services). I.e. there would be WEKA/mlr/scikit-learn evaluation bots that just wait for new experiments to be submitted, and then evaluate them and submit the results to the OpenML API. That would be a lot more extensible/maintainable/scalable than the current Java system. We would need to think a bit about hiding duplicate evaluations, but that's a very minor issue.

Cheers, Joaquin
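
A very rough sketch of the bot idea described above, under heavy assumptions: the run records, the measure registry and the `submit` callback are all hypothetical stand-ins, and no real OpenML API call is made. It only shows the shape of "wait for runs, evaluate, submit, skip duplicates".

```python
from typing import Any, Callable, Dict, Iterable, List, Set, Tuple

Run = Dict[str, Any]               # hypothetical run record: {"id": ..., "predicted": [...]}
Measure = Callable[[Run], float]   # maps one run to one score
Result = Tuple[int, str, float]    # (run id, measure name, value)


def number_of_clusters(run: Run) -> float:
    """Count the distinct cluster labels in a submitted clustering."""
    return float(len(set(run["predicted"])))


def evaluate_pending(
    runs: Iterable[Run],
    measures: Dict[str, Measure],
    done: Set[Tuple[int, str]],
    submit: Callable[[Result], None],
) -> List[Result]:
    """Evaluate every (run, measure) pair that has not been handled yet."""
    results = []
    for run in runs:
        for name, measure in measures.items():
            key = (run["id"], name)
            if key in done:
                continue  # avoid duplicate evaluations
            result = (run["id"], name, measure(run))
            submit(result)  # stand-in for uploading the score to the server
            done.add(key)
            results.append(result)
    return results


# One polling step of a bot; a real service would repeat this on new runs.
done: Set[Tuple[int, str]] = set()
uploaded: List[Result] = []
pending = [{"id": 1, "predicted": [0, 0, 1, 2]}]
measures = {"number_of_clusters": number_of_clusters}
print(evaluate_pending(pending, measures, done, uploaded.append))
```

Each bot (WEKA, mlr, scikit-learn) would plug its own measures into the registry; the `done` set is one simple way to keep repeated polls from resubmitting the same score.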

On Fri, Feb 24, 2017 at 1:36 PM Bernd Bischl wrote:

> > I think doing supervised evaluation with ARI and NMI on classification would be cool, but maybe also offering unsupervised evaluation, maybe using some consensus scores?
> >
> > I would get rid of the cross-validation for this task if that's possible.
>
> R (and mlr) offer quite a few standard indices for cluster evaluation. Note that resampling is not completely useless for clustering: one also considers the stability of these indices under resampling.
>
> I do think that we potentially need to change and adapt a few things here.


amueller commented 5 years ago

Subgroup discovery is still not explained on the website. @HeidiSeibold maybe you want to add a short description?

HeidiSeibold commented 5 years ago

I don't know what it is either. Isn't that something that @janvanrijn or @mfeurer worked on?

mfeurer commented 5 years ago

I usually stick to supervised learning ;)

amueller commented 5 years ago

Oh must have mixed up something there, sorry. @joaquinvanschoren or @berndbischl should know though :P

janvanrijn commented 5 years ago

I once created the task type. It is a long-standing open project with Arno Knobbe and Marvin Meeng.

joaquinvanschoren commented 5 years ago

> @joaquinvanschoren or @berndbischl should know though :P

Yes, but Jan is the expert :) Jan, can you add a description to the task type? https://www.openml.org/tt/8

Cheers, Joaquin

janvanrijn commented 5 years ago

I added a small description to the database.

amueller commented 5 years ago

thanks @janvanrijn

joaquinvanschoren commented 5 years ago

For the record, I ran into this great overview of clustering measures: https://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf

It's probably useful to have it handy. Some of these are only available in R, I believe.

amueller commented 5 years ago

hm well it lists all of them, but it doesn't say anything on whether they are useful at all, right?

openml / OpenML

Clustering vs Subgroup discovery? #371