mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

cluster: sometimes not all cluster are assigned to newdata cases, can lead to problems in performance #159

Closed berndbischl closed 10 years ago

berndbischl commented 10 years ago
devtools::load_all(".")  # load the development version of mlr

task = makeClusterTask(data = iris[,1:4])
lrn = makeLearner("cluster.Cobweb")
m = train(lrn, task)
pred = predict(m, task = task)
p = performance(pred, task = task, measures = db)
print(pred$data$response)
> p = performance(pred, task = task, measures = db)

Warning in max(R[i, ][is.finite(R[i, ])]) :
  no non-missing arguments to max; returning -Inf

> print(pred$data$response)
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 2 2 3 2 3 3 2 3 2 3 2 2 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 2 2 2 2 3 3 2 2 3 3 2 2 2 2 3 3
[101] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

In the code above it does not really make sense for Cobweb to label the cases with clusters 2 and 3 only, but never with cluster 1.

But the problem can of course occur every time we only predict a subset of the data, or just certain observations in a region of the X-space.
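
Below is a minimal sketch of that failure mode on a subset of the data; it assumes the same mlr API as the example above (cluster.Cobweb, the db measure), and the concrete subset is purely illustrative.

library(mlr)

task = makeClusterTask(data = iris[, 1:4])
lrn = makeLearner("cluster.Cobweb")
m = train(lrn, task)

# predict only observations from one region of the X-space: even a correct
# learner may then use only some of its clusters, and internal measures such
# as Davies-Bouldin (db) can warn or return NaN for the empty clusters
pred.sub = predict(m, task = task, subset = 1:50)
print(table(pred.sub$data$response))
performance(pred.sub, task = task, measures = db)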

larskotthoff commented 10 years ago

I would leave this as is. The clustering of the training data gives you a certain number of clusters, each of which has a well-defined meaning. This is part of the information you want from a clustering and meaningful for the application. Reassigning clusters in predict messes this up.

For example in the case you posted there's nothing in cluster 1. This is important information. "Fixing" the cluster numbers means that it appears that many things are in clusters 1 and 2, but not in cluster 3. This would change the meaning of the clustering completely.

berndbischl commented 10 years ago

I discussed this with Michel.

1) I am sure Cobweb is buggy.

task = makeClusterTask(data = iris[, -5L])
lrn = makeLearner("cluster.Cobweb")
m = train(lrn, task)
mm = m$learner.model
p = predict(m, task = task)

> print(mm)
Number of merges: 0
Number of splits: 0
Number of clusters: 3

node 0 [150]
|   leaf 1 [96]
node 0 [150]
|   leaf 2 [54]

> print(table(mm$class_ids))

 1  2 
81 69 

> print(p$data$response)
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 2 2 3 2 3 3 2 3 2 3 2 2 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 2 2 2
 [88] 2 3 3 2 2 3 3 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

I am betting that someone thought that something is 0-based and added 1 to the cluster predictions. The cluster number in the print is then max(cluster_inds). So in the example above we really have 2 clusters, but the class_ids get shifted so that the data$response values come from {2, 3}, and the printer says "3 clusters".
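
A small illustration of the suspected off-by-one (the vector below just mirrors the counts printed above; the +1 is the one assumed to sit in the mlr wrapper): if the class_ids are already 1-based, adding 1 shifts the labels and inflates the apparent cluster count.

class_ids = c(rep(1L, 81), rep(2L, 69))  # what table(mm$class_ids) showed above
response = class_ids + 1L                # the suspected +1 in the mlr wrapper
print(table(response))                   # labels 2 and 3 only, cluster 1 stays empty
print(max(response))                     # 3, although only 2 clusters are actually used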

For this reason I will now move Cobweb to the todo-files.

berndbischl commented 10 years ago

Ok, actually one of you guys added the +1 in the mlr code without checking... git blame says it was Lars.

berndbischl commented 10 years ago

2) We need a method "getClusterNumber" which is implemented for every model, or we should store the number of clusters in the WrappedModel. One always wants this.

3) When we predict on newdata and get fewer clusters in the resulting prediction than the model itself produced, we probably need a warning or an error, because otherwise you can run into the problem I posted above. A rough sketch of both ideas follows below.
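
A rough, purely hypothetical sketch of both points (the names getClusterNumber and checkPredictionClusters do not exist in mlr; only predict() is real API):

# count the clusters a model uses by re-predicting and counting distinct labels;
# a learner-specific method could instead read the count from the fitted model
getClusterNumber = function(model, task) {
  p = predict(model, task = task)
  length(unique(p$data$response))
}

# warn when a prediction uses fewer clusters than the model produced
checkPredictionClusters = function(pred, n.clusters) {
  used = length(unique(pred$data$response))
  if (used < n.clusters)
    warning(sprintf("Prediction uses only %i of the %i clusters of the model.", used, n.clusters))
}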

larskotthoff commented 10 years ago

Yes, I've added this because some clustering algorithms return 0-based indices, which break stuff in other places. I would much prefer to have factors there instead of numbers, but unfortunately that breaks yet more stuff in other places.

We can't really check whether indices are 0-based and add 1 only in those cases, for the reasons outlined above. How about just changing the way the number of clusters is determined (count the unique labels)?
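
For illustration (counts taken from the Cobweb predictions above): counting distinct labels instead of taking the largest label avoids inflating the count for shifted or non-consecutive labelings.

response = c(rep(2L, 81), rep(3L, 69))  # the Cobweb predictions from above
print(max(response))                    # 3 -- what a max-based count reports
print(length(unique(response)))         # 2 -- the number of clusters actually used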

berndbischl commented 10 years ago

I am not sure I follow you. Cobweb is just a bug right? So we remove the +1.

For getting the number of clusters we have to ask the model. Well we could call internally "predict" and count the labels.

Maybe that is better, but it adds a slight speed overhead. Although I don't really care.

larskotthoff commented 10 years ago

Ok, if you're sure that this is a bug with Cobweb, then it should be fine. I'm not 100% sure because it would be inconsistent with the rest of RWeka (but admittedly this behaviour isn't documented anywhere as far as I can see).

berndbischl commented 10 years ago

Can you please try 1-2 more examples and play around with it? I just looked at the one, and "sure" is overstating it.

larskotthoff commented 10 years ago

I've checked a few cases and didn't get any where the lowest cluster number was 1. However, looking at the source code the lowest number is definitely 0 (so we would need to add 1). Is this something you have observed only with Cobweb or with other clusterers as well?

berndbischl commented 10 years ago

Uuh.

I only checked Cobweb, and that was accidental. Can you run at least the same example with the other algorithms and check?

larskotthoff commented 10 years ago

Works fine with everything else for me. Based on my review of the Cobweb code, I don't feel comfortable removing the +1 from the code -- at least it technically works at the moment, but things will break if the clustering ever contains a 0. Close and reopen if this actually causes problems?

berndbischl commented 10 years ago

If the current situation for Cobweb is "fishy and nobody knows" ---> move it to the todo-files.

When somebody is sure it works as intended, it can be added again.

I also think it makes sense to check in the unit tests that all cluster labels are actually produced when we predict again on the training data.
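
A sketch of such a test in the testthat style used by mlr's test suite (the learner ids below are just examples; adjust to the clusterers actually registered):

library(testthat)

test_that("re-predicting the training data produces all cluster labels", {
  task = makeClusterTask(data = iris[, 1:4])
  for (id in c("cluster.SimpleKMeans", "cluster.EM")) {
    m = train(makeLearner(id), task)
    p = predict(m, task = task)
    labels = sort(unique(p$data$response))
    # labels should be exactly 1, ..., k with no gaps
    expect_equal(as.integer(labels), seq_along(labels))
  }
})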

larskotthoff commented 10 years ago

Well it depends on how you define "fishy" :)

It works as intended in that it clusters. The fact that the cluster labels are not consecutive is a (minor) technical detail that I think is not particularly relevant in practice. So my inclination would be to leave it in for now. If you disagree, feel free to move though (or make a branch).

berndbischl commented 10 years ago

I have removed Cobweb for now. I would like to be really sure about the stuff we release. It is no problem to add it back in a later release.