mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Help required for cost-sensitive classification task #109

Closed. SchroederFabian closed this issue 10 years ago.

SchroederFabian commented 10 years ago

Hello everyone, I am interested in cost-sensitive classification and I am new to the 'mlr' package. Maybe I can open with a few questions to start a discussion. Please correct my code if I am doing anything wrong.

QUESTION A) For the SVM, the following two approaches are not equivalent: 1) train an SVM, estimate posterior probabilities, and shift the decision threshold according to the cost matrix; 2) train the SVM taking the cost structure into account during optimization. The thresholding approach is valid for LDA and related classifiers (naive Bayes, shrunken centroids, etc.) but not for the SVM. For further insight on this issue, see http://www.di.ens.fr/~fbach/bach06a.pdf. Could your object-oriented approach to cost-sensitive classification deal with these different approaches?

QUESTION B) How can I make resampling cost-sensitive? I have tried changing the performance measure, but this does not seem to be the right way to do it (see the cross-validation code below).

QUESTION C) It has been mentioned in the literature that posterior probability estimates can be quite poor. Usually, ensemble methods such as bagging or boosting are applied to improve these estimates, see e.g. MetaCost. Why is the BaggingWrapper (I love the object-oriented approach to this ensemble method) not capable of estimating the probabilities?

# set parameters (case.control, diab and dat are my data)
library(mlr)

task <- makeClassifTask(id = 'dat',
                        data = data.frame(frac = case.control, diab = diab, dat),
                        target = 'frac', positive = 'case')
class.a <- makeLearner('classif.svm', predict.type = 'prob')
cv10 <- makeResampleDesc('CV', iters = 10)

# rows = true class, columns = predicted class:
# misclassifying a 'case' costs 5, misclassifying a 'control' costs 1
cost.matrix <- matrix(c(0, 5, 1, 0), ncol = 2, byrow = TRUE,
                      dimnames = list(c('case', 'control'), c('case', 'control')))
assc <- makeCostMeasure(id = 'dat', minimize = TRUE, costs = cost.matrix,
                        task = task, aggregate = sum)
# Bayes-optimal threshold for the positive class: C- / (C+ + C-) = 1 / (5 + 1)
threshold <- c(case = 1/6, control = 5/6)

# train model and compare default vs. thresholded predictions

mod.a <- train(class.a, task)
pred.a <- predict(mod.a, task)
pred.b <- setThreshold(pred.a, threshold = threshold)
pred.a$data$prob.case == pred.b$data$prob.case  # all TRUE: setThreshold leaves the probabilities untouched
pred.a$data$response == pred.b$data$response    # FALSE wherever the new threshold flips the predicted label

# 10-fold cross-validation: does the measure change anything besides the evaluation?

set.seed(123)
res.a <- resample(class.a, task, cv10, measures = assc, show.info = FALSE)
set.seed(123)
res.b <- resample(class.a, task, cv10, measures = acc, show.info = FALSE)
res.a$pred$data$prob.case == res.b$pred$data$prob.case  # all TRUE
res.a$pred$data$response == res.b$pred$data$response    # all TRUE: the measure does not affect training
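
Only the aggregated performance values differ between the two runs (a quick check; $aggr holds the aggregated measure values of a ResampleResult):

res.a$aggr  # aggregated misclassification costs
res.b$aggr  # aggregated accuracy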

Thank you in advance, and I am looking forward to a discussion. Fabian

berndbischl commented 10 years ago

Hi,

I will address this point by point. The result should IMHO be at least a short guide that we can put into the tutorial / wiki so people can look this up later.

QUESTION A)

For the SVM, the following two approaches are not equivalent: 1) train an SVM, estimate posterior probabilities, and shift the decision threshold according to the cost matrix; 2) train the SVM taking the cost structure into account during optimization. The thresholding approach is valid for LDA and related classifiers (naive Bayes, shrunken centroids, etc.) but not for the SVM. For further insight on this issue, see http://www.di.ens.fr/~fbach/bach06a.pdf. Could your object-oriented approach to cost-sensitive classification deal with these different approaches?

Let's make the setting a bit clearer in a formal way, OK?

We are considering the following scenario:

- a binary classification problem with a positive and a negative class,
- fixed asymmetric misclassification costs C+ and C-, where C+ is the cost incurred when a positive instance is misclassified and C- the cost when a negative instance is misclassified (correct predictions cost nothing),
- the goal of minimizing the expected misclassification cost.

Is that correct so far? Then I would start to show some tools in mlr to help with that and address the rest later.
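
As a first concrete tool: the cost measure you already defined can be used to evaluate any prediction object directly (a sketch reusing the objects from your code above; passing the task explicitly does not hurt):

performance(pred.a, measures = assc, task = task)
performance(pred.b, measures = assc, task = task)  # thresholding should reduce the costs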

SchroederFabian commented 10 years ago

Yes, we are dealing with a binary classification problem (let's refer to the positive and the negative class) with given asymmetric costs C+ and C-, where C+ is the cost induced when a positive instance is misclassified. We are using an SVM with a linear kernel. From a decision-theoretic point of view, the approach is to minimize the expected costs. However, the SVM does not yield estimates of these probabilities directly. The result of the optimization problem is a linear decision function f(x) = sign(w'x + b), where w and b are referred to as slope and intercept, respectively. The probability estimates are then derived from the scores, which are proportional to the perpendicular distance to the class boundary.
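
For illustration, with a linear kernel the slope and intercept can be read off the fitted e1071::svm object underlying the mlr model (a sketch; note that e1071 scales the data by default, so w and b refer to the scaled feature space):

fit <- getLearnerModel(mod.a)   # the underlying e1071::svm fit
w <- t(fit$coefs) %*% fit$SV    # slope: weighted sum of the support vectors
b <- -fit$rho                   # intercept
# decision value for a (scaled) observation x: w %*% x + b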

The libsvm implementation of the SVM, which you are using, states that it computes the probability estimates via five-fold cross-validation (Platt scaling fitted on the decision values), whatever that means exactly.

Thus, using the probability estimates to account for asymmetric costs means: 1) solving the convex optimization problem with symmetric costs (C+ = C-), yielding w and b; 2) estimating probabilities from the scores; 3) setting a threshold to classify the instances. Geometrically, step 3 amounts to shifting the intercept b of the linear boundary.
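
For the costs above (C+ = 5, C- = 1), this threshold follows directly from minimizing the expected cost: predict the positive class whenever p(+|x) * C+ > (1 - p(+|x)) * C-, i.e. whenever p(+|x) > C- / (C+ + C-) = 1/6, which is exactly the threshold used in the code above.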

Therefore, once the optimization has been performed (with symmetric costs), only the intercept is altered. A more correct approach would be to optimize the objective function under the asymmetric cost structure (C+ ≠ C-) over both w and b.

As far as I understand the libsvm code, one can set the per-class costs C+ and C- directly via class weights.
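
In mlr this should be reachable through the learner's hyperparameters (a sketch, assuming classif.svm passes e1071's class.weights argument through to libsvm's per-class cost option):

# penalize training errors on 'case' five times as heavily as those on 'control'
class.b <- makeLearner('classif.svm', predict.type = 'prob',
                       par.vals = list(class.weights = c(case = 5, control = 1)))
mod.b <- train(class.b, task)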

berndbischl commented 10 years ago

Fabian,

as we discussed most of this on Skype just now, I will close this to keep the issue list at a manageable size. But I will keep in mind to write up more information online on how to do cost-sensitive classification with mlr (awaiting your email for that).

berndbischl commented 10 years ago

Ok, I opened up #114