mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Integration of custom performance measure requiring `predict.type = "node"` for decision trees #2655

MathieuMarauri closed this issue 4 years ago

MathieuMarauri commented 4 years ago

Hello,

First, thank you for the nice work. I enjoy working with mlr and look forward to using mlr3.

I have a very specific use case: I want a performance measure that is the mean number of variables used for each prediction. For decision trees, this can be used to measure the simplicity of the tree (see this paper, section 2.4.2 on Fast and Frugal decision trees).

I can compute this measure on trees generated by mlr by retrieving the learner.model object, but I cannot integrate it into mlr by following the tutorial, because I need the prediction type to be "node", as shown here.

Did I miss something, and is predict.type = "node" already possible? If not, is there a way to make it possible?

Or is what I am trying to achieve too specific to be integrated into mlr?

library("mlr")
library("stringi")

# install.packages("rpart")
# install.packages("partykit")

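mvu = function(task, model) {
  # Start from the fitted model and convert it to a party object only if needed,
  # so that `tree` is always defined
  tree = model$learner.model
  if (!inherits(tree, "party")) tree = partykit::as.party(tree)
  # One rule string per node, named by node id (internal partykit function)
  rls = partykit:::.list.rules.party(tree)
  # Rule of the terminal node that each training observation falls into
  rule = rls[as.character(predict(tree, type = "node"))]
  # Count occurrences of column names in each rule
  vu = stri_count_regex(rule, paste0("(", paste(names(getTaskData(task)), collapse = "|"), ")"))
  return(mean(vu))
}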

model = train(learner = "classif.rpart", task = iris.task)
mvu(iris.task, model)
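
For comparison, here is a minimal sketch of the same idea that avoids the internal partykit:::.list.rules.party function and walks the tree with partykit's exported node accessors instead. It is only an illustration under a few assumptions: path_vars and mean_vars_used are hypothetical helper names, the tree is fitted directly with rpart rather than through mlr, and the sketch counts distinct variables on each path, whereas the regex approach above appears to count every occurrence of a variable name in the rule.

```r
library("rpart")
library("partykit")

# Hypothetical helper: map each terminal node id to the distinct
# split-variable ids found on the path from the root to that node
path_vars = function(node, vars = integer(0)) {
  if (is.terminal(node)) {
    out = list(unique(vars))
    names(out) = as.character(id_node(node))
    return(out)
  }
  # Inner node: record its split variable, then descend into the children
  vars = c(vars, varid_split(split_node(node)))
  do.call(c, lapply(kids_node(node), path_vars, vars = vars))
}

# Hypothetical helper: mean number of distinct variables used per prediction,
# computed on the observations the tree was fitted on
mean_vars_used = function(tree) {
  vars_by_leaf = path_vars(node_party(tree))
  leaf = as.character(predict(tree, type = "node"))
  mean(lengths(vars_by_leaf[leaf]))
}

tree = as.party(rpart(Species ~ ., data = iris))
mean_vars_used(tree)
```

A tree that consists of the root only yields 0 with this sketch, since no variable is used on any path.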

Thank you, Mathieu

larskotthoff commented 4 years ago

Is this resolved? How did you solve the problem in the end?

MathieuMarauri commented 4 years ago

I can integrate the measure using the following code.

library("mlr")
library("stringi")
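# function to measure the MVU (mean variables used)
mvu_func = function(task, model, pred, feats, extra.args) {
  # Convert to a party object only if needed, so that `tree` is always defined
  tree = model$learner.model
  if (!inherits(tree, "party")) tree = partykit::as.party(tree)
  # One rule string per node, named by node id (internal partykit function)
  rls = partykit:::.list.rules.party(tree)
  # Rule of the terminal node that each training observation falls into
  rule = rls[as.character(predict(tree, type = "node"))]
  # Count occurrences of column names in each rule
  vu = stringi::stri_count_regex(rule, paste0("(", paste(names(getTaskData(task)), collapse = "|"), ")"))
  return(mean(vu))
}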

# generate the measure object
mvu = makeMeasure(
  id = "mvu", name = "Mean Variables Used",
  properties = c("classif", "classif.multi", "regr", "multilabel", "surv", "cluster", "req.model", "req.task"),
  minimize = TRUE, best = 1, worst = Inf,
  fun = mvu_func,
  note = "Only available for decision trees (objects that can be converted to a party object)"
)

model = train("classif.rpart", iris.task)
pred = predict(model, iris.task)
performance(pred, measures = mvu, task = iris.task, model = model)

This code does not work with the ctree learner in mlr. That learner uses the party package, and partykit::as.party(model$learner.model) cannot convert the models it produces. The code does work for trees fitted with the partykit::ctree function.
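
To illustrate the last point, here is a small sketch that fits a tree with partykit::ctree directly, outside of mlr (so it is a workaround, not an mlr learner). The fitted object already inherits from "party", which means no as.party conversion is needed before asking for node predictions.

```r
library("partykit")

# Fit a conditional inference tree with partykit rather than party
fit = ctree(Species ~ ., data = iris)

# The result is already a party object, so as.party() is not required
inherits(fit, "party")

# Terminal node id for each training observation
nodes = predict(fit, type = "node")
table(nodes)
```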

Note that the measure could in theory be defined for every type of model, but the way to compute it would be completely different (e.g. for regression it is the number of predictors used in the model).

My problem is not so much with mlr as with finding an mvu_func that works for every possible tree and ideally for every model. As this is not related to mlr, I closed the issue.

Anyway, if you know a way to compute such a performance measure (the average number of predictors used to make predictions), I would love to be pointed in the right direction.

Cheers, Mathieu

larskotthoff commented 4 years ago

Thanks -- I'm not aware of anything that does this in general. As a crude approximation, you could save the model and check the size of the saved file though.
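
A rough sketch of that file-size idea, assuming saveRDS as the serialization format and a hypothetical helper named model_file_size. It uses rpart directly for brevity. The number is only a crude proxy: the serialized object can hold more than the tree structure itself, so sizes are best compared between models fitted on the same data.

```r
library("rpart")

# Hypothetical helper: size in bytes of a model serialized with saveRDS()
model_file_size = function(model) {
  path = tempfile(fileext = ".rds")
  on.exit(unlink(path))  # remove the temporary file afterwards
  saveRDS(model, path)
  file.size(path)
}

# Two trees on the same data, one restricted to a single split level
shallow = rpart(Species ~ ., data = iris, control = rpart.control(maxdepth = 1))
deeper = rpart(Species ~ ., data = iris)

model_file_size(shallow)
model_file_size(deeper)
```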