mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Cannot Stack Learners with Different Task Types #708

Open AustinStephen opened 1 year ago

AustinStephen commented 1 year ago

It is not possible to build a stacked learner with mlr3pipelines where the feature-generating level uses a different task type than the final prediction. The example below follows the mlr3 book and the stacked-learner gallery posts. It seems like it should be a valid use case to create additional features with unsupervised or regression learners and then produce a classification prediction (or vice versa). I'm mostly looking to confirm whether this behavior is intended, whether it is a bug, or whether there is a solution I have not found. The main use case I see is not stacking a regression learner on a categorical target, but rather using an unsupervised method, where the target isn't really relevant.

Thank you!

library(mlr3)
library(mlr3pipelines)
library(reprex)

task <- tsk("iris")

# create classification and regression learners
lrnReg <- lrn("regr.featureless")
lrnClass <- lrn("classif.featureless")
lrnClass2 <- lrn("classif.featureless")
lrnClass2$id <- "classif.featureless.2"

# create graph
grStack <- gunion(list(
  po("learner_cv", lrnReg),
  po("nop", id = "nop1")
)) %>>%
  po("featureunion", id = "featureunion1") %>>%
  po("learner", lrnClass)

# plot graph
# grStack$plot()

# cannot make graph a learner
grStack <- as_learner(grStack)
#> Error: GraphLearner can not infer task_type from given Graph
#> in/out types leave multiple possibilities: classif, regr

# resolves error but will still fail to train
grStack <- GraphLearner$new(grStack, task_type = "classif")

# attempt to train
splits <- partition(task)
grStack$train(task, splits$train)
#> Error in check_item(data[[idx]], typetable[[operation]][[idx]], varname = sprintf("%s %s (\"%s\") of PipeOp %s's $%s()", : Assertion on 'input 1 ("input") of PipeOp regr.featureless's $train()' failed: Must inherit from class 'TaskRegr', but has classes 'TaskClassif','TaskSupervised','Task','R6'.

# same stacked learners graph with both learners inheriting from TaskClassif
grStack <- gunion(list(
  po("learner_cv", lrnClass),
  po("nop", id = "nop1")
)) %>>%
  po("featureunion", id = "featureunion1") %>>%
  po("learner", lrnClass2)

grStack <- as_learner(grStack)
grStack$train(task, splits$train)
#> INFO  [12:50:33.096] [mlr3] Applying learner 'classif.featureless' on task 'iris' (iter 1/3)
#> INFO  [12:50:33.126] [mlr3] Applying learner 'classif.featureless' on task 'iris' (iter 2/3)
#> INFO  [12:50:33.143] [mlr3] Applying learner 'classif.featureless' on task 'iris' (iter 3/3)

Created on 2023-01-27 with reprex v2.0.2

mb706 commented 1 year ago

Sorry, this feature is not currently supported, but I see that it could be useful. It might be possible to hack this with PipeOpTargetTrafo: convert to a task of a different type, apply a PipeOpLearnerCV, then convert back to the original task type. However, for this one would need to know the internals of mlr3 quite well, and I am not sure myself how well it would work.

I will see if we can implement something that makes this possible / more usable.
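A minimal sketch of one ingredient of the hack described above, assuming PipeOpTargetMutate's `new_task_type` construction argument and its `trafo` hyperparameter (a function from a data.table of targets to a data.table). Recoding the factor target of iris as numeric yields a regression task that a PipeOpLearnerCV could operate on; wiring this, together with the inverse transformation, into the full stacking graph is the part that remains untested:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")

# convert the three-class factor target to a numeric target, producing a
# TaskRegr from a TaskClassif; the column name "Species" is iris-specific
po_trafo = po("targetmutate", new_task_type = "regr",
  param_vals = list(
    trafo = function(x) data.table::data.table(Species = as.numeric(x$Species))
  ))

# PipeOpTargetMutate has two output channels, "fun" (for PipeOpTargetInvert)
# and "output" (the transformed task)
converted = po_trafo$train(list(task))$output
converted$task_type
```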

AustinStephen commented 1 year ago

I wrote a custom PipeOp that takes the original task, creates a TaskClust, and trains a cluster learner. During training it stores the new task and the trained learner in the state, then uses predict_newdata() during prediction. Like the "learner_cv" PipeOp, the cluster assignments can then be joined to the data with the "featureunion" PipeOp. It may be too narrow a solution for what you are looking for, but I'm happy to polish it and make a pull request so people have a tool for feature augmentation with clustering, if that's helpful.
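A hypothetical sketch of the PipeOp described above (this is not the actual PR code; the class name, the choice of k-means with 3 centers, and the `.get_state_dt()`/`.transform_dt()` method names, which may differ between mlr3pipelines versions, are all assumptions). It builds a TaskClust from the incoming features, trains a cluster learner, stores it in the state, and appends the cluster assignment as a new feature column:

```r
library(mlr3)
library(mlr3cluster)   # provides TaskClust and clustering learners
library(mlr3pipelines)
library(R6)

PipeOpClusterFeature = R6Class("PipeOpClusterFeature",
  inherit = mlr3pipelines::PipeOpTaskPreprocSimple,
  public = list(
    initialize = function(id = "clusterfeature", param_vals = list()) {
      super$initialize(id = id, param_vals = param_vals,
        feature_types = c("numeric", "integer"))
    }
  ),
  private = list(
    .get_state_dt = function(dt, levels, target) {
      # build an unsupervised task from the feature data only
      clust_task = mlr3cluster::TaskClust$new("clusterfeature", backend = dt)
      learner = mlr3::lrn("clust.kmeans", centers = 3)$train(clust_task)
      # as described above, the trained learner goes into the state
      list(learner = learner)
    },
    .transform_dt = function(dt, levels) {
      # append the cluster assignment as a new feature column
      dt$cluster = factor(self$state$learner$predict_newdata(dt)$partition)
      dt
    }
  )
)

# usage: augment iris features with a cluster column, then classify
gr = PipeOpClusterFeature$new() %>>% po("learner", lrn("classif.featureless"))
gr$train(tsk("iris"))
```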

mb706 commented 1 year ago

Including a Learner in a PipeOp in a way that is consistent with how other wrapping PipeOps work is a bit tricky: the PipeOp's hyperparameters should contain the Learner's hyperparameters, the state should not be an R6 object if that can be avoided (to speed up parallelization and avoid issues when cloning), and the learner and learner_model active bindings should be provided; see e.g. PipeOpLearnerCV. Unfortunately, R6 does not make nested objects that reference each other easy, which is why PipeOpLearnerCV is so big.

If your solution does all of this and you also don't mind writing tests then we can merge it here, otherwise we can also implement something like this when we have some capacity.

AustinStephen commented 1 year ago

I read through the implementation of PipeOpLearnerCV when writing it. The major difference is that I operate on the data frame instead of the task, because I don't need information from the original task (it creates a cluster task from the data). I inherit from the PipeOpTaskPreproc base class and implement the get_state_dt() and transform_dt() functions. It is similar to the implementation of PipeOpScale, which computes the center and scale during training and retrieves them from self$state during prediction. Is it an issue to use the state in that context as well?

Happy to write tests and do all of those things of course.

mb706 commented 1 year ago

That actually sounds quite good, I'd be happy to look at your PR.

The tricky thing to keep in mind with self$state is that it should only contain the Learner's $state, not the trained Learner as a whole (as is done during training, while taking care not to change the state of the PipeOp during prediction). As long as get_state_dt() only returns the Learner's state and transform_dt() does the on.exit({...}) handling, it should work.
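A sketch of the state handling described above, written as the two private methods of such a preprocessing PipeOp (the field `private$.learner`, the method signatures, and the use of a TaskClust all are assumptions about the surrounding class, not actual mlr3pipelines API guarantees):

```r
private = list(
  .get_state_dt = function(dt, levels, target) {
    # train on a clone so the stored, untrained learner stays untouched
    learner = private$.learner$clone(deep = TRUE)
    learner$train(mlr3cluster::TaskClust$new("clust", backend = dt))
    # store only the Learner's $state, not the trained R6 object itself
    list(learner_state = learner$state)
  },
  .transform_dt = function(dt, levels) {
    learner = private$.learner
    learner$state = self$state$learner_state  # temporarily restore the state
    on.exit({learner$state = NULL})           # ...and reset it on exit, so
                                              # prediction leaves the PipeOp unchanged
    dt$cluster = factor(learner$predict_newdata(dt)$partition)
    dt
  }
)
```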

AustinStephen commented 1 year ago

Ah, that makes sense. I will make that change and put together a pull request in the next week or so. Thanks for the help!

bkmontgom commented 3 months ago

Any update on this? I would find this very useful.