mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Tune parameters used in mutation pipeop. #601

Closed. nipnipj closed this issue 1 month ago

nipnipj commented 3 years ago

Quick question: are the parameters of functions used inside a "mutate" PipeOp tunable? For example, the df parameter of splines::ns().

po("mutate", id = "mutate1", param_vals = list(
  mutation = list(
    var1 = ~ splines::ns(var1, df = 1)
  )
))

Is it possible to add the df parameter to ParamSet$new()?

mb706 commented 1 month ago

The way to do this is with an extra_trafo in the search space. You can either define it explicitly in a given search_space, or use to_tune() with a ParamSet whose extra_trafo returns a single element. The extra_trafo can create any kind of object, not necessarily a scalar, which is then assigned to the hyperparameter in question.

Say we have the "mtcars"-Task with features am and carb, among others. A TuneToken that searches over two dimensions and results in something that could set the mutation hyperparameter could look as follows:

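library("mlr3")
library("mlr3learners")   # provides lrn("regr.lm")
library("mlr3pipelines")  # provides po() and %>>%
library("paradox")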
tt <- to_tune(ps(
  am.dg = p_int(1, 3),
  carb.dg = p_int(1, 3),
  .extra_trafo = function(x) {
    list(
      # the following is what `mutation` will ultimately be set to
      output = list(
        am = ~ am ^ x$am.dg,
        carb = ~ carb ^ x$carb.dg
      )
    )
  }
))

It is important that .extra_trafo returns a named list with one element here, but the name of that element is ignored.

(I am using exponentiation instead of splines here because you specifically asked about PipeOpMutate, which can only generate single columns. To create splines, you could use PipeOpModelMatrix instead.)

We can now build the following pipeline:

glrn <- po("mutate", id = "mutate1", mutation = tt) %>>% lrn("regr.lm")

The search space for this pipeline now looks like this:

glrn$param_set$search_space()
#> <ParamSet(2)>
#>         id    class lower upper nlevels        default  value
#>     <char>   <char> <num> <num>   <num>         <list> <list>
#> 1:   am.dg ParamInt     1     3       3 <NoDefault[0]>       
#> 2: carb.dg ParamInt     1     3       3 <NoDefault[0]>       
#> Trafo is set.

and it creates the following kinds of samples:

generate_design_random(glrn$param_set$search_space(), 1)$transpose()[[1]]
#> $mutate1.mutation
#> $mutate1.mutation$am
#> ~am^x$am.dg
#> <environment: 0x55d1d3407610>
#> 
#> $mutate1.mutation$carb
#> ~carb^x$carb.dg
#> <environment: 0x55d1d3407610>

This is the value that mutate1.mutation is set to during optimization. The formula determines which mutation happens, and the specific values of carb.dg and am.dg are stored in the attached environment, which is created implicitly by the extra_trafo call.
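To see why the sampled formulae print as ~am^x$am.dg rather than with concrete numbers, here is a minimal base-R sketch of the same mechanism. make_mutation() is a hypothetical stand-in for the .extra_trafo function above, and the eval() calls only illustrate how a one-sided formula can be evaluated against data; this is not claimed to be how PipeOpMutate does it internally.

```r
# Hypothetical stand-in for the .extra_trafo function above.
make_mutation <- function(x) {
  list(
    am = ~ am ^ x$am.dg,
    carb = ~ carb ^ x$carb.dg
  )
}

m <- make_mutation(list(am.dg = 2, carb.dg = 1))

# The formula remembers the environment of the function call, and with it `x`.
captured <- environment(m$am)$x

# Evaluate the right-hand side of each formula: column values come from the
# list, everything else (here `x`) is looked up in the formula's environment.
am_new <- eval(m$am[[2]], list(am = 3), environment(m$am))
carb_new <- eval(m$carb[[2]], list(carb = 3), environment(m$carb))
```

Here am_new is 3^2 and carb_new is 3^1, because the sampled degrees travel with the formula.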

Note that we could also have set other hyperparameters in the pipeline to TuneTokens, and search_space() would have been extended accordingly.

Tuning this with mlr3tuning:

library("mlr3tuning")
tr <- tune(tnr("grid_search"), tsk("mtcars"), glrn, rsmp("cv"))
tr
#> <TuningInstanceBatchSingleCrit>
#> * State:  Optimized
#> * Objective: <ObjectiveTuningBatch:mutate1.regr.lm_on_mtcars>
#> * Search Space:
#>         id    class lower upper nlevels
#>     <char>   <char> <num> <num>   <num>
#> 1:   am.dg ParamInt     1     3       3
#> 2: carb.dg ParamInt     1     3       3
#> * Terminator: <TerminatorNone>
#> * Result:
#>    am.dg carb.dg regr.mse
#>    <int>   <int>    <num>
#> 1:     2       1 12.17455
#> * Archive:
#>    am.dg carb.dg regr.mse
#>    <int>   <int>    <num>
#> 1:     1       3 12.38879
#> 2:     2       3 12.38879
#> 3:     3       3 12.38879
#> 4:     2       1 12.17455
#> 5:     2       2 12.47607
#> 6:     1       2 12.47607
#> 7:     3       1 12.17455
#> 8:     3       2 12.47607
#> 9:     1       1 12.17455

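As we can see, the result has am.dg set to 2 and carb.dg set to 1. Note that am only takes the values 0 and 1 in mtcars, so raising it to any power leaves it unchanged; this is why configurations with the same carb.dg have identical regr.mse, and the reported am.dg is just one of several tied values. We can also see the specific hyperparameter value that was set: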

tr$result$x_domain
#> [[1]]
#> [[1]]$mutate1.mutation
#> [[1]]$mutate1.mutation$am
#> ~am^x$am.dg
#> <environment: 0x55d1d0ca20c0>
#> 
#> [[1]]$mutate1.mutation$carb
#> ~carb^x$carb.dg
#> <environment: 0x55d1d0ca20c0>

The values of carb.dg and am.dg are hidden inside the environment of these formulae:

tr$result$x_domain[[1]]$mutate1.mutation$am
#> ~am^x$am.dg
#> <environment: 0x55d1d0ca20c0>
environment(tr$result$x_domain[[1]]$mutate1.mutation$am) |> as.list()
#> $x
#> $x$am.dg
#> [1] 2
#> 
#> $x$carb.dg
#> [1] 1

We can see what these hyperparameters do to a task by assigning them to glrn and training the PipeOp directly:

glrn$param_set$set_values(.values = tr$result_learner_param_vals)

dummy <- as_task_regr(data.frame(am = 2, carb = 2, target = 1), target = "target")
mutated <- glrn$pipeops$mutate1$train(list(dummy))[[1]]
mutated$data()
#>    target    am  carb
#>     <num> <num> <num>
#> 1:      1     4     2

Here, "am" was squared (2^2 = 4), while "carb" was raised to the power 1 and so left unchanged.
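For completeness, the other route mentioned at the start, an explicitly defined search space, might look roughly like the sketch below. The assumption here is that the extra_trafo then has to return the value under the full hyperparameter ID of the pipeline (mutate1.mutation) instead of an ignored placeholder name, and that the pipeline itself is built without the TuneToken.

```r
library("paradox")

# Explicit search space: the trafo output is named after the full
# hyperparameter ID in the pipeline, assumed here to be "mutate1.mutation".
search_space <- ps(
  am.dg = p_int(1, 3),
  carb.dg = p_int(1, 3),
  .extra_trafo = function(x, param_set) {
    list(
      mutate1.mutation = list(
        am = ~ am ^ x$am.dg,
        carb = ~ carb ^ x$carb.dg
      )
    )
  }
)

# Draw one configuration and apply the trafo to inspect what would be set.
sampled <- generate_design_random(search_space, 1)$transpose()[[1]]
names(sampled)
```

A search space like this would then be passed through the search_space argument of tune(), together with a pipeline such as po("mutate", id = "mutate1") %>>% lrn("regr.lm") that does not already carry a TuneToken.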

(Sorry for the late reply; you probably don't have this problem any more, but it may help others searching the archives.)