mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

PipelineLearner #14

Closed zzawadz closed 5 years ago

zzawadz commented 6 years ago

I think that after creation, the graph should be stored inside a PipeLearner class.

As we discussed, the last node will be required to be a Learner, so its parameters like task_type and predict_types can be copied to the PipeLearner. Other parameters like packages will be gathered during the initialization of the object.

The object will be created by passing the first node of the graph, or by passing a list of PipeOps: PipelineLearner$new(list(op1, op2, op3)).

The train method will call the trainGraph function.

I'm still thinking about how to manage the parameters for each node.

zzawadz commented 6 years ago

I think that for parameters the most straightforward way would be to traverse the entire graph, gather the parameters, and prefix their names with the id of the node. So, for example, the scale parameter in the scaler node becomes scaler:scale. The user will then be able to set all parameter values by passing a standard list("scaler:scale" = value, ...), as with other learners. trainGraph will then be responsible for setting the proper values on the specific nodes.

See the example below (it only creates the new parameters list):

  op1 = PipeOpScaler$new("myscaler")
  op2 = PipeOpPCA$new()
  op1$set_next(list(op2))

  lrn = mlr_learners$get("classif.rpart")
  op3 = PipeOpLearner$new(learner = lrn)
  op2$set_next(list(op3))

  pipeline_gather_params(op1)

  # ParamSet: parset 
  # Parameters: 
  # myscaler:center [logical] (Default: TRUE)
  # myscaler:scale [logical] (Default: TRUE)
  # classif.rpart:minsplit [integer] (Default: 20): {1, ..., Inf}
  # classif.rpart:cp [numeric] (Default: 0.01): [0, 1]
  # classif.rpart:maxcompete [integer] (Default: 4): {0, ..., Inf}
  # classif.rpart:maxsurrogate [integer] (Default: 5): {0, ..., Inf}
  # classif.rpart:maxdepth [integer] (Default: 30): {1, ..., 30}
  # classif.rpart:xval [integer] (Default: 10): {0, ..., Inf}
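The prefixing step itself can be sketched in plain R, without any mlr dependencies. This is a hypothetical helper (prefix_params is not part of any package): given a mapping from node id to parameter names, it builds the flattened, prefixed names described above.

```r
# Hypothetical sketch of the name-prefixing step: each node contributes
# its parameter names, prefixed with the node id and a separator.
prefix_params = function(nodes, sep = ":") {
  # nodes: named list mapping node id -> character vector of parameter names
  unlist(lapply(names(nodes), function(id) {
    paste(id, nodes[[id]], sep = sep)
  }))
}

nodes = list(
  myscaler = c("center", "scale"),
  classif.rpart = c("minsplit", "cp")
)
prefix_params(nodes)
# "myscaler:center" "myscaler:scale" "classif.rpart:minsplit" "classif.rpart:cp"
```

With unique node ids, the prefixed names are unique as well, so a single flat list of values can be routed back to the right node by splitting on the separator.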
pfistfl commented 6 years ago

Thanks, this already looks really promising.

Yes, I think this is what Bernd and I also came up with.

We thought about which separator (e.g. :) to use, and I think : is indeed the most sensible for now. If we also require unique ids for every node in the graph, this would almost guarantee that we get no naming clashes.

I think we can again look at how this is done in the mlr wrappers / multiplexer.

pipeline_gather_params(): if we have a Graph / Pipeline class, I think this is what should happen automatically when we initialize it. Adding / dropping an op should then also make sure that the ParamSet is refreshed.

I think I will get around to working on this on Monday!

zzawadz commented 6 years ago

I think that unique ids are a must-have, and this should be checked when the PipeLearner is created. It will also be useful for overloading the [[ operator.

berndbischl commented 5 years ago

we have GraphLearner now
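For reference, a sketch of how the proposal above looks with the released API (assuming current mlr3 and mlr3pipelines are installed; note that the id separator ended up being . rather than :):

```r
library(mlr3)
library(mlr3pipelines)

# Build the same pipeline with the released API and wrap it as a learner.
graph = po("scale") %>>% po("pca") %>>% lrn("classif.rpart")
glrn = GraphLearner$new(graph)

# Parameters are gathered as proposed above, prefixed with node ids,
# e.g. scale.center, classif.rpart.cp, ...
glrn$param_set$values$classif.rpart.cp = 0.05

glrn$train(tsk("iris"))
```

The flattened, id-prefixed ParamSet is maintained automatically by the GraphLearner, so tuning over the whole pipeline works like tuning any other learner.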