mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Use case: Bagging #12

Closed zzawadz closed 5 years ago

zzawadz commented 5 years ago

Use case: Bagging


k = 100
op1 = PipeOpNULL$new()
op2 = PipeOpDownSample$new(rate = 0.6)
ops2 = repop(k, op2) # Auto-set Ids? # replicate with s3?
op3 = PipeOpLearner$new("classif.rpart")
ops3 = repop(k, op3)
op4 = PipeOpEnsembleAverage$new() # it needs to know that it only accepts learners as input

g1 = GraphNode$new(op1)
gs2 = lapply(ops2, GraphNode$new)
gs3 = lapply(ops3, GraphNode$new)
g4 = GraphNode$new(op4)

g1$set_next(gs2)
for (i in 1:k)
  gs2[[i]]$set_next(gs3[[i]])
g4$set_prev(gs3)

can we write the above in a shorter, better way?

op1 = PipeOpNULL$new()
op2 = PipeOpDownSample$new(rate = 0.6)
op3 = PipeOpLearner$new("classif.rpart")
op4 = PipeOpEnsembleAverage$new()

Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4))

My comments:

I think this is very important for the whole project. Making bagging easy to define using the pipeline will make the whole functionality much easier to use.

Let's start with Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4)). I think it would be better to rephrase this as:

p1 <- Pipeline$new(list(op2, op3))
Pipeline$new(list(op1, rep(k, p1), op4))

# or using some sugar
op1 %>>% rep(k, op2 %>>% op3) %>>% op4

It might be a bit easier to reason about, because we know exactly which part will be replicated, and we don't need to worry about matching the sizes of the op2 and op3 lists.
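A sketch of how such replication could work under the hood (everything here is hypothetical; `repop`, the `clone()` method and the `id` field are assumptions, not an existing API). Each copy would be a deep clone with an auto-suffixed id, which also answers the "Auto-set Ids?" question above:

```r
# Hypothetical sketch: replicate a pipe op k times with auto-suffixed ids.
repop = function(k, op) {
  lapply(seq_len(k), function(i) {
    copy = op$clone(deep = TRUE)      # assumed R6-style deep clone
    copy$id = paste0(op$id, "_", i)   # e.g. "downsample_1", ..., "downsample_k"
    copy
  })
}
```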

     B1->C1 \
A -> B2->C2 -> D
     B3->C3 /

We probably need to define what happens when a node has multiple predecessors. For me, the most natural way is to bind their results together and send them to the next node (but what happens when one of the previous nodes returns a SparseMatrix and another a data.frame? I don't know yet).

So

A \
B --> D
C /

So in that case, D gets all the results from all previous nodes.

The more interesting problem is when there are multiple successors.

A \   X
B --> Y
C /   Z

Probably the easiest solution is that each of the nodes X, Y, Z gets as input all the results from the previous nodes. So when two lists of nodes are concatenated, every node from the first list becomes a predecessor of every node from the second list.

The method set_prev should probably be renamed to add_prev, because each node will be able to have multiple predecessors.

So we can rephrase the previous example in pseudo code as:

p1 <- list(A, B, C)
p2 <- list(X, Y, Z)
op3 <- ...

Pipeline$new(list(p1, p2, op3))
## This will result in the following calls:
X$add_prev(A); X$add_prev(B); X$add_prev(C)
Y$add_prev(A); Y$add_prev(B); Y$add_prev(C)
Z$add_prev(A); Z$add_prev(B); Z$add_prev(C)
... operations for op3

# when the size of p1 is not equal to p2
# it works the same way
p1 <- list(A, B, C)
p2 <- list(X, Y)
op3 <- ...

Pipeline$new(list(p1, p2, op3))
X$add_prev(A); X$add_prev(B); X$add_prev(C)
Y$add_prev(A); Y$add_prev(B); Y$add_prev(C)
... operations for op3
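
A minimal sketch of what add_prev with symmetric wiring could look like (assuming R6; GraphNode, prev_nodes and next_nodes are hypothetical names from this discussion, not an existing API):

```r
library(R6)

GraphNode = R6Class("GraphNode",
  public = list(
    op = NULL,
    prev_nodes = NULL,
    next_nodes = NULL,
    initialize = function(op) {
      self$op = op
      self$prev_nodes = list()
      self$next_nodes = list()
    },
    add_prev = function(node) {
      # wire both directions so the graph can be traversed either way
      self$prev_nodes = c(self$prev_nodes, list(node))
      node$next_nodes = c(node$next_nodes, list(self))
      invisible(self)
    }
  )
)
```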
pfistfl commented 5 years ago

would be better to rephrase this as:

Absolutely, I think we want to allow as many options as possible, while abstracting many of the complicated things away from the user.

# Bagging as you describe it:
op1 %>>% rep(k, op2 %>>% op3) %>>% op4
# Could then become something like
op1 %>>% pipeOpBagging(n = 100, rate = 0.6, learner = "classif.rpart")

while still allowing for the first option.
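The sugar could simply expand to the explicit graph, so both spellings stay equivalent (a sketch; pipeOpBagging and the ops it composes are the hypothetical names from above):

```r
# Hypothetical: bagging sugar as a thin wrapper around the explicit form.
pipeOpBagging = function(n, rate, learner) {
  rep(n, PipeOpDownSample$new(rate = rate) %>>% PipeOpLearner$new(learner)) %>>%
    PipeOpEnsembleAverage$new()
}
```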

For me, the most natural way is to bind their results together and send them to the next node (but what happens when one of the previous nodes returns SparseMatrix and the second data.frame? I don't know now).

This is something we are still discussing. One option would be to look at the DataBackend and write converters for them, I guess. My current idea (which I am not fully convinced of) is:

db1 = DataBackendDataTable() 
db2 = DataBackendSparseMatrix() 

# Either use the correct Op (and implement a new one if it does not exist)
db1 %>>% pipeOpPCA() %>>% PipeOpLearner$new("classif.rpart")
db2 %>>% pipeOpSparsePCA() %>>% PipeOpLearner$new("classif.rpart")

# Or convert
db2 %>>% convertDataBackendToDataTable() %>>% pipeOpPCA() %>>% PipeOpLearner$new("classif.rpart")
# This can maybe happen automatically
db2 %>>% pipeOpPCA() %>>% PipeOpLearner$new("classif.rpart")
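
The automatic variant could be a dispatch step before each op runs (a sketch only; supported_backends and convert_backend are assumptions, nothing here exists yet):

```r
# Hypothetical: insert a converter automatically when backend types mismatch.
ensure_backend = function(op, input) {
  if (!inherits(input, op$supported_backends)) {
    input = convert_backend(input, to = op$supported_backends[1])
  }
  input
}
```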
zzawadz commented 5 years ago

I have another problem related to bagging. Right now we (or maybe just me :)) thought that the operator for resampling the data could be put just before another pipeop, like this: ResampleOp >> op. However, I don't think this is possible, because if we resample the data, we still need to send a prediction based on the whole sample to the next node. Otherwise PipeOpEnsemble* won't be able to operate, because the rows from the previous nodes won't match.

I think that ResampleOp should wrap the operator and work like this:

  1. Resamples the data, and creates a temporary task T.
  2. T is used to train the wrapped pipeop. The pipeop result is discarded.
  3. Then it uses the original task to make a prediction (call pipeop$predict(task)) using the wrapped pipeop.
  4. It wraps the result of the prediction to create a new task which is stored as a result and can be used by next nodes.
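
The four steps above might look roughly like this (pure pseudocode; subsample(), task_from_prediction() and the train/predict methods are all assumed names):

```r
# Hypothetical sketch of a wrapping ResampleOp's train step.
resample_wrap_train = function(inner, task, rate = 0.6) {
  tmp = subsample(task, rate)          # 1. temporary, resampled task T
  inner$train(tmp)                     # 2. train on T; train result discarded
  pred = inner$predict(task)           # 3. predict on the *original* task
  task_from_prediction(pred, task)     # 4. wrap prediction as the new task
}
```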
berndbischl commented 5 years ago

Zygmunt, which piece of pseudo code are you referring to exactly? I do not see a problem with my initial suggestion. And there we had a downsample operator, not a resample operator.

zzawadz commented 5 years ago
op1 = PipeOpNULL$new()
op2 = PipeOpDownSample$new(rate = 0.6)
op3 = PipeOpLearner$new("classif.rpart")
op4 = PipeOpEnsembleAverage$new()

Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4))

I think we are talking about a graph like this:

     B1->C1 \
A -> B2->C2 -> D
     B3->C3 /

A      - op1
B{1-3} - PipeOpDownSample$new(rate = 0.6)
C{1-3} - PipeOpLearner$new("classif.rpart")
D      - PipeOpEnsembleAverage$new()

So each of C1, C2, ... will get a different (down-)sample to train on (the contents of the task objects will differ). But we need the same rows in the output of each node to build an ensemble.

Right now, if C1 gets rows 1,3,4, C2 gets rows 2,3,10, and C3 gets rows 11,31,1, then the train function will output (saved in the result field) only the results for those selected rows, and we won't be able to build an ensemble. We need results based on the whole data.

berndbischl commented 5 years ago

We have bagging now, and unit tests for it.