Closed zzawadz closed 5 years ago
> I think it would be better to rephrase this as:
Absolutely, I think we want to allow as many options as possible, while abstracting many of the complicated things away from the user.
```r
# Bagging as you describe it:
op1 %>>% rep(k, op2 %>>% op3) %>>% op4

# Could then become something like:
op1 %>>% pipeOpBagging(n = 100, rate = 0.6, learner = "classif.rpart")
```
while still allowing for the first option.
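To pin down what such a bagging operator has to do internally, here is a minimal, framework-free sketch in plain base R (no pipeline code; the `bagging` function, the `lm` learner, and the parameter names are stand-ins chosen for illustration): train k models on row subsamples, then average their predictions over the full data.

```r
# Minimal bagging sketch in base R: fit k models on row subsamples,
# then average their predictions over the *full* data set.
bagging <- function(data, target, k = 100, rate = 0.6) {
  n <- nrow(data)
  models <- lapply(seq_len(k), function(i) {
    idx <- sample(n, size = floor(rate * n), replace = TRUE)
    lm(reformulate(".", response = target), data = data[idx, , drop = FALSE])
  })
  # ensemble prediction: one averaged value per row of the whole data
  preds <- vapply(models, predict, numeric(n), newdata = data)
  rowMeans(preds)
}

set.seed(1)
p <- bagging(mtcars, "mpg", k = 25, rate = 0.6)
length(p)  # one prediction per row of mtcars
```

The point of the sketch is the prediction step: however the training rows are subsampled, `predict` always runs on the complete data, which is exactly the property the later ensemble step needs.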
> For me, the most natural way is to bind their results together and send them to the next node (but what happens when one of the previous nodes returns a SparseMatrix and the second a data.frame? I don't know yet).
This is something we are still discussing. One option would be looking at the DataBackend and writing converters for them, I guess.
My current idea (which I am not fully convinced of) is:

```r
db1 = DataBackendDataTable()
db2 = DataBackendSparseMatrix()

# Either use the correct op (and implement a new one if it does not exist):
db1 %>>% pipeOpPCA() %>>% PipeOpLearner$new("classif.rpart")
db2 %>>% pipeOpSparsePCA() %>>% PipeOpLearner$new("classif.rpart")

# Or convert explicitly:
db2 %>>% convertDataBackendToDataTable() %>>% pipeOpPCA() %>>% PipeOpLearner$new("classif.rpart")

# This could maybe happen automatically:
db2 %>>% pipeOpPCA() %>>% PipeOpLearner$new("classif.rpart")
```
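For the explicit-conversion variant, the converter could itself be an ordinary op that only swaps the storage backend. A hedged sketch, reusing the hypothetical names from the pseudo code above (densifying the sparse matrix via `as.matrix` is just the simplest possible strategy, not a recommendation):

```r
# Hypothetical converter: change only how the data is stored and
# leave the data itself untouched, so downstream ops can assume
# a data.table backend.
convertDataBackendToDataTable <- function(db) {
  dense <- as.matrix(db$data)  # densify; a real version would avoid this for big data
  DataBackendDataTable(data.table::as.data.table(dense))
}

# db2 %>>% convertDataBackendToDataTable() %>>% pipeOpPCA() %>>% ...
```

The "automatic" variant would then amount to the pipeline inserting such a converter whenever the backend type of one node's output does not match what the next op accepts.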
I have another problem related to bagging. Up to now we (or maybe just me :)) thought that the operator for resampling the data could simply be put before the other pipeop, like `ResampleOp %>>% op`. However, I don't think this is possible: if we resample the data, we still need to send the prediction based on the whole sample to the next node. Otherwise the `PipeOpEnsemble*` operators won't be able to operate, because the rows from the previous nodes won't match.
I think that ResampleOp should instead wrap the operator and work like this: train the wrapped pipeop on the resampled data, but produce the output with `pipeop$predict(task)` on the full task, using the wrapped pipeop.

Zygmunt, to what piece of pseudo code are you referring exactly? I do not see a problem with my initial suggestion, and there we had a downsample operator, not a resample operator:
```r
op1 = PipeOpNULL$new()
op2 = PipeOpDownSample$new(rate = 0.6)
op3 = PipeOpLearner$new("classif.rpart")
op4 = PipeOpEnsembleAverage$new()

Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4))
```
I think we are talking about a graph like this:
```
     B1 -> C1 \
A -> B2 -> C2  -> D
     B3 -> C3 /
```

where:

- A: op1
- C{1-3}: PipeOpLearner$new("classif.rpart")
- B{1-3}: PipeOpDownSample$new(rate = 0.6)
- D: PipeOpEnsembleAverage$new()
So each of C1, C2, ... will get a different (down-)sample to train on (the content of the task objects will be different). But we need the same data outputted from each node to make an ensemble. Right now, if C1 gets rows 1, 3, 4, C2 gets rows 2, 3, 10, and C3 gets rows 11, 31, 1, their train functions will output (saved in the result field) only the results for those selected rows, and we won't be able to build an ensemble. We need the result based on the whole data.
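The wrapping idea can be written down as pseudo code (R6-style; all class and method names here are hypothetical, not an existing API): train on the downsampled task, but always predict on the full task, so every C node emits a result covering all rows.

```r
# Pseudo code: a resampling op that *wraps* another pipeop.
PipeOpDownSampleWrapper = R6::R6Class("PipeOpDownSampleWrapper",
  public = list(
    pipeop = NULL,
    rate = NULL,
    initialize = function(pipeop, rate = 0.6) {
      self$pipeop = pipeop
      self$rate = rate
    },
    train = function(task) {
      # fit the wrapped op on a (down-)sample of the rows ...
      idx = sample(task$nrow, size = floor(self$rate * task$nrow))
      self$pipeop$train(task$clone()$filter(idx))
      # ... but output predictions for the *whole* task, so the rows
      # match across all branches feeding the ensemble op
      self$pipeop$predict(task)
    }
  )
)
```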
We have bagging, and unit tests for this.
Use case: Bagging
Can we write the above in a shorter, better way?
My comments:
I think this is very important for the whole project. I bet that making bagging easy to define using the pipeline will make the whole functionality easy to use.
Let's start with

```r
Pipeline$new(list(op1, rep(k, op2), rep(k, op3), op4))
```

I think it would be better to rephrase this as:

```r
op1 %>>% rep(k, op2 %>>% op3) %>>% op4
```

It might be a bit easier to reason about, because we know which part will be replicated, and we don't need to worry about the sizes of op2 and op3.
Probably we need to define what happens when a Node has multiple predecessors. For me, the most natural way is to bind their results together and send them to the next node (but what happens when one of the previous nodes returns a SparseMatrix and the second a data.frame? I don't know yet).
So in that case, D gets all the results from all previous nodes.

The more interesting problem is when there are multiple successors.
Probably the easiest solution is that each node X, Y, Z gets as input all the results from the previous nodes. So when two lists of nodes are concatenated, all the nodes from the first list are set as predecessors of the nodes from the second list. Probably the method set_prev should be renamed to add_prev, because each node will be able to have multiple predecessors. So we can rephrase the previous example in the pseudo code as:
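The add_prev idea itself can be sketched without any framework, in plain base R (environments standing in for nodes; `new_node` and `add_prev` are hypothetical helper names):

```r
# Each node keeps a *list* of predecessors, so add_prev appends
# instead of overwriting (which a set_prev would do).
new_node <- function(id) {
  node <- new.env()
  node$id <- id
  node$prev <- list()
  node
}

add_prev <- function(node, pred) {
  node$prev[[length(node$prev) + 1]] <- pred
  invisible(node)
}

# Concatenating two lists of nodes: every node of the first list
# becomes a predecessor of every node of the second list.
layer1 <- lapply(c("B1", "B2", "B3"), new_node)
layer2 <- lapply(c("C1", "C2", "C3"), new_node)
for (succ in layer2) for (pred in layer1) add_prev(succ, pred)

length(layer2[[1]]$prev)  # 3: C1 has all of B1, B2, B3 as predecessors
```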