mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0
141 stars 25 forks source link

PipeOpFeatureUnion with differing row IDs? #216

Open mb706 opened 5 years ago

mb706 commented 5 years ago

PipeOpFeatureUnion could under some circumstances want to unite tasks that have differing row IDs, e.g. after PipeOpSubsample on two different paths sampled (and $filter()ed) different sets of rows.

graph = greplicate(PipeOpSubsample$new() %>>%
    PipeOpLearnerCV$new("classif.rpart"), 2) %>>%
  PipeOpFeatureUnion$new()
graph$plot()  # this is what it looks like

graph$train("iris")  # assertion error

mlr-org/mlr3#309 could solve part of this, but the problem goes deeper:

pfistfl commented 4 years ago

what if we do sampling with replacement?

If we subsample with replacement, we basically have the row_id twice in the data used for training the learner. cv-ed predictions will be the same for duplicated instances, this should not be a problem. We might get a problem when we do something stochastic after PipeOpSubsample, but I currently can not think about anything. As a solution, I would suggest to simply choose the first element if multiple are provided until we encounter a solution where this yields invalid result. Note also, that the first problem we'd encounter here is an invalid Resampling objects as rows could end up in train as well as in test data I guess.

what if PipeOpLearnerCV has a resampling that predicts some entries multiple times, e.g. RepCV or bootstrapping? We currently only allow cv for learnercv and therefore this should not be a problem.

A more future proof version would be doing aggregation, but how correct aggregation would look like is unclear and depends on the situation. Therefore I would argue to not tackle this now.

Currently blocked by mlr-org/mlr3#309

mb706 commented 3 years ago

Official stance of mlr3 is now that we should solve this "manually" from within mlr3pipelines.

berndbischl commented 3 months ago

not sure if i see a convincing usecase here. if we don't find it, we should close