Open mb706 opened 5 years ago
what if we do sampling with replacement?
If we subsample
with replacement, we basically have the row_id
twice in the data used for training the learner. cv-ed predictions will be the same for duplicated instances, this should not be a problem. We might get a problem when we do something stochastic after PipeOpSubsample
, but I currently can not think about anything.
As a solution, I would suggest to simply choose the first element if multiple are provided until we encounter a solution where this yields invalid result.
Note also, that the first problem we'd encounter here is an invalid Resampling
objects as rows could end up in train as well as in test data I guess.
what if PipeOpLearnerCV has a resampling that predicts some entries multiple times, e.g. RepCV or bootstrapping? We currently only allow
cv
for learnercv and therefore this should not be a problem.
A more future proof version would be doing aggregation, but how correct aggregation would look like is unclear and depends on the situation. Therefore I would argue to not tackle this now.
Currently blocked by mlr-org/mlr3#309
Official stance of mlr3 is now that we should solve this "manually" from within mlr3pipelines.
not sure if i see a convincing usecase here. if we don't find it, we should close
PipeOpFeatureUnion could under some circumstances want to unite tasks that have differing row IDs, e.g. after PipeOpSubsample on two different paths sampled (and
$filter()
ed) different sets of rows.mlr-org/mlr3#309 could solve part of this, but the problem goes deeper: