mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

predict on test set rows #706

Closed. mb706 closed this 3 weeks ago.

mb706 commented 1 year ago

Go through each PipeOp and let it predict directly, since some PipeOps need the predicted set.

sebffischer commented 7 months ago

There is a memory inefficiency with the suggested approach:

Let's say we have a task with data of the form

| row_id | y | x1 |
|---|---|---|
| 1 | | |
| ... | | |
| n_use | | |
| n_use + 1 | | |
| ... | | |
| n_use + n_test | | |

Let's say we apply two preprocessing operations to x1 in two subsequent PipeOps.

In the approach we discussed, we first `task$cbind()` the preprocessed train rows (from `pipeop$.train_task()`) and then `task$rbind()` the preprocessed test rows resulting from `pipeop$.predict_task()`. The problem with this is that the `task$rbind()` of the test rows happens in both preprocessing PipeOps, which means that in both PipeOps a table of the form below is rbinded.

| row_id | y | x1 |
|---|---|---|
| 1 | | |
| ... | | |
| n_test | | |
The wasteful thing here is that we rbind column y, which contains information that is already present in the task's backend and is now being duplicated. This problem gets worse when there are many columns that are left untouched by a preprocessing PipeOp.

The virtual backend that would result from applying two preprocessing operations on x1 would internally store n_use + n_test + n_test * n_pipeop rows (containing all un-preprocessed columns plus the newly created columns), which is not great. Fortunately, there is a more memory-efficient way to achieve the same result, which works (at least) for the standard scenario of distinct and unique use rows and test rows (usually the case, except for bootstrapping and insample resampling). That approach requires the following:
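The duplication described above can be sketched with plain data.frames (illustrative only; this is not the actual mlr3 Task/DataBackend API, and n_use, n_test, n_pipeop are placeholder sizes):

```r
# Each preprocessing step rbinds the full test rows, including the
# untouched y column, so y (and row_id) end up stored once per PipeOp.
n_use <- 4; n_test <- 2; n_pipeop <- 2

set.seed(1)
base <- data.frame(row_id = seq_len(n_use + n_test),
                   y  = rnorm(n_use + n_test),
                   x1 = rnorm(n_use + n_test))

backend <- base
for (i in seq_len(n_pipeop)) {
  test_rows <- base[base$row_id > n_use, ]
  test_rows$x1 <- scale(test_rows$x1)[, 1]  # stand-in for preprocessing
  backend <- rbind(backend, test_rows)      # duplicates row_id and y
}

nrow(backend)  # 10 = n_use + n_test + n_test * n_pipeop
```

The row count grows linearly with the number of preprocessing PipeOps even though only x1 was ever changed.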

sebffischer commented 7 months ago

In addition to that, retrieving the use rows will become more and more expensive because of the many DataBackendRbinds. This is because calling `databackendrbind$data()` first goes through the rbinded backend (b2) to find the requested rows and then through b1.
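A toy model of this lookup order (not the real DataBackendRbind class) shows why the cost compounds: every layer of rbinding is a node whose data retrieval must first search the appended backend b2 before delegating the remaining row ids to b1.

```r
# Toy sketch: a "backend" is either a plain data.frame (leaf) or a
# list(b1 = ..., b2 = ...) representing one rbind layer.
lookup <- function(backend, ids) {
  if (is.data.frame(backend)) {            # leaf: a plain table
    return(backend[backend$row_id %in% ids, ])
  }
  hit  <- lookup(backend$b2, ids)          # search the rbinded part first
  rest <- setdiff(ids, hit$row_id)         # then fall through to b1
  rbind(lookup(backend$b1, rest), hit)
}

b1 <- data.frame(row_id = 1:4, x1 = 1:4)
layered <- list(b1 = list(b1 = b1,
                          b2 = data.frame(row_id = 5:6, x1 = 5:6)),
                b2 = data.frame(row_id = 5:6, x1 = 15:16))

# Fetching the use rows 1:4 traverses every rbind layer (each returning
# nothing) before finally reaching b1.
lookup(layered, 1:4)
```

With k preprocessing PipeOps each contributing an rbind layer, retrieving the use rows performs k fruitless searches before touching the backend that actually holds them.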

sebffischer commented 7 months ago

Some possible solutions:

  1. Accept this inefficiency (I am not a huge fan)
  2. Hack our way around it (remove the internal cbinded backend containing the preprocessed use rows after $.train_task(), then rbind its data with the preprocessed test rows and $cbind() it again to the task) (I am also not a huge fan)
  3. Change the $.train_task() method everywhere (maybe (?))
  4. Somehow not $rbind() the test and use tasks, to avoid the problem (I think this should be considered)
mb706 commented 3 weeks ago

Now solved by #770