mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

predict on test set rows #706

Closed. mb706 closed this 3 weeks ago.

mb706 commented 1 year ago

Go through each PipeOp and let it predict directly, since some PipeOps need the predicted set.

sebffischer commented 7 months ago

There is a memory inefficiency with the suggested approach:

Let's say we have a task with data of the form

| row_id | y | x1 |
|---|---|---|
| 1 | | |
| ... | | |
| n_use | | |
| n_use + 1 | | |
| ... | | |
| n_use + n_test | | |

Let's say we apply two preprocessing operations to x1 in two subsequent PipeOps.

In the approach we discussed, we first `task$cbind()` the preprocessed train rows (from `pipeop$.train_task()`) and then `task$rbind()` the preprocessed test rows resulting from `pipeop$.predict_task()`. The problem with this is that the `task$rbind()` of the test rows happens in both preprocessing PipeOps, which means that in both PipeOps a table of the form below is rbinded.

| row_id | y | x1 |
|---|---|---|
| 1 | | |
| ... | | |
| n_test | | |
The wasteful thing here is that we rbind column y, which contains information that is already present in the task's backend and is now being duplicated. This problem gets worse when there are many columns that are left untouched by a preprocessing PipeOp.

The virtual backend that would result from applying two preprocessing operations on x1 would internally store n_use + n_test + n_test * n_pipeop rows (containing all un-preprocessed columns plus the newly created columns), which is not great. Fortunately, there is a more memory-efficient way to achieve the same result, which works (at least) for the standard scenario of distinct and unique use rows and test rows (usually the case, except for bootstrapping and insample resampling). That approach requires the following:
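The duplication described above can be sketched with plain data.frames (illustrative only; this is not the actual mlr3 Task/DataBackend API, and n_use, n_test, n_pipeop are placeholder sizes):

```r
# Each preprocessing step rbinds the full test rows, including the
# untouched y column, so y (and row_id) end up stored once per PipeOp.
n_use <- 4; n_test <- 2; n_pipeop <- 2

set.seed(1)
base <- data.frame(row_id = seq_len(n_use + n_test),
                   y  = rnorm(n_use + n_test),
                   x1 = rnorm(n_use + n_test))

backend <- base
for (i in seq_len(n_pipeop)) {
  test_rows <- base[base$row_id > n_use, ]
  test_rows$x1 <- scale(test_rows$x1)[, 1]  # stand-in for preprocessing
  backend <- rbind(backend, test_rows)      # duplicates row_id and y
}

nrow(backend)  # 10 = n_use + n_test + n_test * n_pipeop
```

The row count grows linearly with the number of preprocessing PipeOps even though only x1 was ever changed.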

sebffischer commented 7 months ago

In addition to that, retrieving the use rows will become more and more expensive because of the many DataBackendRbinds. This is because calling `databackendrbind$data()` first goes through the rbinded backend (b2) to find the requested rows and then through b1.
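A toy model of this lookup order (not the real DataBackendRbind class) shows why the cost compounds: every layer of rbinding is a node whose data retrieval must first search the appended backend b2 before delegating the remaining row ids to b1.

```r
# Toy sketch: a "backend" is either a plain data.frame (leaf) or a
# list(b1 = ..., b2 = ...) representing one rbind layer.
lookup <- function(backend, ids) {
  if (is.data.frame(backend)) {            # leaf: a plain table
    return(backend[backend$row_id %in% ids, ])
  }
  hit  <- lookup(backend$b2, ids)          # search the rbinded part first
  rest <- setdiff(ids, hit$row_id)         # then fall through to b1
  rbind(lookup(backend$b1, rest), hit)
}

b1 <- data.frame(row_id = 1:4, x1 = 1:4)
layered <- list(b1 = list(b1 = b1,
                          b2 = data.frame(row_id = 5:6, x1 = 5:6)),
                b2 = data.frame(row_id = 5:6, x1 = 15:16))

# Fetching the use rows 1:4 traverses every rbind layer (each returning
# nothing) before finally reaching b1.
lookup(layered, 1:4)
```

With k preprocessing PipeOps each contributing an rbind layer, retrieving the use rows performs k fruitless searches before touching the backend that actually holds them.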

sebffischer commented 7 months ago

Some possible solutions:

  1. Accept this inefficiency (I am not a huge fan)
  2. Hack our way around it (remove the internal cbinded backend containing the preprocessed use rows after $.train_task(), then rbind its data with the preprocessed test rows and $cbind() it again to the task) (I am also not a huge fan)
  3. Change the $.train_task() method everywhere (maybe (?))
  4. Somehow not $rbind() the test and use tasks, to avoid the problem (I think this should be considered)
mb706 commented 3 weeks ago

Now solved by #770