mlr-org / mlr3

mlr3: Machine Learning in R - next generation
https://mlr3.mlr-org.com
GNU Lesser General Public License v3.0

option to "flatten" backends #942

Open mb706 opened 1 year ago

mb706 commented 1 year ago

Apparently we get some overhead when mlr3pipelines builds tasks with many nested DataBackendCbinds. One way to fix this would be an option to "flatten" cbinded tasks. Suggested interface:

Task$flatten(force = FALSE)  # default

creates a task with a single DataBackendDataTable, unless this is for some reason a bad idea, e.g. when a backend is a database backend. (A DataBackend class would need to report whether flattening is a "bad idea", possibly with an active binding; e.g. a database backend could say flattening is okay if the size is less than X MB.)

Setting force = TRUE, on the other hand, should always flatten the task, equivalent to creating a new task from task$data().
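For reference, the force = TRUE behaviour can already be approximated by hand with existing mlr3 calls. This is only a rough sketch, not the proposed $flatten() method: task$data() returns just the active rows and the feature/target columns, so any other column roles and the original row ids are lost in the copy.

```r
library(mlr3)
library(data.table)

# Start from a built-in task and cbind an extra column onto it;
# this wraps the original backend in a cbind backend.
task = tsk("iris")
task$cbind(data.table(noise = seq_len(task$nrow)))
print(class(task$backend)[1])

# "Forced flattening" by hand: materialize the data and build a fresh task
# whose backend is a single in-memory table.
flat = as_task_classif(task$data(), target = task$target_names, id = task$id)
print(class(flat$backend)[1])
```

Whether $flatten() should preserve roles and row ids beyond what this copy does is one of the open design questions.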

Example: a TaskClassif that consists of two cbinded data.tables, which were in turn cbinded with a database backend (abbreviating DataBackend as DB):

                TaskClassif
                  |
               DBCbind
              /       \
         DBCbind      DBDataBase
        /       \ 
 DBDataTable DBDataTable

$flatten(force = FALSE):

                TaskClassif
                  |
               DBCbind
              /       \
     DBDataTable      DBDataBase

$flatten(force = TRUE):

               TaskClassif
                  |
               DBDataTable

We should consider whether mlr3pipelines should do this for all its output tasks by default.

Another question is whether this should be an in-place operation that swaps out a task's data backend, or whether it should create a new task.

Another question is what to do with columns that do not have any column role. Maybe a good default would be to drop backends that do not provide columns that have a role (and are therefore ignored in many cases).
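To make the "columns without a role" case concrete, here is a small sketch that lists such columns for an existing task, using only the current public fields ($col_roles, $backend$colnames, $backend$primary_key). How $flatten() would act on this information is not decided; the variable names are just for illustration.

```r
library(mlr3)

task = tsk("iris")
# Deselect two features; their columns stay in the backend but lose their role.
task$select(c("Petal.Length", "Petal.Width"))

# All columns that still have some role (feature, target, ...).
with_role = unique(unlist(task$col_roles, use.names = FALSE))

# Backend columns that have no role and are not the primary key.
unused = setdiff(task$backend$colnames, c(with_role, task$backend$primary_key))
print(unused)
```

A backend whose columns all end up in `unused` would be a candidate for being dropped during flattening.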

Maybe we would want a DataBackendMultiCBind that can cbind multiple sources, so that even a task with many different database backends is at most one level deep after flattening. The $flatten(force = FALSE) operation would have to check, for each column, whether it comes from a data backend that reports it does not want to be flattened. There should be a method in DataBackend that does this recursively. $flatten() would then construct the desired DataBackendMultiCBind.
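A toy model of that recursion, in base R and without any mlr3 classes, might look like the following. Everything here is hypothetical: `leaf()`, `cbind_node()`, the `flattenable` flag and `flatten()` are stand-ins for whatever DataBackend would actually expose, and duplicate column names or row matching are ignored.

```r
# Toy stand-ins for backends; none of these names are mlr3 API.
leaf = function(data, flattenable = TRUE) {
  list(kind = "leaf", data = data, flattenable = flattenable)
}
cbind_node = function(...) list(kind = "cbind", children = list(...))

# Recursively collect all leaves of a (possibly nested) cbind tree, left to right.
collect_leaves = function(node) {
  if (node$kind == "leaf") return(list(node))
  do.call(c, lapply(node$children, collect_leaves))
}

# Merge every leaf that may be flattened into one in-memory leaf and keep
# the remaining leaves as siblings, so the result is at most one level deep.
# Note: the merged leaf is placed first, so column order may change.
flatten = function(node, force = FALSE) {
  leaves = collect_leaves(node)
  mergeable = vapply(leaves, function(l) force || l$flattenable, logical(1))
  out = leaves[!mergeable]
  if (any(mergeable)) {
    merged = do.call(cbind, lapply(leaves[mergeable], function(l) l$data))
    out = c(list(leaf(merged)), out)
  }
  if (length(out) == 1L) out[[1L]] else list(kind = "cbind", children = out)
}

# The example tree from above: two in-memory tables plus one "database" leaf.
tree = cbind_node(
  cbind_node(leaf(data.frame(a = 1:3)), leaf(data.frame(b = 4:6))),
  leaf(data.frame(c = 7:9), flattenable = FALSE)
)
soft = flatten(tree)                # multi-cbind of merged table + database leaf
hard = flatten(tree, force = TRUE)  # single in-memory leaf
```

In this sketch `soft` corresponds to the $flatten(force = FALSE) picture and `hard` to the $flatten(force = TRUE) picture; a real implementation would ask each backend whether it wants to be flattened instead of reading a flag.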