mlr-org / mlr3

mlr3: Machine Learning in R - next generation
https://mlr3.mlr-org.com
GNU Lesser General Public License v3.0

feat: validation task #983

Closed sebffischer closed 1 month ago

sebffischer commented 7 months ago

TODOs:

This PR makes it possible to preprocess the test rows, which can be used e.g. for early stopping by xgboost, inside a graph learner. As a result, early stopping with xgboost now works in a graph learner.

Some explanations for the changes:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
task
#> (150 x 5): Iris Flowers
#> * Target: Species
#> * Properties: multiclass
#> * Features (4):
#>   - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width

# move rows 1-10 out of the task and into its test task
task$divide(1:10, "test")
task
#> (140 x 5): Iris Flowers
#> * Target: Species
#> * Properties: multiclass
#> * Features (4):
#>   - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width
#> * Test Task: (10x5)

task$test_task
#> (10 x 5): Iris Flowers
#> * Target: Species
#> * Properties: multiclass
#> * Features (4):
#>   - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width

po_pca = po("pca")

# training the PipeOp also preprocesses the attached test task
taskout = po_pca$train(list(task))[[1L]]
taskout$test_task
#> (10 x 5): Iris Flowers
#> * Target: Species
#> * Properties: multiclass
#> * Features (4):
#>   - dbl (4): PC1, PC2, PC3, PC4
```



<sup>Created on 2024-02-16 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

* `PipeOp`s always preprocess the `test_task` when it is provided. A `GraphLearner`, however, should only preprocess the test rows when they are actually needed; otherwise this is unnecessary computation (they are currently not used in the learner's `$predict()` step). To communicate this, the `"uses_test_task"` property was introduced.
  Because the `"uses_test_task"` property is not fixed (its presence depends e.g. on whether the `early_stopping_set` parameter of XGBoost is set to `"test"` or `"none"`), it was necessary to add the ability to generate a learner's properties dynamically. This is done via the private method `.contingent_properties()`, which learners can override. In the `Learner` base class, this method has to be set to a function returning `character(0)` (and not to `NULL`) because of a bug in `R6`.
* Retired interface: We previously had the API `task$set_row_roles(1, "test")` or `task$set_row_roles(1, "holdout")`.
  With the new `$test_task` field, there would have been two ways to achieve something similar, which would have made code messy and the interface confusing. For this reason, both the `holdout` and `test` row roles were removed.
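
The idea behind `.contingent_properties()` can be illustrated with a small, self-contained `R6` sketch. This is not the code from this PR: the classes `ToyLearner` and `ToyXgboost`, the `param_vals` field, and the `.static_properties` field are hypothetical stand-ins, and only the method name `.contingent_properties()`, the `early_stopping_set` parameter, and the `"uses_test_task"` property are taken from the description above.

```r
library(R6)

# Hypothetical stand-in for a learner base class (not the mlr3 implementation).
ToyLearner = R6Class("ToyLearner",
  public = list(
    param_vals = NULL,
    initialize = function(properties = character(0), param_vals = list()) {
      private$.static_properties = properties
      self$param_vals = param_vals
    }
  ),
  active = list(
    # properties = fixed properties + properties that depend on the current state
    properties = function(rhs) {
      if (!missing(rhs)) stop("properties is read-only")
      unique(c(private$.static_properties, private$.contingent_properties()))
    }
  ),
  private = list(
    .static_properties = NULL,
    # the base class returns character(0), not NULL, as described above
    .contingent_properties = function() character(0)
  )
)

# Hypothetical learner whose use of the test task depends on a parameter.
ToyXgboost = R6Class("ToyXgboost",
  inherit = ToyLearner,
  private = list(
    .contingent_properties = function() {
      if (identical(self$param_vals$early_stopping_set, "test")) {
        "uses_test_task"
      } else {
        character(0)
      }
    }
  )
)

learner = ToyXgboost$new(
  properties = "multiclass",
  param_vals = list(early_stopping_set = "none")
)
learner$properties
#> [1] "multiclass"

learner$param_vals$early_stopping_set = "test"
learner$properties
#> [1] "multiclass"     "uses_test_task"
```

Computing the properties in an active binding means that a caller such as a graph learner always sees the properties matching the current parameter values, without the learner having to update a stored vector whenever a parameter changes.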

Removing the 'holdout' and 'test' row roles breaks some existing packages, so I have already opened pull requests in some of them (listed under step 1 below).
* [x] TODO: check whether I really got all packages (only checked those that I have locally available)

The general plan to merge this feature is to: 

1. Make releases for these PRs:
   * `mlr3learners`: https://github.com/mlr-org/mlr3learners/pull/288 (Xgboost, only dev and paramtest are failing)
   * `mlr3tuning`: https://github.com/mlr-org/mlr3tuning/pull/413 (holdout set is used)
   * `mcboost`: https://github.com/mlr-org/mcboost/pull/44 (vignette uses holdout set)
   * `mlr3fairness`: https://github.com/mlr-org/mlr3fairness/pull/74 (there is a bug that I did not cause)
   * `mlr3pipelines`: https://github.com/mlr-org/mlr3pipelines/pull/761/files (this is needed because of the way the `GraphLearner` sets its properties)

2. Merge this branch and make a release on CRAN
3. Implement the feature in pipelines and make a release from this branch: 
   * https://github.com/mlr-org/mlr3pipelines/pull/760
4. Make changes in `mlr3extralearners` and bump mlr3 dependency
5. Make a gallery post about this