mlr-org / mlr3

mlr3: Machine Learning in R - next generation
https://mlr3.mlr-org.com
GNU Lesser General Public License v3.0
947 stars 85 forks source link

Behavior of `$set_row_roles` changed #1175

Closed sebffischer closed 2 months ago

sebffischer commented 2 months ago
library(data.table)
library(mlr3oml)
library(mlr3)
library(mlr3pipelines)

oml = OMLData$new(42730)
data = oml$data

tsk = TaskRegr$new("communities_crime", data, target = "ViolentCrimesPerPop")
n = length(tsk$row_roles$use)
split = partition(tsk)
tsk$set_row_roles(split$train, add_to = "use")
length(tsk$row_roles$use) == length(split$train) + n
#> TRUE

Previously this did not cause duplicated ids.

This was probably caused by: https://github.com/mlr-org/mlr3/commit/bb4dd8d3fa63d288f2123aa2cffbee7104a27d3c

be-marc commented 2 months ago

Prior to https://github.com/mlr-org/mlr3/commit/bb4dd8d3fa63d288f2123aa2cffbee7104a27d3c, task$set_row_roles(split$train, "use") silently dropped duplicated ids. In fact this call did not change the "use" set at all. The ids in split$train where silently dropped because they were already in "use". Now mlr3 adds split$train to "use" and we get duplicated ids but this is the rights behavior. I think https://github.com/mlr-org/mcboost/blame/93f4514549589a46011ac10775fb8fd426485bdc/vignettes/mcboost_basics_extensions.Rmd#L524 was an user error. The author should have used $row_roles$use = split$train instead.