Open zecojls opened 6 months ago
Hey, sorry, I can't reproduce the issue. I created a clean environment with renv.
renv::init(bare = TRUE)
renv::install(c("mlr3@0.17.1", "mlr-org/mlr3extralearners@*release", "randomForest"))
Your code runs without any problems.
library(mlr3)
library(mlr3extralearners)
task = tsk("boston_housing")
task$select(c("age", "b", "chas"))
learner = lrn("regr.randomForest", importance = "mse")
learner$train(task)
rr = resample(task, learner, rsmp("cv", folds = 10))
Session info.
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 23.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] mlr3extralearners_0.7.1 mlr3_0.17.1
loaded via a namespace (and not attached):
[1] digest_0.6.33 backports_1.4.1 R6_2.5.1 codetools_0.2-19 randomForest_4.7-1.1 lgr_0.4.4 parallel_4.3.1 RhpcBLASctl_0.23-42 palmerpenguins_0.1.1
[10] mlr3misc_0.13.0 parallelly_1.36.0 pak_0.7.1 future_1.33.1 renv_1.0.3 data.table_1.14.10 compiler_4.3.1 paradox_0.11.1 globals_0.16.2
My Kaggle kernel has R 4.0 and Ubuntu 20 installed by default. I'm not sure if I can change that. What do you recommend?
I can confirm that there is a bug on Kaggle. It is not the subsetting of the task and not the task itself. The error does not occur with regr.rpart but with regr.randomForest and regr.ranger. I cannot reproduce the bug on my local machine or in a rocker image with R 4.0.5. The error looks like mlr3 is not passing data to the predict function of the upstream packages. Such an error would definitely have been noticed in our unit tests. Yes, that is quite tricky now. We can't debug easily on Kaggle.
I believe the issue is this line in the randomforest learner:
This executes
task$data(cols = intersect(names(learner$state$data_prototype),
task$feature_names))
When I stop here, the learner's learner$state$data_prototype is NULL (this is the bug, see below), and, in modern R versions, the result of intersect() is also NULL, leading to the call task$data(cols = NULL), so all columns are returned.
However, in older R versions, intersect(NULL, <character>) is not NULL, it is character(0). This leads to task$data(cols = character(0)) being called, and ordered_features() in the line linked above therefore returns a 0-column data.table.
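The difference described above can be checked in plain R without mlr3. This is only a minimal sketch: `safe_feature_cols()` is a hypothetical helper written for illustration, not a function from mlr3 or mlr3extralearners, and it shows just one possible way to guard against a missing prototype.

```r
feature_names <- c("age", "b", "chas")

# names(NULL) is NULL, which mimics an unset learner$state$data_prototype.
prototype_names <- names(NULL)

# Per the discussion above, this is NULL in recent R versions and
# character(0) in older ones. Either way it has length 0.
cols <- intersect(prototype_names, feature_names)
print(R.version.string)
print(is.null(cols))

# Hypothetical guard (an assumption, not the library's code): fall back to
# all feature names when no prototype is available, so the result does not
# depend on the R version.
safe_feature_cols <- function(prototype, feature_names) {
  if (is.null(prototype)) {
    return(feature_names)
  }
  intersect(names(prototype), feature_names)
}

print(safe_feature_cols(NULL, feature_names))
print(safe_feature_cols(data.frame(b = 1, zn = 2), feature_names))
```

Running the first part on the Kaggle kernel and on a local machine should show whether the two R versions disagree on `is.null(cols)`.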
I don't know when this new behaviour of intersect() was introduced. It appears to be this diff, and this entry in the R 4.2.0 NEWS sounds like a match:
The set utility functions, notably intersect() have been tweaked to be more consistent and symmetric in their two set arguments, also preserving a common mode.
... although the timing does not seem to match. But somewhere between 4.1.2 and 4.2.0, I think. Too lazy to check.
Now to the bug in our code: I assume the problem is that resampling no longer sets the data_prototype, since this patch. resample() does not call the learner's train(), so data_prototype is not set.
(It may currently be unnecessary to set data_prototype in resampling, since the task remains the same, but this may change with the new holdout task thing that may be introduced. Also, we should make sure other places handle data_prototype being NULL correctly.)
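One way to test this assumption on an affected setup is to compare the state of a directly trained learner with the states of the learners kept by resample(). This is a rough diagnostic sketch, not a fix: it uses tsk("mtcars") and regr.rpart only because they need no extra packages, and it assumes that data_prototype is stored in learner$state as described above. The outcome depends on the installed mlr3 version, so no expected output is given.

```r
library(mlr3)

task <- tsk("mtcars")
learner <- lrn("regr.rpart")

# Direct training goes through the learner's train() method.
learner$train(task)
direct_is_null <- is.null(learner$state$data_prototype)

# Resampling: store the fitted learners so their state can be inspected.
rr <- resample(task, lrn("regr.rpart"), rsmp("cv", folds = 3),
  store_models = TRUE)
resampled_is_null <- vapply(
  rr$learners,
  function(l) is.null(l$state$data_prototype),
  logical(1)
)

print(direct_is_null)
print(resampled_is_null)
```

If the first value is FALSE and the others are TRUE, the prototype is missing only on the resampling path, which would match the explanation above.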
Hi, I'm using mlr3 on a Kaggle kernel and found issues with the resample function. The error message mentions some issues with data.table column selection and future.apply.
I'm currently able to use mlr3 v0.16.1 and the latest release of mlr3extralearners, but only while forcing data.table and future.apply not to upgrade by default (as they are dependencies of both).
Reproducible code:
Session info: