mlr-org / mlr3

mlr3: Machine Learning in R - next generation
https://mlr3.mlr-org.com
GNU Lesser General Public License v3.0

Why is mlr3 randomForest importance different from the randomForest package? #974

Closed slecee closed 10 months ago

slecee commented 10 months ago

Description: I expected the two methods to give the same importance values, but the results are not the same.

Reproducible example

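library(mlr3)
library(mlr3extralearners)  # provides the "classif.randomForest" learner
library(randomForest)

tasks = as_task_classif(iris, target = "Species")
learners = lrn("classif.randomForest", predict_type = "prob", importance = "gini")
set.seed(123, kind = "Mersenne-Twister")
split = partition(tasks)  # draws random numbers, so it advances the RNG state
split
a = learners$train(tasks, row_ids = split$train)
a$model$importance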

                  setosa  versicolor  virginica MeanDecreaseAccuracy MeanDecreaseGini
Petal.Length 0.318035851 0.266299298 0.30394373          0.291450470        29.166804
Petal.Width  0.343312702 0.285824160 0.24060252          0.287471876        29.279836
Sepal.Length 0.044516441 0.017425256 0.03575046          0.032833987         7.045327
Sepal.Width  0.007014524 0.009913653 0.00355260          0.006423498         1.783527

set.seed(123, kind = "Mersenne-Twister")
tmp <- randomForest(iris[split$train,1:4], 
                    iris$Species[split$train], 
                    importance = TRUE)
tmp[["importance"]]

                  setosa versicolor   virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length 0.028980891 0.01123579 0.041315479          0.027508393         6.729108
Sepal.Width  0.007498441 0.01019763 0.006658933          0.008441336         1.853704
Petal.Length 0.300187608 0.25654441 0.310950138          0.285067350        28.924225
Petal.Width  0.361163535 0.29720214 0.250040471          0.299030008        29.750160

be-marc commented 10 months ago

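You call partition() after setting the seed. partition() draws random numbers to sample the split, so it consumes part of the random stream and the RNG is in a different state by the time the learner is trained. The importance values are equal when you move the partition() call before set.seed().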
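
The effect can be reproduced with base R alone. The sketch below uses sample() as a stand-in for partition() and runif() as a stand-in for the random draws made during training. Both stand-ins are assumptions for illustration, not the calls that mlr3 or randomForest actually make.

```r
# Case 1: seed, then a random "split", then the "training" draws.
set.seed(123, kind = "Mersenne-Twister")
split_idx <- sample(150, 100)   # stand-in for partition(): consumes random numbers
draws_after_split <- runif(3)   # stand-in for the learner's random draws

# Case 2: seed immediately before the "training" draws.
set.seed(123, kind = "Mersenne-Twister")
draws_fresh_seed <- runif(3)

# Same seed, but the generator is in a different state in case 1.
identical(draws_after_split, draws_fresh_seed)   # FALSE

# Re-seeding right before the draws reproduces them exactly.
set.seed(123, kind = "Mersenne-Twister")
identical(runif(3), draws_fresh_seed)            # TRUE
```

Two model fits can only be expected to match if the RNG is in the same state at the moment each fit starts, which is why set.seed() is placed directly before each training call.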

library(mlr3extralearners)

tasks = as_task_classif(iris, target = "Species")
learners = lrn("classif.randomForest", predict_type = "prob", importance = "gini")
split = partition(tasks)
split
set.seed(123, kind = "Mersenne-Twister")
a = learners$train(tasks, row_ids = split$train)
a$model$importance

library(randomForest)

set.seed(123, kind = "Mersenne-Twister")
tmp <- randomForest(iris[split$train,1:4], 
                    iris$Species[split$train], 
                    importance = TRUE)

tmp[["importance"]]

slecee commented 10 months ago

The results still differ... [results]
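
When checking whether two importance matrices agree, note that the outputs in the first post list the features in a different row order, so a positional comparison would mix up features. Below is a minimal base R sketch that aligns by feature name before comparing. It uses the MeanDecreaseGini values copied from the first post; the variable names imp_mlr3 and imp_rf are illustrative only.

```r
# MeanDecreaseGini values copied from the outputs in the first post.
imp_mlr3 <- c(Petal.Length = 29.166804, Petal.Width = 29.279836,
              Sepal.Length = 7.045327, Sepal.Width = 1.783527)
imp_rf <- c(Sepal.Length = 6.729108, Sepal.Width = 1.853704,
            Petal.Length = 28.924225, Petal.Width = 29.750160)

# Reorder one vector to match the other's feature names before comparing.
imp_rf_aligned <- imp_rf[names(imp_mlr3)]

# Per-feature absolute difference between the two fits.
abs_diff <- abs(imp_mlr3 - imp_rf_aligned)
abs_diff

# TRUE only if every feature matches within all.equal()'s default tolerance.
isTRUE(all.equal(imp_mlr3, imp_rf_aligned))
```

With the objects from the thread, the same idea would presumably be all.equal(a$model$importance[rownames(tmp$importance), ], tmp$importance), assuming both objects are plain numeric matrices with feature names as row names.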