Open Kodiologist opened 5 years ago
I had an interesting further finding. The problem might lie in the predict function of autoxgboost. If I extract the hyperparameters using mlr::getHyperPars and fit a separate model with xgboost::xgboost, both the test and training errors go back to around 1.1, which looks right. I am glad I always extracted the hyperparameters... Strangely enough, I never noticed such a problem before, and indeed the difference is not obvious on some large datasets.
Following Kodi's code above, if we continue to run:
param_dart <- mlr::getHyperPars(m.axgb$final.learner)
set.seed(1234)
m.xgboost <- xgboost::xgboost(data = as.matrix(d[train, 1:3]),
                              label = d[train, "y"],
                              # use the full name `nrounds` rather than relying on partial matching of `$nround`
                              params = param_dart, nrounds = param_dart$nrounds,
                              verbose = TRUE, print_every_n = param_dart$nrounds)
xgb_pred <- predict(m.xgboost, as.matrix(d[!train, 1:3]), ntreelimit = param_dart$nrounds)
(rmse_xgb <- sqrt(mean((d[!train, "y"] - xgb_pred)^2)))
we get 1.193 as the test RMSE.
An extra issue is that to use 'dart' correctly, we need to pass the argument ntreelimit = param_dart$nrounds to the predict function; otherwise the results are inconsistent from call to call:
# And this is what happened if we do not set `ntreelimit`:
n0 <- 1000
rmse_xgb <- rep(NA, n0)
set.seed(1234)
for (i in seq_len(n0)) {
  xgb_pred <- predict(m.xgboost, as.matrix(d[!train, 1:3]))
  rmse_xgb[i] <- sqrt(mean((d[!train, "y"] - xgb_pred)^2))
}
summary(rmse_xgb)
Output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.193   1.193   1.236   1.254   1.290   1.658
After setting ntreelimit = param_dart$nrounds, we get 1.193 every time.
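As far as I understand the xgboost documentation, a DART booster applies dropout inside predict unless a tree limit is given, so each call evaluates a different subset of trees. The toy simulation below only illustrates that mechanism in base R. It is not xgboost's DART implementation, and every name and number in it (contrib, n_trees, rate_drop = 0.1) is a made-up assumption:

```r
# Toy illustration only: this is NOT xgboost's DART implementation.
set.seed(1)
n_obs   <- 200   # hypothetical test-set size
n_trees <- 50    # hypothetical number of boosted trees

# Each column holds one "tree's" additive contribution to the prediction.
contrib <- matrix(rnorm(n_obs * n_trees, sd = 0.2), n_obs, n_trees)
y <- rowSums(contrib) + rnorm(n_obs, sd = 0.5)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Prediction using every tree: the same result on every call.
predict_full <- function() rowSums(contrib)

# Prediction with random dropout: a different subset of trees on each call.
predict_dropout <- function(rate_drop = 0.1) {
  keep <- runif(n_trees) > rate_drop
  rowSums(contrib[, keep, drop = FALSE])
}

rmse_full <- replicate(20, rmse(y, predict_full()))
rmse_drop <- replicate(20, rmse(y, predict_dropout()))

range(rmse_full)  # minimum and maximum coincide
range(rmse_drop)  # minimum and maximum differ
```

The spread in rmse_drop plays the same role as the spread in the summary above, while rmse_full corresponds to the case where ntreelimit is set.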
This is a minor, separate issue, but I guess there is no way to pass an extra argument to autoxgboost's predict. And of course, not using dart won't correct the main problem above (the difference in predict output).
@ja-thomas, I am not very familiar with mlr... I am curious how the predict function is called when the object comes from autoxgboost?
Hi,
Sorry for the late reply, I was away for a few days.
Thanks for the issue. This is indeed very surprising. I found the problem: we call cpoDropConstants, which seems to drop features that are far from constant. This is a bug in mlrCPO.
For now I'll drop this step from the preprocessing until it is fixed in mlrCPO.
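In the meantime, one way to check whether a preprocessing step removed features it should have kept is to list the columns that really are constant and compare them with what was dropped. Below is a minimal sketch in base R. The helper is_constant, its tolerance, and the example data frame are hypothetical, and the code does not reproduce the logic of cpoDropConstants:

```r
# Hypothetical helper (not part of mlrCPO): TRUE if a column has no variation.
is_constant <- function(x, abs.tol = 1e-8) {
  if (is.numeric(x)) {
    r <- range(x, na.rm = TRUE)
    (r[2] - r[1]) <= abs.tol
  } else {
    length(unique(x[!is.na(x)])) <= 1
  }
}

# Made-up data: two varying features, one constant numeric, one constant factor.
set.seed(1234)
d_check <- data.frame(
  x1 = rnorm(100),
  x2 = runif(100),
  x3 = rep(5, 100),
  f  = factor(rep("a", 100))
)

constant_cols <- names(d_check)[vapply(d_check, is_constant, logical(1))]
constant_cols
```

Any feature missing after preprocessing that does not appear in constant_cols was dropped even though it varies.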
I'm not familiar with the statistical approach taken by mlrMBO, so excuse me if I'm missing something. Anyway, I was going to ask a question about overfitting in autoxgboost, but it looks like I actually have an underfitting problem. Below is a simple example using DART and mostly default settings. Training and test error for the autoxgboost model hover near the SD and are much worse than those of linear regression. Increasing the number of iterations to 500 didn't seem to help. What am I doing wrong?
Output:
CCing my cow-orkers: @allanjust, @liuyanguu
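For readers wondering why an error "near the SD" signals underfitting: a model that always predicts the training mean has a test RMSE of roughly sd(y), so a fitted model scoring about the same has learned essentially nothing from the features. The sketch below shows the comparison in base R on simulated data. The data-generating process and the 400/100 split are assumptions made for illustration and are not the data from the original report:

```r
# Simulated data for illustration only (not the data from the report).
set.seed(1234)
n <- 500
d_sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d_sim$y <- 2 * d_sim$x1 - d_sim$x2 + rnorm(n)  # x3 is pure noise
train_sim <- seq_len(n) <= 400                 # logical training indicator

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Baseline: always predict the training mean; its RMSE is close to sd(y).
rmse_mean <- rmse(d_sim$y[!train_sim], mean(d_sim$y[train_sim]))

# Linear regression on the same split.
fit <- lm(y ~ x1 + x2 + x3, data = d_sim[train_sim, ])
rmse_lm <- rmse(d_sim$y[!train_sim], predict(fit, newdata = d_sim[!train_sim, ]))

c(sd_y = sd(d_sim$y[!train_sim]), mean_only = rmse_mean, lm = rmse_lm)
```

A tuned model whose test RMSE sits near mean_only rather than near lm is behaving like the mean-only baseline.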