ja-thomas / autoxgboost

autoxgboost - Automatic tuning and fitting of xgboost

Poor performance on a simple dataset #62

Open Kodiologist opened 5 years ago

Kodiologist commented 5 years ago

I'm not familiar with the statistical approach taken by mlrMBO, so excuse me if I'm missing something. Anyway, I was going to ask a question about overfitting in autoxgboost, but it looks like I actually have an underfitting problem. Below is a simple example using DART and mostly default settings. Training and test error for the autoxgboost model hover near the SD of the outcome and are much worse than those of linear regression. Increasing the number of iterations to 500 didn't seem to help. What am I doing wrong?

library(autoxgboost)
library(mlr)
library(ParamHelpers)

set.seed(456)
xgb.threads = 10

# Search space for tuning xgboost's DART booster.
autoxgbparset.dart = makeParamSet(
    makeNumericParam("eta", lower = 0.01, upper = 0.2),
    makeNumericParam("gamma", lower = -7, upper = 6, trafo = function(x) 2^x),
    makeIntegerParam("max_depth", lower = 3, upper = 20),
    makeNumericParam("colsample_bytree", lower = 0.5, upper = 1),
    makeNumericParam("colsample_bylevel", lower = 0.5, upper = 1),
    makeNumericParam("lambda", lower = -10, upper = 10, trafo = function(x) 2^x),
    makeNumericParam("alpha", lower = -10, upper = 10, trafo = function(x) 2^x),
    makeNumericParam("subsample", lower = 0.5, upper = 1),
    makeDiscreteParam("booster", values = "dart"),
    makeDiscreteParam("sample_type", values = c("uniform", "weighted")),
    makeDiscreteParam("normalize_type", values = c("tree", "forest")),
    makeNumericParam("rate_drop", lower = 0, upper = 1),
    makeNumericParam("skip_drop", lower = 0, upper = 1),
    makeLogicalParam("one_drop"))

# Simulated data: y depends linearly on x2 and nonlinearly on x3; x1 is noise.
N = 2000
d = transform(data.frame(
    x1 = rnorm(N),
    x2 = rnorm(N),
    x3 = rnorm(N)),
    y = 2*x2 + (abs(x3) < 1) + rnorm(N))
# First half of the rows for training, second half for testing.
train = (1 : N) <= 1000

m.lm = lm(y ~ ., data = d[train,])
m.axgb = autoxgboost(
    task = makeRegrTask(target = "y", data = d[train,]),
    measure = rmse,
    par.set = autoxgbparset.dart,
    design.size = 30L,
    nthread = xgb.threads)

# Root-mean-square error.
f = function(a, b) sqrt(mean((a - b)^2))
print(t(data.frame(
    SD = sd(d[!train, "y"]),
    perfect.train = f(d[train, "y"],
        with(d[train,], 2*x2 + (abs(x3) < 1))),
    perfect.test = f(d[!train, "y"],
        with(d[!train,], 2*x2 + (abs(x3) < 1))),
    lm.train = f(d[train, "y"], predict(m.lm)),
    lm.test = f(d[!train, "y"], predict(m.lm, newdata = d[!train,])),
    axgb.train = f(d[train, "y"],
        predict(m.axgb, newdata = d[train,])$data$response),
    axgb.test = f(d[!train, "y"],
        predict(m.axgb, newdata = d[!train,])$data$response))))

Output:

                  [,1]
SD            2.258125
perfect.train 1.001122
perfect.test  1.021544
lm.train      1.119062
lm.test       1.137798
axgb.train    2.114853
axgb.test     2.234926

CCing my coworkers: @allanjust, @liuyanguu

liuyanguu commented 5 years ago

I have an interesting further finding. The problem might lie in autoxgboost's predict function. If I extract the tuned parameters with mlr::getHyperPars and fit a separate xgboost::xgboost model with them, both the training and testing errors go back to around 1.1, which looks right. I am glad I always extracted the hyperparameters... Strangely enough, I never noticed this problem before; indeed, the difference is not obvious on some large datasets.

Following Kodi's code above, if we continue to run:

param_dart <- mlr::getHyperPars(m.axgb$final.learner)
set.seed(1234)
m.xgboost <- xgboost::xgboost(data = as.matrix(d[train, 1:3]),
                              label = d[train, "y"],
                              params = param_dart, nrounds = param_dart$nrounds,
                              verbose = TRUE, print_every_n = param_dart$nrounds)
xgb_pred <- predict(m.xgboost, as.matrix(d[!train, 1:3]), ntreelimit = param_dart$nrounds)
(rmse_xgb <- sqrt(mean((d[!train, "y"] - xgb_pred)^2)))

we get 1.193 as the test RMSE.

An extra issue: to use DART correctly, we need to pass the argument ntreelimit = param_dart$nrounds to the predict function; otherwise the results are inconsistent, because DART also performs random tree dropout at prediction time unless a tree limit is given:

# This is what happens if we do not set `ntreelimit`:
n0 <- 1000
rmse_xgb <- rep(NA, n0)
set.seed(1234)
for (i in 1:n0){
  xgb_pred <- predict(m.xgboost, as.matrix(d[!train, 1:3]))
  rmse_xgb[i] <- sqrt(mean((d[!train, "y"] - xgb_pred)^2))
}
summary(rmse_xgb)

Output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.193   1.193   1.236   1.254   1.290   1.658

After setting ntreelimit = param_dart$nrounds, we always get 1.193. This is a minor side issue, but I guess there is no way to pass an extra argument through autoxgboost's predict. And of course, not using DART does not fix the main problem above (the discrepancy in the predict output).
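
A possible workaround sketch: if mlr's xgboost learner exposes ntreelimit as a predict-stage hyperparameter (I have not verified this; treat it as an assumption), we could set it on the tuned learner and retrain through mlr, which would then apply the limit at prediction time:

# Hypothetical workaround, assuming `ntreelimit` exists as a predict-stage
# parameter of mlr's regr.xgboost learner.
lrn.fixed <- setHyperPars(m.axgb$final.learner, ntreelimit = param_dart$nrounds)
mod.fixed <- mlr::train(lrn.fixed, makeRegrTask(target = "y", data = d[train, ]))
pred.fixed <- predict(mod.fixed, newdata = d[!train, ])
sqrt(mean((d[!train, "y"] - pred.fixed$data$response)^2))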

liuyanguu commented 5 years ago

@ja-thomas, I am not very familiar with mlr... I am curious: how is the predict function called when the object is an autoxgboost result?
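
My own guess at the dispatch, written out as a sketch (I am assuming autoxgboost registers an S3 method roughly like this, and that the result object stores the fitted mlr model as final.model):

# Presumed shape of the method (names are assumptions, not autoxgboost's
# actual source): mlr's predict would apply the learner's preprocessing
# CPOs to newdata before calling xgboost's own predict on the booster.
predict.AutoxgbResult <- function(object, newdata, ...) {
  predict(object$final.model, newdata = newdata, ...)
}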

ja-thomas commented 5 years ago

Hi,

Sorry for the late reply, I was away for a few days.

Thanks for the issue. This is indeed very surprising: I found the problem to be that we call cpoDropConstants, which seems to drop features that are far from constant. This is a bug in mlrCPO.

For now I'll drop this step from the preprocessing until it is fixed in mlrCPO.

See here: https://github.com/mlr-org/mlrCPO/issues/59
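
A quick way to check this on the example above (a sketch; it applies mlrCPO's cpoDropConstants with its default settings directly to the task from Kodi's code):

library(mlrCPO)

# Apply the preprocessing step in isolation and compare feature sets;
# with the bug, features that are far from constant disappear.
task <- makeRegrTask(target = "y", data = d[train, ])
getTaskFeatureNames(task)                          # "x1" "x2" "x3"
getTaskFeatureNames(task %>>% cpoDropConstants())  # fewer names would confirm the bug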