Open giuseppec opened 5 years ago
I would be happy with replacing this line
https://github.com/ja-thomas/autoxgboost/blob/b64048e603751bcba9b6e212c775baff8ababccb/R/autoxgboost.R#L171
with `crossval(lrn, task.train, measures = measure)$aggr`.
And yes, I would ignore the `task.test` data here completely (the data on which the early stopping is based). But maybe it is better to let the user decide whether they really want to do this. Or do you see any other problem here?
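A minimal sketch of what that replacement could look like, assuming the `mlr` package is attached. `iris.task` and `classif.rpart` are stand-ins chosen only to keep the example small; they are assumptions and not what autoxgboost builds internally. `measures` is passed by name because the third positional argument of `crossval` is `iters`.

```r
library(mlr)

# Stand-ins (assumptions): autoxgboost creates its own task.train and xgboost learner.
task.train = iris.task
lrn = makeLearner("classif.rpart")
measure = mmce

set.seed(1)
# 3-fold stratified CV on the training task only; task.test is never touched.
res = crossval(lrn, task.train, iters = 3L, stratify = TRUE,
  measures = measure, show.info = FALSE)

# Aggregated test performance, usable as the scalar objective value for mbo.
y = res$aggr
print(y)
```

The trade-off discussed in this thread still applies: each objective evaluation now fits the learner three times instead of once.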
The main idea is that no resampling should be necessary, so that xgboost can use the full parallelism of the system. But I see the point that there are cases in which this would be useful.
This is usually how it is done; otherwise a lot of noise is added. I experimented on some datasets to see how bad the overfitting is, but I couldn't directly find (or create artificially) any "overtuning" on the holdout data. In general this is something I'm quite interested in improving, but I first need to find cases where it is actually a problem.
As far as I can see, the `autoxgboost` function internally uses a hard-coded holdout for the objective function within the mbo tuning. 1) Wouldn't it be cool if users could also specify their own objective here? For example, I want to use 3-fold CV (or stratified CV) instead of the hard-coded holdout. 2) Currently, mbo seems to use the same test set in each iteration, as the resample instance (e.g. the test splits) is computed outside of the objective function. This way I am not able to use different test splits in each iteration, right? Won't mbo start to overfit to those holdout test splits at some point?
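The difference between the two setups in point 2 can be sketched with plain `mlr` calls. This is an illustration under assumptions, not autoxgboost's code: `iris.task`, `classif.rpart`, and the helper names `obj.fixed` and `obj.fresh` are hypothetical stand-ins, and a real objective would also take the hyperparameters proposed by mbo.

```r
library(mlr)

task = iris.task                      # stand-in task (assumption)
lrn = makeLearner("classif.rpart")    # stand-in learner (assumption)
desc = makeResampleDesc("Holdout", split = 0.7)

set.seed(1)
# Fixed splits: the instance is created once, outside the objective,
# so every evaluation is scored on the same holdout set.
rin = makeResampleInstance(desc, task)
obj.fixed = function() {
  resample(lrn, task, rin, measures = mmce, show.info = FALSE)$aggr
}

# Fresh splits: passing the description lets resample() draw a new
# train/test split on every call.
obj.fresh = function() {
  resample(lrn, task, desc, measures = mmce, show.info = FALSE)$aggr
}

fixed = replicate(5, obj.fixed())
fresh = replicate(5, obj.fresh())
print(fixed)
print(fresh)
```

With a fixed instance, repeated evaluations of the same configuration return the same value, so the tuner sees a noise-free but split-specific objective. With fresh splits the values can differ between calls, which is the extra noise mentioned in the reply above.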