szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

best boosting AUC? #15

Closed: szilard closed this issue 9 years ago

szilard commented 9 years ago

@tqchen @hetong007 I'm trying to get a good AUC with boosting for the largest dataset (n = 10M). Would be nice to beat random forests :)

So far I did some basic grid search https://github.com/szilard/benchm-ml/blob/master/3-boosting/0-xgboost-init-grid.R for n = 1M (not the largest dataset), and it seems like deeper trees, min_child_weight = 1 and subsample = 0.5 work well.

I'm now running https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R on n = 10M, just looping over max_depth = c(2,5,10,20,50), but it's been running for a while.

Any suggestions?

The smallest learning rate I'm using is eta = 0.01; any experience with smaller values?

PS: See results so far here: https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines
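
For reference, a minimal sketch of the kind of max_depth loop described above (not the linked benchmark script itself); dX_train and dX_test are assumed to be xgb.DMatrix objects built from the training and test sets:

library(xgboost)
library(ROCR)

for (md in c(2, 5, 10, 20, 50)) {
  md_model <- xgb.train(data = dX_train, nrounds = 1000,
                        params = list(objective = "binary:logistic", eval_metric = "auc",
                                      eta = 0.01, max_depth = md,
                                      min_child_weight = 1, subsample = 0.5))
  phat <- predict(md_model, newdata = dX_test)
  # AUC on the test set via ROCR
  auc <- performance(prediction(phat, getinfo(dX_test, "label")), "auc")@y.values[[1]]
  cat("max_depth =", md, " AUC =", auc, "\n")
}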

tqchen commented 9 years ago

One thing that might be interesting to try is to use integer encoding for the dates (it seemed I got a far better result with simply depth = 6).

tqchen commented 9 years ago

If no overfitting is happening, eta = 0.01 should be good enough. Another interesting thing to try, which I always do to optimize AUC, is to re-balance the weights. In particular, setting

scale_pos_weight = num_neg_example / num_pos_example

will re-balance the positive and negative weights, which usually works better for AUC. The effect is more significant on unbalanced datasets though, so I am not sure what will happen here.
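
A minimal sketch of this re-weighting, assuming dtrain is an xgb.DMatrix with 0/1 labels (illustrative values, not tuned settings):

library(xgboost)

y <- getinfo(dtrain, "label")
spw <- sum(y == 0) / sum(y == 1)   # num_neg_example / num_pos_example

bst <- xgb.train(data = dtrain, nrounds = 300,
                 params = list(objective = "binary:logistic", eval_metric = "auc",
                               eta = 0.1, max_depth = 6,
                               scale_pos_weight = spw))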

szilard commented 9 years ago

Well, I deliberately considered that "forbidden" :) See some discussion here: https://github.com/szilard/benchm-ml/issues/1

The reason is that I want the benchmark to be on a dataset with a mix of categorical and numeric features (with more categoricals), similar to an industry/business dataset. So I'm making day of the week, month etc. somewhat artificially categorical:

# treat the date-related columns as categorical (factors) rather than numeric
for (k in c("Month","DayofMonth","DayOfWeek")) {
  d_train[[k]] <- as.factor(d_train[[k]])
}

If I made them ordinal/numeric, those variables would dominate the prediction importance-wise and the dataset would have too many "numeric" features.

So the game between RF and boosting is on, in the sense that those variables need to stay categorical and no other feature engineering is allowed ;)
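
For xgboost, these factor columns still have to be expanded into a numeric matrix. A sketch of one common way to do that with a sparse model matrix, assuming the target column is dep_delayed_15min as in the airline data (not necessarily the exact encoding used in the benchmark scripts):

library(Matrix)

# one-hot encode the factors into a sparse design matrix (drop the intercept)
X_train <- sparse.model.matrix(dep_delayed_15min ~ . - 1, data = d_train)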

szilard commented 9 years ago

Re @tqchen's 2nd comment:

This dataset is pretty well balanced. What's very handy in xgboost (and missing from the other tools) is early stopping :) And I use it with eval_metric = "auc" :)
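
A minimal sketch of early stopping on AUC, assuming dtrain and dvalid are xgb.DMatrix objects; argument names differ slightly across xgboost versions (older R releases used early.stop.round):

library(xgboost)

bst <- xgb.train(data = dtrain, nrounds = 5000,
                 watchlist = list(valid = dvalid),
                 early_stopping_rounds = 50, maximize = TRUE,
                 params = list(objective = "binary:logistic", eval_metric = "auc",
                               eta = 0.01, max_depth = 10, min_child_weight = 1,
                               subsample = 0.5))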

I'll let you know when this run finishes: https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R

szilard commented 9 years ago

I might also try eta = 0.001, though eta = 0.01 is already painfully slow. Btw, is there any paper/result on decreasing the learning rate during training (for boosting)? Or even some "adaptive" learning rate, see e.g. Vowpal Wabbit for the linear case.

tqchen commented 9 years ago

I don't know if there is any theory on decreasing the learning rate. I see you set the subsample parameter; there is another one, colsample_bytree, which subsamples columns and usually makes the result less prone to overfitting and the run faster (set it to 0.5 or 0.3).

tqchen commented 9 years ago

Although GBM usually wins on complicated cases with more features, where many features play an important role, maybe this dataset is not such a case. Since the variables are made explicitly categorical, there are currently too few integer features.

szilard commented 9 years ago

Hm... A quick Google search doesn't bring up anything. Besides VW, there is simulated annealing in various contexts (e.g. neural nets) etc. This might be useful for boosting...

Yeah, colsample_bytree is needed for RF. I'll try it out for boosting, thanks :)

szilard commented 9 years ago

Yeah, I wish I had chosen a dataset with more columns... Anyway, my main focus here is to see which tools can run on 10M rows in decent time and with decent AUC, and the AUC for boosting is pretty close to RF.

hetong007 commented 9 years ago

Just for reference, I sometimes try this adaptive learning rate method: http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf . It is not implemented in xgboost.

szilard commented 9 years ago

Oh, now I remember reading this http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf which is implemented in H2O deep learning.

Btw, do you think this has been adapted for GBMs?

@hetong007: when you say "sometimes I try this...", what tool are you using? (Is it boosting/GBM or something else, like a linear model?)

tqchen commented 9 years ago

The nature of boosting is very different from a linear solver, which means adagrad may not be directly applicable here. Actually, even for deep learning, adagrad is not the best choice for a common convnet (maybe a safe choice).

However, it is actually straightforward to tweak the R/Python code of xgboost to implement a decaying learning rate without touching the C++ part, so maybe it is interesting to try.
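
A rough sketch of one way to decay eta from R without touching the C++ code, as an assumption rather than anything built into xgboost: keep training the same booster a few rounds at a time via xgb_model, lowering eta between blocks.

library(xgboost)

etas <- c(0.1, 0.05, 0.02, 0.01)   # hypothetical decay schedule
bst <- NULL
for (e in etas) {
  # continue training the previous booster with a smaller learning rate
  bst <- xgb.train(data = dtrain, nrounds = 250, xgb_model = bst,
                   params = list(objective = "binary:logistic", eval_metric = "auc",
                                 eta = e, max_depth = 10))
}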

hetong007 commented 9 years ago

Sometimes in my research I write small prototypes to compare, but basically for matrix factorization.

@tqchen I think currently we cannot get the gradient of each update from R/python, so at least the one I posted is not applicable.

szilard commented 9 years ago

Here are some options for decay (after a quick web search): http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40808.pdf I was doing some simulated-annealing-related stuff in physics 20 years ago ;)

Anyway, maybe it's easy to change it in the C++ level code in xgboost?

tqchen commented 9 years ago

There is a thing called a customized loss function in xgboost, which should do the job in R. Adding a learning rate is equivalent to scaling the h statistics :)
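
A hedged sketch of what scaling the h statistics could look like with a customized objective: ignoring the lambda regularisation term, inflating the hessian by 1/eta shrinks the leaf weights the same way a learning rate of eta would (split gains change too, so this is only approximately equivalent).

library(xgboost)

eta_now <- 0.05   # the learning rate to emulate at this stage of training

logreg_scaled <- function(preds, dtrain) {
  y <- getinfo(dtrain, "label")
  p <- 1 / (1 + exp(-preds))        # preds are raw margins for a custom objective
  grad <- p - y
  hess <- p * (1 - p) / eta_now     # scale h to mimic the learning rate
  list(grad = grad, hess = hess)
}

bst <- xgb.train(data = dtrain, nrounds = 100, obj = logreg_scaled,
                 params = list(eta = 1, max_depth = 10))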

tqchen commented 9 years ago

BTW, @szilard maybe it is interesting to do some feature importance analysis on the trees learnt; see this example.

I guess the result will have a few very important features
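
A minimal sketch of that importance analysis, assuming bst is a trained booster and X_train is the (one-hot encoded) training matrix:

library(xgboost)

imp <- xgb.importance(feature_names = colnames(X_train), model = bst)
head(imp)                 # Gain / Cover / Frequency per feature
xgb.plot.importance(imp)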

szilard commented 9 years ago

Yeah, I think I took a quick look at the variable importance in RF in H2O. There are only 8 variables though...

szilard commented 9 years ago

I'll have to look at custom loss functions, probably tomorrow...

szilard commented 9 years ago

@tqchen @hetong007 See some boosting results here (GBM beats RF now) https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines

tqchen commented 9 years ago

great, thanks @szilard

szilard commented 9 years ago

Here is Time/AUC for a few settings:

Common settings: n = 10M (dataset size), nround = 5000, max_depth = 20, eta = 0.01, min_child_weight = 1

subsample = 0.5                           Time: 50000s   AUC: 0.811
subsample = 1                             Time: 49000s   AUC: 0.805
subsample = 0.5, colsample_bytree = 0.5   Time: 35000s   AUC: 0.810