microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Cross validation early stopping #5683

Open segatrade opened 1 year ago

segatrade commented 1 year ago

Currently, cross-validation early stopping happens based on the mean across folds. But it seems more correct to use the minimum (worst) result from all folds at each iteration, if we want to choose num_iterations based on best_iteration and then train a model on the complete dataset after CV.

@Laurae2 also seems to discuss this here: https://sites.google.com/site/lauraeppx/xgboost/cross-validation

For example, suppose a 3-fold CV gives these per-fold accuracies: iteration 35: 0.9, 0.9, 0 (mean = 0.6); iteration 29: 0.59, 0.58, 0.57 (mean = 0.58). Iteration 29 seems the better choice of num_iterations for training a model on the complete set, even though the mean at iteration 35 is higher.
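
To make the comparison concrete, a quick numeric check of that example (plain NumPy, nothing LightGBM-specific):

```python
import numpy as np

# Per-fold accuracies from the example above (higher is better).
iter_35 = np.array([0.9, 0.9, 0.0])
iter_29 = np.array([0.59, 0.58, 0.57])

print(iter_35.mean(), iter_29.mean())  # 0.6 vs 0.58 -> the mean prefers iteration 35
print(iter_35.min(), iter_29.min())    # 0.0 vs 0.57 -> the worst fold prefers iteration 29
```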

Is there any way to change lgbm.cv from mean to min mode? Or do I have to write my own CV with the usual lgbm.train calls? Also, if I write my own: does lgbm.cv have performance benefits over calling lgbm.train several times that I could take advantage of? Does it load the data once or several times?
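
In case it helps, this is a minimal sketch of what I mean by "my own cv": manual K-fold with lgbm.train, recording each fold's evaluation history and then picking the iteration by the worst fold instead of the mean. The data, parameters, and fold setup are just placeholders; it assumes binary classification with binary_logloss (lower is better).

```python
import numpy as np
import lightgbm as lgbm
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

# Toy data and parameters, only as placeholders.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
params = {"objective": "binary", "metric": "binary_logloss", "verbosity": -1}
num_boost_round = 200

fold_histories = []  # per-fold list of per-iteration validation losses
for train_idx, valid_idx in KFold(n_splits=3, shuffle=True, random_state=42).split(X):
    train_set = lgbm.Dataset(X[train_idx], label=y[train_idx])
    valid_set = lgbm.Dataset(X[valid_idx], label=y[valid_idx], reference=train_set)
    evals = {}
    lgbm.train(
        params,
        train_set,
        num_boost_round=num_boost_round,
        valid_sets=[valid_set],
        valid_names=["valid"],
        callbacks=[lgbm.record_evaluation(evals)],
    )
    fold_histories.append(evals["valid"]["binary_logloss"])

scores = np.array(fold_histories)  # shape: (n_folds, num_boost_round)
# Mean-based choice (what lgbm.cv's early stopping uses) vs. worst-fold choice.
best_by_mean = int(np.argmin(scores.mean(axis=0))) + 1
best_by_worst = int(np.argmin(scores.max(axis=0))) + 1
print(f"best iteration by fold mean:  {best_by_mean}")
print(f"best iteration by worst fold: {best_by_worst}")
```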

3zhang commented 1 year ago

I don't think so. It's possible to have a fold whose error decreases monotonically but always stays higher than the other folds', while the other folds reach their minimums in early rounds. In that case, choosing the worst error will always set the best iteration to the total number of iterations.
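
For example (synthetic error curves, just to illustrate the failure mode):

```python
import numpy as np

iters = np.arange(1, 101)
# Fold 1: error keeps decreasing but always stays above the other folds.
fold1 = 0.5 + 1.0 / iters
# Folds 2 and 3: reach their minimum around rounds 20-25, then get worse.
fold2 = 0.3 + 0.002 * np.abs(iters - 20)
fold3 = 0.3 + 0.002 * np.abs(iters - 25)

errors = np.vstack([fold1, fold2, fold3])
print(np.argmin(errors.mean(axis=0)) + 1)  # mean picks an early round (25 here)
print(np.argmin(errors.max(axis=0)) + 1)   # worst fold picks the very last round (100)
```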