microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[New Feature] support "Random Rotation Ensembles" #353

Closed gugatr0n1c closed 7 years ago

gugatr0n1c commented 7 years ago

There is a nice research paper (it seems to be reasonably tested) about Random Rotation Ensembles: http://www.jmlr.org/papers/volume17/blaser16a/blaser16a.pdf

The idea is to apply some kind of matrix rotation before each tree is built. The paper discusses some matrix preprocessing concerns (numerical versus categorical values, scaling), but I believe we can assume the matrix is already prepared (there are several tools to do that).

The rotations can also be pre-generated so they do not slow down training.

It is tested mainly on random forests, but at the end it is briefly evaluated for boosting as well. Accuracy seems to be better, and thanks to some tricks the speed is not affected much, as mentioned in the paper.

It seems to be a good method to try (I mean for boosting, as an optional setting).
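For reference, the core operation in the paper is just multiplying the feature matrix by a random orthogonal matrix before fitting each tree. A minimal numpy sketch (not LightGBM code; the helper name `random_rotation` is my own) of how such a rotation can be drawn uniformly via QR decomposition:

```python
import numpy as np

def random_rotation(dim, seed=None):
    """Draw a random rotation matrix via QR decomposition of a Gaussian matrix.

    Fixing the signs of R's diagonal makes Q uniformly (Haar) distributed
    over the orthogonal group; flipping one column if needed forces
    det(Q) = +1 so Q is a proper rotation, not a reflection.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((dim, dim))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))      # make the factorization unique
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]        # force a proper rotation
    return Q

# Per-tree usage: rotate the (already numeric, scaled) feature matrix,
# then train the next tree on the rotated copy.
X = np.random.default_rng(0).standard_normal((100, 5))
Q = random_rotation(5, seed=42)
X_rot = X @ Q
```

Since the rotations are independent of the data, a pool of them can be pre-generated up front, as suggested above.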

Laurae2 commented 7 years ago

@gugatr0n1c Is it feasible with binned data, or should the user perform several matrix rotations, append them to the dataset, and then use subsampling as an alternative?

gugatr0n1c commented 7 years ago

@Laurae2 From my understanding, the core idea is pre-processing of the raw data, not the binned data. I am not sure what would happen if we did this on binned data - a linear transformation may not work nicely there.

I guess the better way is to do the matrix rotation first and then the binning - I mean from an accuracy perspective.

Laurae2 commented 7 years ago

@gugatr0n1c It would be extremely expensive to do a matrix rotation and then binning every time.

gugatr0n1c commented 7 years ago

@Laurae2 Yeah, I understand, though not in too much detail. So even with a GPU this would be costly? If so, we can close this...

Laurae2 commented 7 years ago

@gugatr0n1c Most of the time would be spent performing the matrix rotation and then re-binning, with the binning taking most of it.

For instance, on the reput dataset, it would require at least an additional 240 seconds per iteration just to redo the binning every iteration (compared to only 5-40 seconds per iteration for the model training itself).

Binning is also a very expensive process in xgboost, so expensive it may take hours: https://github.com/dmlc/xgboost/issues/2326#issuecomment-315290341
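To sketch why the cost repeats: the histogram bin boundaries are computed per feature on the raw data, and a rotation mixes every feature column, so the boundaries become stale and the whole binning pass must be redone before every tree. A toy numpy illustration (the `quantile_bin` helper is my own simplification, not LightGBM's actual binning code):

```python
import numpy as np

def quantile_bin(X, n_bins=255):
    """Naive histogram binning: map each feature column to integer bin
    indices using per-feature quantile boundaries. LightGBM normally does
    this once, up front, and reuses the binned data for every tree."""
    edges = np.quantile(X, np.linspace(0, 1, n_bins + 1)[1:-1], axis=0)
    binned = np.empty(X.shape, dtype=np.int32)
    for j in range(X.shape[1]):
        binned[:, j] = np.searchsorted(edges[:, j], X[:, j])
    return binned

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))

# Without rotation: bin once, reuse for every iteration.
binned_once = quantile_bin(X)

# With per-tree rotation: every column changes, so the quantile pass
# (sorting/quantiles over all n rows and d features) repeats each iteration.
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
binned_rotated = quantile_bin(X @ Q)
```

On a large dataset that repeated quantile pass dominates, which matches the per-iteration numbers above.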

gugatr0n1c commented 7 years ago

ok, I will close this, thx for info