Closed gugatr0n1c closed 7 years ago
@gugatr0n1c Is it feasible with binned data, or should the user perform several matrix rotations, append them to the dataset, and then use subsampling as an alternative?
@Laurae2 From my understanding, the core idea is pre-processing of the raw data, not the binned data. Not sure what happens if this is done on binned data - a linear transformation may not work nicely there.
I guess the better way is to do the matrix rotation first and then the binning - I mean from an accuracy perspective.
@gugatr0n1c It would be extremely expensive to do matrix rotation then binning every time.
@Laurae2 yeah, I understand... though not in much detail. So even with a GPU this will be costly? If so, we can close this...
@gugatr0n1c Most of the time would be spent performing the matrix rotation and then the binning, with the latter (binning) taking a lot of time.
For instance, on the reput dataset it would require at least an additional 240 seconds per iteration just to redo the binning every iteration (compared to only 5-40 seconds of model training per iteration).
Binning is also a very expensive process in xgboost, so expensive it may take hours: https://github.com/dmlc/xgboost/issues/2326#issuecomment-315290341
ok, I will close this, thx for the info
There is a nice research paper (it seems to be reasonably tested) about Random Rotation Ensembles: http://www.jmlr.org/papers/volume17/blaser16a/blaser16a.pdf
The idea is to apply some kind of matrix rotation before each tree is built. The paper has some ideas about matrix preprocessing (numerical versus categorical values, scaling), but I believe we can assume the matrix is ready (there are several tools to do that).
Rotations can also be pre-generated so they do not slow down training.
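To illustrate the idea, here is a minimal sketch (my own, not LightGBM's API or the paper's code) of pre-generating uniformly random rotation matrices via the QR decomposition of a Gaussian matrix, one per tree, so that training itself only pays for a matrix multiply:

```python
import numpy as np

def random_rotation(dim, rng):
    """Return a (dim, dim) uniformly random rotation matrix."""
    a = rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(a)
    # Sign correction so Q is uniformly distributed over orthogonal matrices
    q *= np.sign(np.diag(r))
    # Force determinant +1 (a proper rotation, not a reflection)
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(42)
n_trees, dim = 5, 3
# Pre-generate one rotation per tree, before training starts
rotations = [random_rotation(dim, rng) for _ in range(n_trees)]

X = rng.standard_normal((100, dim))
# Before building tree t, the raw feature matrix would be rotated:
X_rot = X @ rotations[0]
# Rotations are orthogonal, so inner products (and distances) are preserved
assert np.allclose(X @ X.T, X_rot @ X_rot.T)
```

Note this rotates the raw (unbinned) features, which is exactly why the binning cost discussed above would have to be paid again on every rotated copy.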
It is tested mainly for RF, but at the end it is briefly tested for boosting as well. Performance seems to be better, and as mentioned in the paper, thanks to some tricks the speed is not affected much.
Seems to be a good method to try (I mean for boosting, as an optional setting).