[New Feature] gcForest - Deep Forest: Towards An Alternative to Deep Neural Networks

coforfe commented 7 years ago

Hello,

Some days ago it was presented an alternative to deep neural networks based on a new concept based on deep random forest. Based on the maturity of ranger, it would be very nice to have this new approach as an alternative within ranger

References: Deep Forest: Towards An Alternative to Deep Neural Networks [https://arxiv.org/abs/1702.08835]

Thanks! Carlos.

hexhead commented 7 years ago

Has anyone seen an implementation? No software mentioned to replicate results.

thierrygosselin commented 7 years ago

LightGBM is getting there

Laurae2 commented 7 years ago

@hexhead You can try reproducing them using my implementation: https://github.com/Laurae2/Laurae - The issue it has is the model is massive at the end, alleviated using external model storage (to keep RAM usage constant).

If you don't get better performance than regular strong single models, it would be strange because it is a stacking ensemble model.

hexhead commented 7 years ago

@Laurae2 looks very nice! I will check it out.

Laurae2 commented 7 years ago

You can test deep forest on MNIST here also: https://github.com/Laurae2/Laurae/blob/master/demo/DeepForest_mnist.R (you will need to download the train/test CSVs separately) - if you want to train on all data, you need to write "N" when it asks for subsampling (there is a typo in the script message.

Performance is average here, because CNNs are able to learn better from fewer observations than typical non-linear models due to NNs being a mix of linear and non-linear models. More data and stride=1 should help Deep Forest (but explodes CPU time).

In addition, the speed is faster for CNN because I use Intel MKL. Also, the CNN I used was overpushed in hyperparameters (good parameters for CNN vs bad parameters on Cascade Forest / Multi-Grained Scanning / Deep Forest for maximum speed). If someone could test on all training data and report here, it would be great for comparing:

Ah, and yes, the file size explodes with such amount of models in Deep Forest (500+ MB, but I'm sure it is well compressible).

Laurae2 commented 7 years ago

You may find insanity results below on model size. See https://github.com/Microsoft/LightGBM/issues/331#issuecomment-288394696 or the text below (identical), just for Cascade Forest:

Here on Adult dataset. I'm also getting the same issue you described for Adult dataset: better RF out of the box (86.57%) than their results (86.17%). I'm sure Deep Forest/Boosting can push it even further easily (87.58%+).

Train/Test set: as defined on the Adult dataset (32,561 training, 16,281 testing)
k-Folds for Cascade Forest: 3
4 threads on i7-4600U (usage of hyperthreading for more speed)
xgboost: depth=6, eta=0.10, tree_method="hist", grow_policy="depthwise", auto-threshold for Accuracy
xgboost Random Forest: "infinite" (65535) depth, eta=1.00, subsample=0.632, colsample_bylevel=ceiling(sqrt(13))/13=4/13, 2000 trees, auto-threshold for Accuracy
Cascade Forest: 2 Random Forest + 2 Complete-Random Tree Forests, 1000 trees per forest, early stopping of 4 (stops at 5th consecutive performance loss), auto-threshold for Accuracy

Cascade Forest improvements (reported: mean without standard deviation for the forests):

Layer	F1 (RF1)	F2 (RF2)	F3 (CRTF1)	F4 (CRTF2)	Avg Forest	Perf.
Layer 1	86.3972%	86.4238%	78.7237%	77.9620%	85.3142%	=
Layer 2	86.4095%	86.3972%	86.4443%	81.9299%	86.5549%	+
Layer 3	86.5262%	86.4955%	86.6286%	83.9117%	86.6900%	Best
Layer 4	86.4976%	86.4750%	84.0305%	83.7930%	86.5856%	-
Layer 5	86.3972%	86.4198%	83.7930%	86.5651%	86.5057%	-
Layer 6	86.3645%	86.3849%	86.4566%	80.6769%	86.4996%	-
Layer 7	86.3993%	86.4300%	80.6769%	86.3993%	86.4627%	-
Layer 8	86.3993%	86.4218%	86.3440%	86.3542%	86.6163%	-

Model size:

Model	Size (bytes)	Size (better unit)
8 Layers	2,108,466,480 bytes	1.96GB
3 Layers	734,047,020 bytes	700MB

yes, the 8 layer model size is 1.96GB, not a typo.

Training time:

Model	Iterations	Accuracy	Time (s)	Perf.
Official paper Deep Forest	see official paper	86.17xx%	?s	6th
xgboost Mode: Random Forest	Full: 2000 iterations	86.5733%	62.001s	5th
xgboost Mode: Boosted Trees	Best: 134 iterations Train: 184 iterations	87.5868%	5.737s	1st
Cascade Forest	Best: 3 layers Train: 8 layers	86.6900%	1601.794s	4th
Cascade Forest Stack: Random Forest	Full: 2000 iterations	86.7023%	65.434s	3rd
Cascade Forest Stack: Boosted Trees	Best: 99 iterations Train: 149 iterations	87.2797%	5.983s	2nd

Boosting speed is not typo, it is really ~6 seconds, and it gives the best accuracy out of the box without even any parameter tuning.

mnwright commented 7 years ago

We probably won't add this to ranger. Please reopen if needed.

imbs-hl / ranger

[New Feature] gcForest - Deep Forest: Towards An Alternative to Deep Neural Networks #170