microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

https://lightgbm.readthedocs.io/en/latest/

MIT License

16.53k stars 3.82k forks source link

Support gcForest #331

Closed guolinke closed 7 years ago

guolinke commented 7 years ago

refer to https://arxiv.org/pdf/1702.08835.pdf

Welcome to contribute for this.

wxchan commented 7 years ago

Just read this paper. I am interested in looking into it this weekend.

wxchan commented 7 years ago

Let me take some notes here, and some discussions.

Cascade Forest input: m samples n features (1st level), m samples (n features + num_class) output: m samples num_class each level: num_forest num_trees

other detail: k-fold cv (ensemble)

Multi-Grained Scanning kinda like sliding windows + one level of Cascade Forest

So the basic thing we need to implement is one level of Cascade Forest.

Is there anything wrong?

Discussions:

There are two kind of forests: random forest and complete-random tree forest. Should we try to use gbdt to replace them or not? @guolinke
how do we know the sample distributions on each classes of one leaf?
seem it only works for classification task?

Some Chinese discussion: http://weibo.com/zhouzh2012 https://www.zhihu.com/question/56474891

voe09 commented 7 years ago

Just read this paper today. Feel glad if you could implement it in LightGBM.

I think, with bagging, replacing random forest with gbdt or dart might work well. But replacing complete-random tree forest would discourage the diversity, which may lead to converge too early. Would like to get the results after your implementation.
Why not develop the gcForest with one-vs-all classifier first? Might be better in large dataset as the multi-class decision tree would grow too deeply which would discourage the diversity.
The Multi-Grained Scanning part is similar to 1D and 2D convolution, would it be possible to use convolution to replace that part?

mglowacki100 commented 7 years ago

I think that complete-random tree forest could be replaced by random rotation, it also encourages diversity. Anyway, random rotation could be interesting itself, regardless gc-forest. Multi-grained scannig could be skipped at first, it is task specific. [http://www.ijicic.org/ijicic-11-02046.pdf]() [http://stats.lse.ac.uk/fryzlewicz/rre/rre.pdf]() [https://github.com/tmadl/sklearn-random-rotation-ensembles]()

Laurae2 commented 7 years ago

@wxchan here the implementation details and some answers to your questions (@guolinke may verify if I got stuff wrong or not), this is just stacking of random models:

Cascade Forest (a level of it)

Complete-Random Tree Forest = ensemble of Random Trees (number of forests parameter)
Random Tree split = take a random feature each time, grow indefinitely until there are only no more than 10 instances of the same class (or there is only one class at the end) (minimum number of instances parameter)
Random Tree Forest = take only sqrt(number of features) per tree (column sampling by tree parameter)
Split value used = Gini (but I would say using Information Criterion could be better for random forests) (potentially loss function parameter)
Number of features generated by a level = number of complete-random tree forest * number of classes, in case of regression this would be number of complete-random tree forest
Each Complete-Random Tree Forest is generated using cross-validation (stacking like in Kaggle competitions) and using average when you predict (parameter - input folds)
Each time a Cascade Forest is finished, compare to validation set. If the new Cascade Forest is worse than the previous one, terminate training immediately (bonus: early_stopping_rounds would be able to determine how many bad Cascade Forest needed to stop, and maybe prediction should give us the ability to choose how many forests to use?).

Parallelization should be done by tree I guess for maximum performance (one can build random trees fast).

When adding multiple levels, it is the same as stacking models over models but we are using random models instead of real models.

One can view a Cascade Forest as a cross-validated ensemble of Random Forest (with feature parameter set to pick only one feature for each split, and no bootstrap = no subsampling).

pseudo-code:

# indexing like in Python, e.g [i, j) = from i included to j excluded

data # this is our data
model # this is our model, in form model[Cascade][Forest][Fold][Iteration]
folds # this is our folds, we have n folds, in form folds[number] indicating what observation belong to which n-th fold

# Loop for each Cascade
for (cascade in 0:infinite) {

    # Loop for each Forest
    for (forest in 0:num_forest) {

        # Subsample / Colsample
        forest_features <- (0:(ncol(data)))[take sqrt(features) features only]

        # Cross-validation per fold for each forest
        for (crossvalidation in 0:max(folds)) {

            # Train a random tree
            for (iteration in 1:num_iterations) {

                model[cascade][forest][crossvalidation][iteration] = train a random tree on n-1 folds, only one random feature provided per split

            }

        }

    }

    # Change what we are assessing
    data = out of folds predictions

    # Evaluate metric
    some evaluation metric comparison on data, break if loss is higher than previously

}

# End
return(model)

Multi-Grained Scanning

Just a Cascade Forest + sliding features.

That one should be skipped, unless we really need it. We can create the architecture ourselves I think.

Boosted models are not appropriate for many levels because they would end up learning the same thing without additional diversity. I had such issue in Kaggle's Santander + Bosch competition where I did multi-level boosted tree forest (issue was because boosting - with xgboost random forest it was much better).

GBDT should not fully replace random tree forest (for diversity needs), we should give the user at least such splitting parameters:

Gini (Classification)
Information Criterion (Classification)
Logloss (Classification)
MSE (Class/Reg)
MAE (Class/Reg)
...
Custom metric (which can be implemented in R/Python, to minimize preferably, and do -1*{Gini or IC} so they are negative by doing instead to not have to perform specific comparisons to check if it is a minimization or maximization problem of the objective function, unlike the evaluation metric)

This would leave the user the freedom to choose the default (Gini) or custom ones. would also enable to use it for regression (for regression case, disable "10 instances of the same class" and use minimum number of samples in split instead).

However, one could use GBDT-type and RF-type but potentially additional diversity, but GBDT might kill RF diversity.

Therefore, to implement this we need to have Random Forest (issue #47 ) first. Once it is done, we only need to add special metrics (Gini, Information Criterion, etc).

The major problem is cross-validation, we should give the user the ability to specify the folds because it may not be trivial in many cases (time series, patient-based split, etc.).

wxchan commented 7 years ago

We don't need to do exactly like in paper as long as the performance is good.

I wrote some api scratch now (#333 ). I do some test, performance seems not good, maybe something goes wrong.

Is there a good implement of Complete-Random Tree Forest in python now?

Laurae2 commented 7 years ago

@wxchan what target metric are you using? If you use logloss or similar, as the predictions are uncalibrated, the performance is poor. Make sure to use a metric which does not require calibration (AUC, automated thresholding Accuracy, etc. for instance).

I don't think there are any Complete-Random Tree Forest implementations in Python (or they might be there under another name - you can attempt to use RandomForest from scikit-learn but it doesn't have the split per level feature.

In R, you can find my implementation of gcForest using xgboost here: https://github.com/Laurae2/Laurae

I'm seeing good results as long as we don't land in the calibration requirement (decision trees and Random Forest have poor calibration out of the box most of the times). Example of one of my logs using my own Cascade Forest (2x Random Forest, 2x Random Tree Forest) and logloss (on agaricus data - performance is poor because it's boosting 1 iteration, needs calibration, and not true decision trees for outputting predictions):

Layer 1, Forest 1: 0.508035+0.026157
Layer 1, Forest 2: 0.498261+0.014815
Layer 1, Forest 3: 0.644824+0.000366
Layer 1, Forest 4: 0.559145+0.000465
Layer 1, Average Forest: 0.552566+0.007046
Layer 2, Forest 1: 0.494204+0.012198
Layer 2, Forest 2: 0.498276+0.011435
Layer 2, Forest 3: 0.559145+0.000465
Layer 2, Forest 4: 0.576243+0.007612
Layer 2, Average Forest: 0.531967+0.003784
Layer 3, Forest 1: 0.492363+0.010569
Layer 3, Forest 2: 0.488281+0.003284
Layer 3, Forest 3: 0.576243+0.007612
Layer 3, Forest 4: 0.507230+0.001121
Layer 3, Average Forest: 0.516029+0.002694

With AUC, boosting scaling does not matter much as it works like real decision trees from that point. As we can see, Random Forest (even with boosting 1 iteration scaling issue) nails this perfectly already at the first layer:

Layer 1, Forest 1: 1.000000+0.000000
Layer 1, Forest 2: 1.000000+0.000000
Layer 1, Forest 3: 0.793687+0.000330
Layer 1, Forest 4: 0.932049+0.003083
Layer 1, Average Forest: 0.931434+0.000794
Layer 2, Forest 1: 1.000000+0.000000
Layer 2, Forest 2: 1.000000+0.000000
Layer 2, Forest 3: 0.932049+0.003083
Layer 2, Forest 4: 0.853329+0.001216
Layer 2, Average Forest: 0.946344+0.000956
Layer 3, Forest 1: 1.000000+0.000000
Layer 3, Forest 2: 1.000000+0.000000
Layer 3, Forest 3: 0.853329+0.001216
Layer 3, Forest 4: 0.875266+0.010141
Layer 3, Average Forest: 0.932149+0.002429
Layer 4, Forest 1: 1.000000+0.000000
Layer 4, Forest 2: 1.000000+0.000000
Layer 4, Forest 3: 0.914535+0.002435
Layer 4, Forest 4: 0.935235+0.000297
Layer 4, Average Forest: 0.962442+0.000593
Layer 5, Forest 1: 1.000000+0.000000
Layer 5, Forest 2: 1.000000+0.000000
Layer 5, Forest 3: 0.935235+0.000297
Layer 5, Forest 4: 0.971791+0.000338
Layer 5, Average Forest: 0.976756+0.000145

wxchan commented 7 years ago

@Laurae2 can you reproduce similar results to the paper with your implement?

Laurae2 commented 7 years ago

@wxchan I can't reproduce them because I run out of memory on my local computer (+ it is very long when data gets large), but I get very good results on MNIST.

I just tested MNIST with the following settings:

1000 first observations = training set
test set (10000 observations) = test set
3 folds for Cascade Forest / gcForest
Accuracy is on test set (Cascade Forest / gcForest is just taking max mean of probability classes on last layer for Accuracy)
1 Layer of Cascade Forest / gcForest is 4 Random Forest + 4 Complete-Random Tree Forests (I get nearly identical results with 2+2)
Multi-Grained Scanning is done using 1 Random Forest + 1 Complete-Random Tree Forest
Multi-Grained Scanning is done approximately using a stride of 2 for fast computation and only 1 layer because I run out of memory
Only 100 trees for Cascade Forest / gcForest because I'm always out of memory on my small laptop

The results are below:

Model	xgboost	Random Forest	Cascade Forest	gcForest
Information	depth = 6 eta = 0.10 iters = 250 (converge) Features = 784	base = xgboost trees = 200 Features = 784	4 Layers Trees = 100 Features = 784	Depth = 10 Stride = 2 Features = 784+1420 +1 Layer Cascade
Accuracy	90.53%	89.75%	89.95%	91.12%
Time (est.)	5+ min	1+ min	15+ min	1+ hour

wxchan commented 7 years ago

@Laurae2 1000 samples for training may be not enough. I train a single lightgbm model w/ the entire MNIST dataset w/o much parameter tuning, the accuracy is 98.06%. The code as follow:

import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score

X_train, y_train = load_svmlight_file('mnist', n_features=784)
X_test, y_test = load_svmlight_file('mnist.t', n_features=784)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {'objective': 'multiclass', 'num_class': 10, 'num_leaves': 31, 'learning_rate': 0.05, 'feature_fraction': 0.9, 'bagging_fraction': 0.8, 'bagging_freq': 5, 'verbose': 0}

gbm = lgb.train(params, lgb_train, num_boost_round=1000)
y_pred = np.argmax(gbm.predict(X_test), axis=1)
print('accuracy_score:', accuracy_score(y_pred, y_test))

Actually I didn't see much advantage on gcForest compared to the ensemble of lightgbm or xgboost. I might test on other datasets in paper when I have time. And I plan to wait the author releases their codes first or someone tests gcForest on datasets like ImageNet before further implementation.

wxchan commented 7 years ago

I plan to close my pr for now until author releases the codes. Glad to help if someone else would like to contribute this feature.

Laurae2 commented 7 years ago

@wxchan I can't train on 60K samples a Cascade Forest, I automatically run out of RAM on a 4GB laptop. I used only 1K sample to make it fast, because training a Cascade Forest is like training multiple lightgbm/xgboost which means it takes forever (it was already taking forever for 1K samples, so for 60K I wouldn't even have finished by today the Mutli-Grained Scanning). The paper authors used a large server with 2x Intel E5-2670 v3 (2x 12 cores). Cascade Forest / gcForest is a massive toll on CPU.

I used 1K only for low RAM usage, low CPU time, and keeping consistency with what I can do with the hardware I currently have. I know already I can train on all data using a single boosted model and get 98%+ (https://www.kaggle.com/c/digit-recognizer/leaderboard), which is not what I want to do.

gcForest is mostly better than LightGBM/xgboost because:

It is a meta-model (like the ones used in Kaggle competitions by top competitors)
It is a stacking ensemble (therefore, you are free to use the models you want - authors decided RF + CRTF, I would say it's OK and for users they should add xgboost/LightGBM on top of that, because it helps a a lot - see at the end of this post)
It is an ensemble of diverse models (so diverse that even poor Random Forest + CRTF are outperforming a single boosted model - but at a massive CPU cost)
It is not a single model like LightGBM/xgboost (the synergy of two models is usually better than a strong single model)

Therefore, getting 99% of MNIST with gcForest is by far not a hard task if you add xgboost/LightGBM on top of gcForest (you should get higher than the reported 98.96% accuracy)

I have put xgboost on top of my gcForest (1K samples only) and got 92.77% accuracy (+2.24% - and I had to interrupt it because it took too long, did not converge), which means it helps tremendously to stack models (which is exactly what happens in Kaggle for instance).

mglowacki100 commented 7 years ago

@Laurae2 could you share your code? I've laptop with 64GB so I could do more memory-intensive tests.

Laurae2 commented 7 years ago

@mglowacki100 Try this code for MNIST (make sure to read the code comments first):

# Load libraries
library(Laurae) #devtools::install_github("Laurae2/Laurae@99ad3c2")
library(data.table)
library(Matrix)
library(xgboost) # COMPILED FROM SOURCE ONLY

# Set working directory accordingly
setwd("D:/")
# Download MNIST data here: https://pjreddie.com/projects/mnist-in-csv/
train <- fread("mnist_train.csv")
test <- fread("mnist_test.csv")

label_train <- train$V1
label_test <- test$V1
train$V1 <- NULL
test$V1 <- NULL

# debug xgboost, mandatory
for (i in 1:784) {
  train[[i]] <- as.numeric(train[[i]])
  test[[i]] <- as.numeric(test[[i]])
}

# SUBSAMPLE DATA - 2000 SAMPLES ONLY - COMMENT TO NOT SUBSAMPLE
valid <- train[2001:60000, ]
train <- train[1:2000, ]
label_valid <- label_train[2001:60000]
label_train <- label_train[1:2000]

# Create folds - set it to larger like 5 but it gets slower
folds <- kfold(label_train, k = 3)

# Do Cascade Forest using xgboost Random Forest / Complete-Tree Random Forest behind the wheels
# 0.899+
model <- CascadeForest(training_data = train,
                       validation_data = test,
                       training_labels = label_train,
                       validation_labels = label_test,
                       folds = folds,
                       boosting = FALSE,
                       nthread = 4, # More threads if you feel so
                       cascade_lr = 1,
                       training_start = NULL,
                       validation_start = NULL,
                       cascade_forests = c(rep(8, 4), 0), # c(rep(4, 2), 0) should be enough
                       cascade_trees = 100, # Set this to much higher like 1000 (cf official paper)
                       cascade_rf = 4, # If you changed cascade_forests, change this value accordingly
                       objective = "multi:softprob",
                       eval_metric = Laurae::df_acc,
                       multi_class = 10,
                       early_stopping = 4, # Keep it otherwise if you make it longer it will take forever to stop
                       maximize = TRUE,
                       verbose = TRUE,
                       low_memory = FALSE,
                       essentials = TRUE,
                       garbage = TRUE)

# Now compare to xgboost
dtrain <- xgb.DMatrix(data = Laurae::DT2mat(train), label = label_train)
dtest <- xgb.DMatrix(data = Laurae::DT2mat(test), label = label_test)
gc()

# [250] train-merror:0.000000   test-merror:0.094700
# 0.905300 accuracy
gc()
set.seed(11111)
model2 <- xgb.train(params = list(nthread = 4, # More threads if you feel so
                                  eta = 0.10,
                                  max_depth = 6,
                                  booster = "gbtree",
                                  tree_method = "hist",
                                  grow_policy = "depthwise"),
                    objective = "multi:softprob",
                    num_class = 10,
                    eval_metric = "merror",
                    nrounds = 1000000,
                    early_stopping_rounds = 50,
                    data = dtrain,
                    watchlist = list(train = dtrain, test = dtest),
                    verbose = 1)

# Try with Multi-Grained Scanning for gcForest
library(plyr)
create_progress_bar(name = "win") # Get rid of progress bar if you don't like, or set to text
new_train <- plyr::alply(train, 1, function(x) {matrix(x, nrow = 28, ncol = 28)}, .progress = "win")
new_test <- plyr::alply(test, 1, function(x) {matrix(x, nrow = 28, ncol = 28)}, .progress = "win")

# Run Multi-Grained Scanning
new_model <- MGScanning(data = new_train,
                        labels = label_train,
                        folds = folds,
                        dimensions = 2,
                        depth = 10, # Default official implementation
                        stride = 2, # Set this to 1 if you want maximum performance, but it is VERY SLOW + adds many features
                        nthread = 4, # More threads if you feel so
                        n_forest = 2, # Following official implementation
                        n_trees = 30, # Following official implementation
                        random_forest = 1, # Following official implementation
                        seed = 0,
                        objective = "multi:softprob",
                        eval_metric = df_acc,
                        multi_class = 10,
                        garbage = TRUE)

# Predict on train data
new_train2 <- MGScanning_pred(model = new_model,
                              data = new_train,
                              folds = folds,
                              dimensions = 2,
                              multi_class = 10)

# Predict on test data
new_test2 <- MGScanning_pred(model = new_model,
                             data = new_test,
                             folds = NULL,
                             dimensions = 2,
                             multi_class = 10)

# Create new datasets
new_train2 <- Laurae::DTcbind(new_train2, train)
new_test2 <- Laurae::DTcbind(new_test2, test)

# Do Deep Forest / gcForest
# 0.9112+
model <- CascadeForest(training_data = new_train2,
                       validation_data = new_test2,
                       training_labels = label_train,
                       validation_labels = label_test,
                       folds = folds,
                       boosting = FALSE,
                       nthread = 1, # More threads if you feel so
                       cascade_lr = 1,
                       training_start = NULL,
                       validation_start = NULL,
                       cascade_forests = c(rep(8, 4), 0), # c(rep(4, 2), 0) should be enough
                       cascade_trees = 100, # Set this to much higher like 1000 (cf official paper)
                       cascade_rf = 4, # If you changed cascade_forests, change this value accordingly
                       objective = "multi:softprob",
                       eval_metric = Laurae::df_acc,
                       multi_class = 10,
                       early_stopping = 2,
                       maximize = TRUE,
                       verbose = TRUE,
                       low_memory = FALSE,
                       essentials = TRUE,
                       garbage = TRUE)

# Try with xgboost as final booster instead of Cascade Forest
dtrain2 <- xgb.DMatrix(data = Laurae::DT2mat(new_train2), label = label_train)
dtest2 <- xgb.DMatrix(data = Laurae::DT2mat(new_test2), label = label_test)
gc()

# [223] train-merror:0.000000   test-merror:0.072300 
# 0.927700 accuracy, not converged
gc()
set.seed(11111)
model2 <- xgb.train(params = list(nthread = 1, # More threads if you feel so
                                  eta = 0.10,
                                  max_depth = 6,
                                  booster = "gbtree",
                                  tree_method = "hist",
                                  grow_policy = "depthwise"),
                    objective = "multi:softprob",
                    num_class = 10,
                    eval_metric = "merror",
                    nrounds = 1000000,
                    early_stopping_rounds = 50,
                    data = dtrain2,
                    watchlist = list(train = dtrain2, test = dtest2),
                    verbose = 1)

If you run out of memory, my latest R package has support for storing models externally so it does not eat memory quickly.

It would be great if LightGBM has support for Random Forest so gcForest speed is increased by more than 10X. Currently, xgboost cannot handle fast histogram in Random Forest mode.

wxchan commented 7 years ago

@Laurae2 I know gcForest is better than single xgboost/lightgbm, I meant the 'ensemble' of xgboost/lightgbm.

To conclude, I think gcForest is not a very novel idea, just stacking ensemble, I think many kaggle masters can think out this idea, far from 'An Alternative to Deep Neural Networks'.

Laurae2 commented 7 years ago

@wxchan ensemble of RF/CRTF should be better (or at least equivalent) because the xgboost/lightgbm boosting default diversity is lower (ensemble of random > ensemble of strong). Add on top of that the stochastic stacking which adds information xgboost/lightgbm will not be able to handle.

gcForest is a stochastic stacking ensemble, nothing new in it as you said because it's mostly a manual implementation to add to have it (and some of us were already using it for a long while ago - and in using a wayyyy stronger architecture).

chrisorm commented 7 years ago

Have you guys been able to replicate their benchmarks? Sklearns RF gets close to the benchmarks for Random Forests that they report, so I am confident implementations are close enough. I use Extra Trees from Sklearn for the completely random trees.

I have an implementation of this paper (particularly the Cascade Forest) using the Scikit-learn API, but have not been able to reach anything like their benchmark. I emailed the authors of the paper asking to clarify incase I misinterpreted some part of their description, but the response was that 'other teams have implemented based on the paper, so the paper is sufficient'.

https://github.com/chrisorm/Cascade-Forests

If you spot any mistakes or misinterpretation please let me know, but at the moment I struggle to beat a well chosen RF using this architecture.

I am also unsure as to the quality of the benchmarks they use. They say they use a single stock RF (2000 trees) for all problems, which does not seem like a fair test when their model does extensive cross validation to select the best configuration. Hardly like-for-like IMO, and the computational cost of this model is pretty big.

Laurae2 commented 7 years ago

@chrisorm You need to use Multi-Grained Scanning to achieve the optimal performance they depict (close to 99%). 98% using Cascade Forest should be achievable on MNIST, but we need someone with a very large computational power to test it using all MNIST data instead of 2,000 samples + poor parameters I used for maximum speed (NB: not 1,000, was a mistake in my previous posts).

That's for Cascade Forest alone.

However, Multi-Grained Scanning on MNIST is huge; very huge. The worst case is ((28-2 + 1)^2) x2 = 729 x2 Cascade Forests and 7,290 x2 dimensions to add, while the best case is ((28-7 + 1)^2) x2 = 484 x2 Cascade Forests and 4,840 x2 dimensions to add. Sum all this together... (7,290 + 6,760 + 4,840) x2 = 37,780 dimensions to stack with the original dataset. Then, we must concatenate the data and do a full Cascade Forest on it... which is a lot of time required considering such feature size (in addition to training multiple models just for one Cascade).

The model gets excessively large even if someone uses my "poor parameters" (stride=2, 100 trees only for Cascade Forest, etc.), as you can see below:

Just the Multi-Grained Scanning would be massive: stride=2, depth=10 == 9x9 scan on 28x28 feature => 81 Cascade Forests (1,620 features to add) = 250+MB, is massive. 250 * (37,780/1,620) = 5.83GB model

This also means 70,000 37,780 8 = 20GB RAM for the matrix alone.

It might explain why they resorted to using a server with Dual Intel Xeon E5-2670v3 and they couldn't finish running their experiments: the models are very fat on CPU.

chrisorm commented 7 years ago

@Laurae2 Have you tried the adult dataset? They do a benchmark vs. random forest in the paper, and this is much smaller and doesn't use multigrain scanning either. I am highly curious as to whether one could do better than the RF benchmark they quote, given thats what I got with 2000 trees forest that is out of the box sklearn. Perhaps I will use your implementation until I figure out what is different about mine.

On cursory look, your algorithm seems logically to be doing the same thing as mine, yet there is substantial difference in the performance (yours seems to work reasonably well). I guess there is some subtly not made clear in the description (or I messed up somewhere!).

I implemented multigrain and it did indeed add a massive amount of features - the idea works in CNN's because its a form of parameter tying - here it makes the computational cost much larger.

Laurae2 commented 7 years ago

@chrisorm Here on Adult dataset. I'm also getting the same issue you described for Adult dataset: better RF out of the box (86.57%) than their results (86.17%). I'm sure Deep Forest/Boosting can push it even further easily (87.58%+).

Train/Test set: as defined on the Adult dataset (32,561 training, 16,281 testing)
k-Folds for Cascade Forest: 3
4 threads on i7-4600U (usage of hyperthreading for more speed)
xgboost: depth=6, eta=0.10, tree_method="hist", grow_policy="depthwise", auto-threshold for Accuracy
xgboost Random Forest: "infinite" (65535) depth, eta=1.00, subsample=0.632, colsample_bylevel=ceiling(sqrt(13))/13=4/13, 2000 trees, auto-threshold for Accuracy
Cascade Forest: 2 Random Forest + 2 Complete-Random Tree Forests, 1000 trees per forest, early stopping of 4 (stops at 5th consecutive performance loss), auto-threshold for Accuracy

Cascade Forest improvements (reported: mean without standard deviation for the forests):

Layer	F1 (RF1)	F2 (RF2)	F3 (CRTF1)	F4 (CRTF2)	Avg Forest	Perf.
Layer 1	86.3972%	86.4238%	78.7237%	77.9620%	85.3142%	=
Layer 2	86.4095%	86.3972%	86.4443%	81.9299%	86.5549%	+
Layer 3	86.5262%	86.4955%	86.6286%	83.9117%	86.6900%	Best
Layer 4	86.4976%	86.4750%	84.0305%	83.7930%	86.5856%	-
Layer 5	86.3972%	86.4198%	83.7930%	86.5651%	86.5057%	-
Layer 6	86.3645%	86.3849%	86.4566%	80.6769%	86.4996%	-
Layer 7	86.3993%	86.4300%	80.6769%	86.3993%	86.4627%	-
Layer 8	86.3993%	86.4218%	86.3440%	86.3542%	86.6163%	-

Model size:

Model	Size (bytes)	Size (better unit)
8 Layers	2,108,466,480 bytes	1.96GB
3 Layers	734,047,020 bytes	700MB

yes, the 8 layer model size is 1.96GB, not a typo.

Training time:

Model	Iterations	Accuracy	Time (s)	Perf.
Official paper Deep Forest	see official paper	86.17xx%	?s	6th
xgboost Mode: Random Forest	Full: 2000 iterations	86.5733%	62.001s	5th
xgboost Mode: Boosted Trees	Best: 134 iterations Train: 184 iterations	87.5868%	5.737s	1st
Cascade Forest	Best: 3 layers Train: 8 layers	86.6900%	1601.794s	4th
Cascade Forest Stack: Random Forest	Full: 2000 iterations	86.7023%	65.434s	3rd
Cascade Forest Stack: Boosted Trees	Best: 99 iterations Train: 149 iterations	87.2797%	5.983s	2nd

Boosting speed is not typo, it is really ~6 seconds, and it gives the best accuracy out of the box without even any parameter tuning.

Laurae2 commented 7 years ago

@chrisorm Another Deep Forest implementation in Python here: https://github.com/pylablanche/gcForest

N.B: its benchmark uses 8x8 images, ~1800 samples, not comparable with results reported here.

asfix commented 7 years ago

Hello, i would like to apply gcForest on my logo dataset. However, I don't know where to start from. First of all, i need an implementation. Is there any implementation of gcForest rather than Python version. For example, c++ or R would be great. Can anybody help or guide me? I have seriously involved in this method.

Any kind of help will be appreciated.

pylablanche commented 7 years ago

@Laurae2 thanks for citing my work. The small test I ran was indeed on a very small data set and on 8x8 images I don't think the Multi-Grain scanning step is very relevant. If anyone have the resource to test the implementation on the whole MNIST data set and let me know the performance (I have limited resources available) that would be great!

I'm currently working on the memory usage issue and would hopefully come up with a solution soon.

pylablanche commented 7 years ago

@chrisorm I have just read what you went through while implementing gcForest and I don't think the paper is sufficient in the details it is providing. There are several technical points that aren't clear and I had to make some assumption especially about how the training is done. It was a pretty dull answer from the authors.

I also read that you use Extra Trees for the "complete random forest" but I actually think that what they use is Random Forest where feature selection at a node is not done from a subsample of the original set of features but from the whole set of features (set max_features=None in scikit).

Have you anyway tried to run your code on the scikit digits dataset? (just by curiosity)

Another general question : How many layers do you guys get when running gcForest ? In my case I never exceeded 5 layers and most of the time the construction stops after 2, 3 layers.

chrisorm commented 7 years ago

@pylablanche I have run it on the digits dataset, but the results were not much more exciting than I could get with a single random forest of reasonable size. I can rerun it later if you want the exact numbers.

You are perhaps correct they do not mean a extremely random tree now I reread. The seem to imply they use exactly this type of tree http://www.jair.org/media/2470/live-2470-3860-jair.pdf. I'm not sure if this has any exact analogue to either RF or ERF in scikit learn.

I too noticed it struggles to move past a few layers in my method, although @Laurae2 seems to have his models bloat out to 500mb +, which suggests theirs moves past a couple of layers.

For me the big red flag is that this needs a very particular structure to do anything that beats a single RF. Clearly changing out the type of diversity tree destroys the edge the method has - which is nothing like a neural network. The representational learning you get in the layers of a neural network is something very fundamental and powerful. The effectiveness of the neural network comes from the fact that each layer learns a feature of importance for the problem, and these are combined in a meaningful way. Changing softmax for tanh or relu doesn't make the method not work, because the actual important part is that each part of the NN learns some meaningful feature of the inputs, and these features are passed forwards and combined etc. I question whether there is any meaningful information passed by this method - I can see no method by which passing the class probabilites for each forest would actually represent some meaningful embedding of the inputs into some other space.

At best you could argue that the aribtary-dimensional decision boundaries that a forest learns have some shortcoming and that essentially stacking models like this allows you to get a better approximation of the true separating boundaries, hence your accuracy increases. (Think of an example of an RF trying to learn a perfectly circular decision boundary).

IMO I'm not sure the title and inference of the authors regarding the power of this method is in any way justified.

Laurae2 commented 7 years ago

@chrisorm The big red flag is indeed with the stacking method. As with any stacking methods, this is computationally intensive and does not have much meaning when predicting a layer: this is more like an attempt to mathematically converge to a better local optima, and letting "machine learning" doing the work, provided a known structure where it will work reasonably "fast" and well.

Using RF to beat RF apples to apples... is done using stacking unfortunately. Also, using RF loses already complete interpretability of the model.

As for the paper title, I would rather say it's a potential alternative to simple CNNs, without more. But also, I have six more issues with the paper:

I'm getting better and faster results with LeNet compared to Deep Forest on a portion of MNIST (2k).
Myself and @chrisorm are getting better results out of the box with a Random Forest on the Adult dataset than their Deep Forest, something is wrong/fishy (even worse: we are using two different Random Forest implementations).
I'm getting better results just by using a gradient boosted model instead of a Cascade Forest (why are we needing the complexity of Cascade Forest when we already know boosting methods are typically better? - not talking about diversity here).
Just putting Intel MKL on MXnet made LeNet training very fast on CPU (cf issue 1), faster than Cascade Forest, Boosting being "only" twice faster using a 10x speedup (fast histogram method). Even if we assume a log scale for gradient boosting and linear scale for neural network, this doesn't fit what they have.
The authors do not want to publish their code (what is research for, then? contributing or not contributing? reproducible or not reproducible? why can't we reproduce their results?).
The authors are technically too vague in their model description, which leads to many different implementations which differ in their way of working (reproducibility issue).

pylablanche commented 7 years ago

@Laurae2 @chrisorm Thank you both for your feedback. I definitely share the overall feeling around gcForest and the fact that the authors did not released the code is indeed very suspicious in addition of being a mistake.

I have also spotted a number of points in the paper that I find surprising such as the 1000 trees per Random Forests. On a data set like MNIST with 10 classes and 55000 samples you very likely don't need that many trees and I don't find any empirical or theoretical justification to this. On some personal datasets I also get results using Random Forest out of the box that are very close if not better than gcForest. I am not surprise that you get better results using simpler approaches.

The paper definitely lacks sensible justification for the complexity of the algorithm and the reasons why this (apparently) works is not straightforward to me. I have tried quickly to trace features importance at each layer but it hasn't really helped me to understand what is going on. At the same time, as far as I know, we are still struggling to really understand why neural networks as well. At least the idea of using Random Forests in a different structure is interesting.

I am not super confident that Deep Forest as it is has a great future. Has the paper actually been published/accepted in any journal ?

Laurae2 commented 7 years ago

Has the paper actually been published/accepted in any journal ?

I didn't see it in any journal, probably it was not submitted.

I am not super confident that Deep Forest as it is has a great future.

I think under the name "Stacking" (and if we change the models) it has a great future, but not under the name "Deep Forest". Currently, stacking is widely used for getting the maximum performance out of models, which is nothing new (especially on Kaggle competitions).

The paper definitely lacks sensible justification for the complexity of the algorithm and the reasons why this (apparently) works is not straightforward to me.

It also lacks how they compiled their machine learning programs, because it has a huge impact on the speed of models.

@pylablanche By the way, I saw you added recently a more efficient slicer to your Deep Forest implementation. Are you able to run it on MNIST with the 2k first samples? You can download MNIST in CSV here: https://pjreddie.com/projects/mnist-in-csv - it should fit easily a 8GB machine if it is efficient enough.

pylablanche commented 7 years ago

@Laurae2 I have indeed been able to improve the slicing part of my Deep Forest implementation and I can now train it on the 55000 MNIST training samples for a window size of 15x15 pixels on my laptop (8GB of RAM). I'll post results here as soon as I get them. Anyway, I still cannot reproduce the configuration used in the paper as I would need much more computing resource but at least it is much more memory efficient now. Anyone wants to try? I'll probably check if other parts of the code can be improved as well.

wxchan commented 7 years ago

NEWS: The 2nd author claimed today: they will release their codes before June 1st.

https://www.zhihu.com/question/56474891/answer/158293083

Laurae2 commented 7 years ago

Google translate is very difficult to get on it apparently (it hides all the content once translated for weird reasons), here it is (lot of stuff makes no sense with google translate):

Ji Feng Talk about machine learning and others 925 people agree with the answer Thank you As one of the participants in the paper, I do very limited, here to talk about my personal opinion on the little teacher gcForest week. Here is the statement: all views just represent me personally.

Let's say a few personal conclusions:

This is a Deep Model, but building blocks is a decision tree. This work is not to drop the depth of learning the field, the depth of learning just fine, and based on the depth of the neural network to explore the model almost, we would like to explore the depth of the tree based on the possibility of the model. It is futile to think that there is no reasonable reason to be sure that the attempt based on the depth model of the decision tree is futile, and the integration of the decision tree itself has many properties that the neural network does not have, so it is worthwhile to spend some time and effort Strong adjustment).

This is just the beginning. Decision tree has a lot of horrible nature, at present I just want to say that we know nothing about the power of the forest, which has great potential, to be developed. Depth study from 2006 to 12 years, spent in the middle of almost 6 years to keep flourishing, if you expect an article will be able to get so many years, so many scholars, and so much money things, it is impossible.

performance - sorry, for my personal laziness, did not adjust the structure of the forest. Because we focus on versatility. The A little while the forest complexity was done, MNIST was 99.39% (there was still room for me, because I just doubled the forest ..), I know a lot of friends do think visual CIFAR / ImageNet should run, here a little Say a few words: A) do visual friends think CIFAR / ImageNet model performance is the center of the universe, but it is not the center of the machine to learn, in this work, we are more concerned about a common framework.

B) also want to run, but did not achieve the distributed algorithm, and my stand-alone memory is limited, and later optimized the program, CIFAR10 is able to run up, plus every piece of memory, performance up 4 points , The current single to 70%. AlexNet is 83%, compared with 10% of the gap. Note, however, that the depth full-link neural network (MLP) is 47%, and Alex's Deep Belief Network (which is the first important model for in-depth learning) is 65% on cifar10, and all other non-neural networks are 55% or less (if the input does not make any changes). I personally guess the first work as a deep forest, the results can be considered enough? (At least than the depth of the Boltzmann machine performance and popularity of strong?)

C) This is the first work to propose a more general framework and direction that will be specifically optimized for future applications based on computer vision.

D) who guys send me a few pieces of memory to make

The other content, please look at the original teacher Zhou teacher just fine. And then respond to some comments:

"There is no BP certainly not."

Sorry, this does not agree. BP is a great algorithm, and also my personal favorite algorithm top5. BP is very important in the neural network, but if you think there is no BP certainly not, then some thinking fixed, and similar to no wings can not heaven This assertion. The Fly on the sky not only birds and planes, as well as rockets. Give a chestnut: Although not to do the neural network, but also clear that there are many ready-made neural network model without BP, such as the famous NEAT. (If I remember correctly, there is a GAN work is done using NEAT G?) And then give a little bit of the chestnut: Last month OpenAI's Evolution Strategies as a Scalable Alternative to Reinforcement Learning (this title is cool, laughing) ... no need for backpropagation. ES only requires the forward pass of the policy and does not require backpropagation (or value function estimation), which makes the code shorter and between 2-3 times faster in practice. So in fact, BP is not a lot of tasks in the sub-items, not to say that the BP did not turn on the play. The There are a lot of neural network to do the gods have been trying to do something to replace the BP thing, this is not something secret: even I know.

"can not feature transfer."

Sorry, this also disagreed. I feel that can not do, and others can not do two things. The The This can not tell too much now.

"can not End to End." This is a deep learning practitioner which is a common cliché. Similar to BP, E2E is not a necessary condition for the model to work. In addition to the phrase, if you look at the history of machine learning, there are many non-neural network e2e model does not work. The

"Hope that in this impetuous era, everyone, especially the researchers to maintain independent thinking, do not parrot.

The phrase itself is right. I am in favor. Need to be questionable is: what is the beginning, what is the end, what is the beginning of it? For accuracy, in the neural network repeatedly call the unspeakable implicit, this is the beginning of it? Two years ago Embedding + Everything, a year ago GAN + Everything and the water out of thousands of articles, this is the beginning of it? Holding the hammer all over the world looking for nails and water out of the tens of thousands of applications, this is the beginning of it? The depth of the neural network speculation into a strong era of artificial intelligence, with some misleading demo to "show" the machine already has the human emotions, this is the beginning of it?

Personally think that different groups, the beginning should be different:

As a computer vision / natural language processing to do the application of people, abandoned all areas of knowledge, only with the neural network, personally feel that forget the beginning of the bar . The Individual skeptics are the ultimate way to solve CV / NLP. The I am not doing the application, this is not a lot of talk, the courage to suggest that you read Professor UCLA, computer vision, one of the great god of Professor Zhu Songchun interview: "the first three sources of computer vision and artificial intelligence" http: // it. Sohu.com/20161129/n4 74464488.html I guess, Professor Zhu mentioned the source, probably should be counted as a computer vision of the "early heart" it

As a researcher in the study of this subject, the beginning should be the introduction of some innovative direction, theoretically enlightened, or the proposed model for the downstream application of the subject has been harvested. For example, when others are not optimistic about CNN, Yann LeCun can silently stick to it, the potential of the neural network to the extreme, it is not forget the beginning, this is really cattle fork.

And take a few demo brush to find the wind to find, water articles, work to do the balance, to ridicule and deny the real bite hard bones of the people, it seems not a very cool thing. Of course, irrigation is not shameful (weak food like me, this thing dry up very hard), for a theory of building blocks is a modern scientific spirit of the embodiment. But you have to allow the world to have a handful of people who have the right to try other things. This is the difference in value.

"If there is an intern to tell me such an idea, I will never agree to do the work .." This one. The The Heart is actually some lucky. I even think that can be used as the Southern LAMDA group enrollment propaganda: "In our subject group, you can do some in-depth learning for the core technology companies will be killed in the fun of the work." It is a few words to promote it, in fact, LAMDA group direction is very comprehensive, there are special proof of the theory of Professor Gao Wei, he spent five years to prove a boosting inside a pending big problem, is the machine learning theory One of the annual breakthrough (many people do not think that can be certified), there are specialized to do large-scale distributed machine learning algorithm Professor Li Wuchun (5 hours a week marathon group will not move), there are specialized to do the depth of learning, but also MINIEYE the chief scientist Professor Wu Jianxin (his students you certainly know, know the great god @ Wei Xiu Shen ), and in all the people are not optimistic about the study of the time in the field quietly cultivate the unique Professor Yu Yang (now RL fire A mess, really feng shui turn ..). The Of course, there are academic insightful Zhou teacher town building, which Needless to say. The group of professors in the direction of the whole, quite advantages, in common only one: do not follow the crowd, is a serious study of the Lord. As a group of students, in the machine learning most of the problems encountered in the field, within a distance of 30-50 meters can find the right professor, face to face to give you real professional help and counseling, which in other places is difficult to do To the. Undergraduate open day is coming, please learn younger brother who are concerned about the enrollment situation. Than heart.

For the first time in the knowledge of the answer, point praise five hundred put the code in the six before the holiday put, as the Children's Day gift. At least take the brush Kaggle game or very good to make. The Finally, I wish you do not forget the beginning of the heart, side was always (this piece of soup is really wild ah ...) forget, crossed out. your happiness is the sun in my world.

Please indicate the source (Point of praise is a joke, the spacious group of working code is not open source, is still review, 6.1 will certainly put before, nothing to do a few praise) Edited at 16:10

wxchan commented 7 years ago

NEWS: source codes released: http://lamda.nju.edu.cn/code_gcForest.ashx, last line

wxchan commented 7 years ago

github repo: https://github.com/kingfengji/gcforest

kingfengji commented 7 years ago

Hi there, I'm the 2nd author of gcForest. I would like to appriciate your interests on this paper especially @Laurae2 @guolinke @wxchan @pylablanche and all of you for either re-implementation and discussions.

As @wxchan pointed out, the code has been released at [https://github.com/kingfengji/gcforest], I'm just a coding apprentice and definitely NOT a professional programmer like you guys, so any suggestions is very welcomed on boosting up the performance. ( currently sklearn did a poor job when parallelizing trees, since they just copy the data N times, I plan to fix this ... if this got fixed, the memery issue will be eliminated I guess. Currently trying to figure out how that works, help is appriciated)

The google translation is hilarious. It missed every point I wish to make. But that's OK, no big deal. What I was trying to say is forests are pretty cool stuffs, people ignored many aspects of forest for so long. More research results are coming up ,ranging from representation learning to applications.

Thank you again and I give you my humble regards.

chrisorm commented 7 years ago

@kingfengji thanks for coming on here.

Has this paper passed peer review and been accepted to any journal?

As has been said before, stacking is not new and whilst you claim this is an alternative to deep nets, you don't actually provide any justification as to why this is the case. Why do you personally think this method is a step towards a new type of deep learning, as opposed to just an engineered solution that squeezes out a bit more performance?

Also your claim in the paper that random forests are easier to analyse is somewhat disingenuous. There are many blogs and papers where authors have shown the features a neural network has learnt or visualised the role of each of the neurons. Random forests are a collection of different decision trees - while we can inspect any one instance, it is not a simple matter to explore the features on an aggregate level. It is entirely possible that no 2 trees share a common path and so it is not clear how you can interpret this any better than a collection of weights and sigmoids.

kingfengji commented 7 years ago

@chrisorm Hi Chris. The paper has been accepted and will be presented at IJCAI-17. (See the paper link in the GitHub link I posted, it differs a little bit with the original draft) I'm not here to do academic rebuttals... I just came across this page after I posted the code online and saw you guys put some efforts on re-implement the model, which I feel ,personally ,touched and I appreciate that. Any issues related to the implementation is welcomed. Thanks!

pylablanche commented 7 years ago

@kingfengji Welcome in the discussion and thanks for joining! The paper raised a lot of questions and discussions! I found it interesting that the layered structure seems to work so well. I actually have some questions for you.

What is, according to you, the reason why this cascade structure works so well ? It is still something I am trying to understand better and it just seems like an empirical result. Just to make it clear, was the only motivation of using MGS+cascade to mimic the convolution+layers structure of a convolutional neural network ?

On a more practical aspect : The main issue noticed by users is the memory usage, did you experience it? Do you have any recommendation on this? Also : Why Python2.7 ???

Hope to hear from you soon! :)

pylablanche commented 7 years ago

Another paper using a similar cascade structure (citing gcForest) : https://arxiv.org/pdf/1705.07366.pdf

kingfengji commented 7 years ago

@pylablanche First I wish to thank you for your implementations and your time, I appriciate that very much.

Apart from the reasons mentioned in the paper, the cascade structure works better is due to the concatenation of the original input (intuition here is the forest for the next level won't forget the real parttern and in the same time see some predictions) and use class distributions(instead of pure predictions) , which both help to increase the generalization power. Boosting and stacking are two related ideas for cascading forest.
MGS should be only used when spatial relationships among features is at present.
The memory issue. Yes that's the currently the biggest issue here. One of main reasons is this:

*In scikit-learn, suppose there are 1000 trees, and there are 100 possible classes, the predict_proba() function is implemented by first calculate the [100,1] predictions obtained by each tree, then average them up. This is very very memory inefficient. Imagine the input data has dimention 500, then when doing predictions, this 500d vector will be transformed into 1000100 numbers in mem! And you got 60,000 input points to predict..**

I partially fix this problem by doing predictions while feeding the input chunck by chunck, as in _batch_predict_proba( /lib/gcforest/estimators/base_estimators.py ) and forest_predict_batch_size (lib/gcforst/estimators/sklearn_estimators.py).

However, I think if we re-implement the sklearn's predict_proba function in a smart way , we can reduce the mem issue a lot. Since storing all the class prob for all the trees is not necessary. Once this is done, I think we can reduce the mem by several factors.

The dev enviroment I had at that time was python 2.7, there are some package missing in my 3.5 env, so I just randomly picked 2.7.... no particular reasons..... I might give a 3.5 version.....
Finally, I plan to release an update which support the model.fit/ model.predict scheme. You are more than welcomed to comment/discuss/contribute. Feel free to leave comments at https://github.com/kingfengji/gcForest/issues/3

Thanks!

guolinke commented 7 years ago

refer to https://github.com/kingfengji/gcForest and https://github.com/pylablanche/gcForest