microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Random Forest Mode improper? #2922

Closed mayer79 closed 4 years ago

mayer79 commented 4 years ago

I was trying to compare the random forest modes of XGBoost, LightGBM, and a true implementation (ranger) in R. While the XGBoost predictions are very similar to ranger's (correlation almost 1), the predictions from LightGBM are very different. What am I doing wrong? (Selecting the parameters is not easy, and I could not find a guide in the documentation.)

set.seed(1)
n <- 1000
x1 <- seq_len(n)
x2 <- rnorm(n)
x3 <- rexp(n)
x4 <- runif(n)
X <- cbind(x1, x2, x3, x4)
y <- rnorm(n, x1 / 1000 + x2 / 10 + x3 / 5)

library(xgboost)
library(lightgbm)
library(ranger)

# XGB Random Forest
param_xgb <- list(max_depth = 10,
                  learning_rate = 1,
                  objective = "reg:linear",
                  subsample = 0.63,
                  lambda = 0,
                  alpha = 0,
                  colsample_bylevel = 1/3)

dtrain_xgb <- xgb.DMatrix(X, label = y)

fit_xgb <- xgb.train(param_xgb,
                     dtrain_xgb,
                     nrounds = 1,
                     num_parallel_tree = 500)

# LGB Random Forest
param_lgb <- list(boosting = "rf",
                  max_depth = 10,
                  num_leaves = 1000,
                  learning_rate = 1,
                  objective = "regression",
                  bagging_fraction = 0.63,
                  bagging_freq = 1,
                  reg_lambda = 0,
                  reg_alpha = 0,
                  colsample_bynode = 1/3)

dtrain_lgb <- lgb.Dataset(X, label = y)

fit_lgb <- lgb.train(param_lgb,
                     dtrain_lgb,
                     nrounds = 500)

# True Random Forest
fit_rf <- ranger(y = y, 
                 x = X, 
                 max.depth = 10, 
                 num.trees = 500)

# Evaluate predictions
pred <- data.frame(
  pred_xgb = predict(fit_xgb, X),
  pred_lgb = predict(fit_lgb, X),
  pred_rf = predict(fit_rf, X)$predictions
)

summary(pred)
# pred_xgb          pred_lgb           pred_rf       
# Min.   :-1.4913   Min.   :-0.05034   Min.   :-1.4899  
# 1st Qu.: 0.4484   1st Qu.: 0.53052   1st Qu.: 0.4267  
# Median : 0.7375   Median : 0.77139   Median : 0.7270  
# Mean   : 0.7452   Mean   : 0.74428   Mean   : 0.7419  
# 3rd Qu.: 1.0437   3rd Qu.: 0.96095   3rd Qu.: 1.0669  
# Max.   : 2.3308   Max.   : 1.40742   Max.   : 2.3308  

cor(pred)
#           pred_xgb  pred_lgb   pred_rf
# pred_xgb 1.0000000 0.7779298 0.9924107
# pred_lgb 0.7779298 1.0000000 0.7813147
# pred_rf  0.9924107 0.7813147 1.0000000

sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 18362)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_Switzerland.1252  LC_CTYPE=English_Switzerland.1252   
# [3] LC_MONETARY=English_Switzerland.1252 LC_NUMERIC=C                        
# [5] LC_TIME=English_Switzerland.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] ranger_0.12.1    lightgbm_2.2.4   R6_2.4.1         xgboost_0.90.0.2
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.6.1    magrittr_1.5      Matrix_1.2-17     tools_3.6.1       Rcpp_1.0.3       
# [6] stringi_1.4.6     grid_3.6.1        data.table_1.12.8 jsonlite_1.6.1    lattice_0.20-38 

guolinke commented 4 years ago

Did you try colsample_bynode in xgb?
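
A minimal sketch of this change, assuming the param_xgb list from the original post stays otherwise unchanged: colsample_bynode resamples one third of the features at every split, which matches how a random forest draws its mtry candidates, whereas colsample_bylevel resamples only once per tree level.

param_xgb <- list(max_depth = 10,
                  learning_rate = 1,
                  objective = "reg:linear",
                  subsample = 0.63,
                  lambda = 0,
                  alpha = 0,
                  colsample_bynode = 1/3)  # was colsample_bylevel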

guolinke commented 4 years ago

And you can set min_data=1 and min_data_in_bin=1 in lgb.
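
A minimal sketch of these settings, assuming the param_lgb list from the original post (min_data is an alias of min_data_in_leaf, whose default of 20 keeps leaves large and so prevents the fully grown trees a random forest relies on):

param_lgb <- c(param_lgb,
               list(min_data_in_leaf = 1,  # alias: min_data
                    min_data_in_bin = 1))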

mayer79 commented 4 years ago

Hello @guolinke

Setting min_data_in_leaf = 1 is the solution, thanks for your help! And of course I should have been using colsample_bynode in xgb, although it has surprisingly little impact in this example. I have added in-sample RMSEs to show how well it works with the right parameters.

Do you think the other parameters are reasonable, e.g. 500 boosting rounds, a learning rate of 1, low penalties, a high number of leaves, etc.?

Which script would I need to study how rf mode is implemented?

Best

set.seed(1)
n <- 1000
x1 <- seq_len(n)
x2 <- rnorm(n)
x3 <- rexp(n)
x4 <- runif(n)
X <- cbind(x1, x2, x3, x4)
y <- rnorm(n, x1 / 1000 + x2 / 10 + x3 / 5)

library(xgboost)
library(lightgbm)
library(ranger)

# XGB Random Forest
param_xgb <- list(max_depth = 10,
                  learning_rate = 1,
                  objective = "reg:linear",
                  subsample = 0.63,
                  lambda = 0,
                  alpha = 0,
                  colsample_bynode = 1/3)

dtrain_xgb <- xgb.DMatrix(X, label = y)

fit_xgb <- xgb.train(param_xgb,
                     dtrain_xgb,
                     nrounds = 1,
                     num_parallel_tree = 500)

# LGB Random Forest
param_lgb <- list(boosting = "rf",
                  max_depth = 10,
                  num_leaves = 1000,
                  learning_rate = 1,
                  objective = "regression",
                  bagging_fraction = 0.63,
                  bagging_freq = 1,
                  reg_lambda = 0,
                  reg_alpha = 0,
                  min_data_in_leaf = 1,
                  colsample_bynode = 1/3)

dtrain_lgb <- lgb.Dataset(X, label = y)

fit_lgb <- lgb.train(param_lgb,
                     dtrain_lgb,
                     nrounds = 500)

# True Random Forest
fit_rf <- ranger(y = y, 
                 x = X, 
                 max.depth = 10, 
                 num.trees = 500)

# Evaluate predictions
pred <- data.frame(
  pred_xgb = predict(fit_xgb, X),
  pred_lgb = predict(fit_lgb, X),
  pred_rf = predict(fit_rf, X)$predictions
)

summary(pred)
# pred_xgb          pred_lgb          pred_rf       
# Min.   :-1.4913   Min.   :-1.4414   Min.   :-1.4899  
# 1st Qu.: 0.4484   1st Qu.: 0.4378   1st Qu.: 0.4267  
# Median : 0.7375   Median : 0.7406   Median : 0.7270  
# Mean   : 0.7452   Mean   : 0.7448   Mean   : 0.7419  
# 3rd Qu.: 1.0437   3rd Qu.: 1.0593   3rd Qu.: 1.0669  
# Max.   : 2.3308   Max.   : 2.3252   Max.   : 2.3308  

cor(pred)
#           pred_xgb  pred_lgb   pred_rf
# pred_xgb 1.0000000 0.9848699 0.9924107
# pred_lgb 0.9848699 1.0000000 0.9869734
# pred_rf  0.9924107 0.9869734 1.0000000

rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

rmse(y, pred$pred_xgb) # 0.6170593
rmse(y, pred$pred_lgb) # 0.6047725
rmse(y, pred$pred_rf)  # 0.6003823

guolinke commented 4 years ago

These parameters look good to me.

mayer79 commented 4 years ago

Thanks, then let's close this thread.