Closed. mayer79 closed this 4 years ago.
I was trying to compare the random forest modes of XGBoost and LightGBM with a true implementation (ranger) in R. While the XGBoost predictions are very similar to those of ranger (correlation almost 1), the LightGBM predictions are very different. What am I doing wrong? (Selecting the right parameters is not easy, and I could not find a guide in the documentation.)
Did you try colsample_bynode in xgb?
And you can set min_data=1 and min_data_in_bin=1 in lgb.
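For example (a minimal sketch assuming the R package's lgb.Dataset/lgb.train interface and the X, y used below; min_data is an alias for min_data_in_leaf):
param_rf <- list(boosting = "rf",
                 bagging_fraction = 0.63,
                 bagging_freq = 1,
                 min_data_in_leaf = 1,  # alias: min_data
                 min_data_in_bin = 1)
fit <- lgb.train(param_rf, lgb.Dataset(X, label = y), nrounds = 500)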
Hello Linke
Setting min_data_in_leaf = 1 is the solution - thanks for your help! And of course I should have been using colsample_bynode in xgb, but it has surprisingly little impact in this example. I have added in-sample RMSEs to show how well it works with the right parameters!
Do you think the other parameters are reasonable, e.g. 500 boosting rounds, a learning rate of 1, low penalties, a high number of leaves, etc.?
Which script would I need to study to see how rf mode is implemented?
Best
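# Simulate toy data: y depends linearly on x1, x2, x3 plus Gaussian noise; x4 is uninformative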
set.seed(1)
n <- 1000
x1 <- seq_len(n)
x2 <- rnorm(n)
x3 <- rexp(n)
x4 <- runif(n)
X <- cbind(x1, x2, x3, x4)
y <- rnorm(n, x1 / 1000 + x2 / 10 + x3 / 5)
library(xgboost)
library(lightgbm)
library(ranger)
# XGB Random Forest
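# RF emulation in XGBoost: a single boosting round (nrounds = 1) that grows
# num_parallel_tree trees, each on a row subsample with per-node column
# subsampling; learning_rate = 1 and zero penalties so the trees are not shrunk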
param_xgb <- list(max_depth = 10,
                  learning_rate = 1,
                  objective = "reg:linear",
                  subsample = 0.63,
                  lambda = 0,
                  alpha = 0,
                  colsample_bynode = 1/3)
dtrain_xgb <- xgb.DMatrix(X, label = y)
fit_xgb <- xgb.train(param_xgb,
                     dtrain_xgb,
                     nrounds = 1,
                     num_parallel_tree = 500)
# LGB Random Forest
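# RF mode in LightGBM: boosting = "rf" fits one tree per iteration on a bagged
# sample (bagging_fraction < 1 with bagging_freq = 1 is required) and averages
# the trees, so nrounds is the number of trees; min_data_in_leaf = 1 lets the
# trees grow as deep as in a classical random forest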
param_lgb <- list(boosting = "rf",
                  max_depth = 10,
                  num_leaves = 1000,
                  learning_rate = 1,
                  objective = "regression",
                  bagging_fraction = 0.63,
                  bagging_freq = 1,
                  reg_lambda = 0,
                  reg_alpha = 0,
                  min_data_in_leaf = 1,
                  colsample_bynode = 1/3)
dtrain_lgb <- lgb.Dataset(X, label = y)
fit_lgb <- lgb.train(param_lgb,
                     dtrain_lgb,
                     nrounds = 500)
# True Random Forest
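# Reference implementation: ranger bootstraps rows and samples mtry features at each split by default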
fit_rf <- ranger(y = y,
                 x = X,
                 max.depth = 10,
                 num.trees = 500)
# Evaluate predictions
pred <- data.frame(
  pred_xgb = predict(fit_xgb, X),
  pred_lgb = predict(fit_lgb, X),
  pred_rf = predict(fit_rf, X)$predictions
)
summary(pred)
#     pred_xgb          pred_lgb          pred_rf
# Min.   :-1.4913   Min.   :-1.4414   Min.   :-1.4899
# 1st Qu.: 0.4484   1st Qu.: 0.4378   1st Qu.: 0.4267
# Median : 0.7375   Median : 0.7406   Median : 0.7270
# Mean   : 0.7452   Mean   : 0.7448   Mean   : 0.7419
# 3rd Qu.: 1.0437   3rd Qu.: 1.0593   3rd Qu.: 1.0669
# Max.   : 2.3308   Max.   : 2.3252   Max.   : 2.3308
cor(pred)
#           pred_xgb  pred_lgb   pred_rf
# pred_xgb 1.0000000 0.9848699 0.9924107
# pred_lgb 0.9848699 1.0000000 0.9869734
# pred_rf  0.9924107 0.9869734 1.0000000
rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}
rmse(y, pred$pred_xgb) # 0.6170593
rmse(y, pred$pred_lgb) # 0.6047725
rmse(y, pred$pred_rf) # 0.6003823
These parameters look good to me.
Thx, then let's close this thread.