microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[R-package] Prediction early stopping not working? #760

Closed Laurae2 closed 7 years ago

Laurae2 commented 7 years ago

ping @cbecker @guolinke

OS: Windows Server 2012 R2
R 3.4.0 compiled with MinGW 7.1
LightGBM compiled with Visual Studio 2017

Prediction early stopping parameters are not working (or are not discoverable in R?)

Timings reported for 500 iterations on Bosch dataset:

| Format | 0 = NA | Early Stop | AUC       | Best Iter | Time       |
|--------|--------|------------|-----------|-----------|------------|
| Sparse | FALSE  | FALSE      | 0.7095987 | 140       | 135683.676 |
| Sparse | FALSE  | TRUE       | 0.7095987 | 140       | 135971.854 |
| Sparse | TRUE   | FALSE      | 0.7097566 | 118       | 132794.350 |
| Sparse | TRUE   | TRUE       | 0.7097566 | 118       | 132671.168 |

Reproducible steps:

  1. Install LightGBM in R:
install_github("Microsoft/LightGBM@2e83a1c", subdir = "R-package")
  2. Run the following (adjust parameters to your needs):
setwd("E:/datasets")
sparse <- TRUE # dense is significantly slower
params <- list(num_threads = 40,
               learning_rate = 0.05,
               num_leaves = 63,
               max_bin = 255,
               pred_early_stop =  TRUE, # Change accordingly
               pred_early_stop_freq = 10, # Change accordingly
               pred_early_stop_margin = 10.0, # Change accordingly
               zero_as_missing = TRUE) # Change accordingly

library(data.table)
library(Matrix)
library(R.utils)

data <- fread(file = "bosch_data.csv")

# Do xgboost / LightGBM

# When dense:
# > sum(data == 0, na.rm = TRUE)
# [1] 43574349
# > sum(is.na(data))
# [1] 929125166

# Split
if (sparse == TRUE) {
  library(recommenderlab)
  gc()
  train_1 <- dropNA(as.matrix(data[1:1000000, 1:969]))
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- dropNA(as.matrix(data[1000001:1183747, 1:969]))
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
} else {
  gc()
  train_1 <- as.matrix(data[1:1000000, 1:969])
  train_2 <- data[1:1000000, 970]$Response
  gc()
  test_1 <- as.matrix(data[1000001:1183747, 1:969])
  test_2 <- data[1000001:1183747, 970]$Response
  gc()
}

# For LightGBM
# lgb.unloader(wipe = TRUE)
library(lightgbm)
train  <- lgb.Dataset(data = train_1, label = train_2)
test <- lgb.Dataset(data = test_1, label = test_2, reference = train)
# train$construct()
# test$construct()

gc()
Laurae::timer_func_print({temp_model <- lgb.train(params = params,
                                                  data = train,
                                                  nrounds = 500,
                                                  valids = list(test = test),
                                                  objective = "binary",
                                                  metric = "auc",
                                                  verbose = 2)})

perf <- as.numeric(rbindlist(temp_model$record_evals$test$auc))
max(perf)
which.max(perf)
guolinke commented 7 years ago

@Laurae2 Is num_threads working?

Laurae2 commented 7 years ago

@guolinke Yes, it works. When using 20 threads on this 40-thread machine, Task Manager reports 50% CPU usage instead of 100%.

cbecker commented 7 years ago

Thanks for pinging me. Have you tried with a lower value of pred_early_stop_margin? I usually use pred_early_stop_margin = 1.5. If you try something very low like 0.1 it should be very fast to predict but the predictions should be almost completely off. Let me know, I'll be happy to take a look if it's indeed not working.

Laurae2 commented 7 years ago

@cbecker I used pred_early_stop_margin = 0.1 and I am getting identical results as without it. Am I missing a parameter that must be set to activate early stopping for predictions?

cbecker commented 7 years ago

That's weird. I never tried the R package but there could be a bug there or in the code I added for early stopping. Can you point me to a zip file with all the files needed to run this code, including the data? I've never used R but I can take a look if I have the whole working code.

Laurae2 commented 7 years ago

@cbecker Here is a simpler example.

To run the example and install in R, you need Rtools (on Windows) plus CMake, and CMake must be on the PATH (this step is mandatory):

install.packages(c("devtools", "matrixStats"))
devtools::install_github("Microsoft/LightGBM/R-package", force = TRUE)

Then you can try this very simplified example:

library(matrixStats)

generated <- matrix(nrow = 10000, ncol = 10)
for (i in 1:10) {
  set.seed(i)
  generated[sample.int(10000, 1000, replace = FALSE), i] <- 0
}

gen_labels <- as.numeric(rowAnys(generated, value = 0, na.rm = TRUE)) # 6534
to_sort <- order(gen_labels)
generated <- generated[to_sort, ]
gen_labels <- gen_labels[to_sort]

# devtools::install_github("Microsoft/LightGBM/R-package")
# lgb.unloader(wipe = TRUE)
library(lightgbm)

dtrain <- lgb.Dataset(generated, label = gen_labels)
valids <- list(test = dtrain)

model <- lgb.train(list(objective = "binary",
                        metric = "l2",
                        min_data = 1,
                        learning_rate = 0.1,
                        pred_early_stop =  TRUE,
                        pred_early_stop_freq = 1,
                        pred_early_stop_margin = 0.1),
                   dtrain,
                   1000,
                   valids,
                   early_stopping_rounds = 1)

plot(predict(model, generated))

[image: plot of predictions from predict(model, generated)]

With pred_early_stop_margin = 0.1, it should have stopped at the first iteration; instead it kept decreasing the loss until the predictions were near perfect.

cbecker commented 7 years ago

Thanks. I think I know where the problem may come from: pred_early_stop is for prediction, and you are passing those parameters at training time. @guolinke how is this handled in R? Can we pass those parameters to the predict function?
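If the R `predict()` method forwards extra arguments as prediction-time parameters to the C API (an assumption worth checking), the call would look something like this sketch, reusing `model` and `generated` from the example above:

```r
# Sketch, assuming predict() forwards these as prediction-time parameters:
preds <- predict(model, generated,
                 pred_early_stop = TRUE,       # enable early stopping during prediction
                 pred_early_stop_freq = 1,     # check the margin at every iteration
                 pred_early_stop_margin = 0.1) # small margin: stops early, coarse predictions
```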

luyongxu commented 7 years ago

I have some related insight. Commit ac975e734d6982ad94e6394908cea3bd4bd2744d introduced a bug on my system. I installed the version immediately before that commit and compared it to the current version on master.

Reproducible example below.

library(devtools)
install_github("Microsoft/LightGBM", ref = "402474f4063aff3cef9167ecb9f4a035df2736ea", subdir = "R-package")

library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(params,
                dtrain,
                1000,
                nfold = 5,
                min_data = 1,
                learning_rate = 0.3,
                early_stopping_rounds = 10)

This produces a normal result and returns the model object.

.rs.restartR()

library(devtools)
install_github("Microsoft/LightGBM", subdir = "R-package")

library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(params,
                dtrain,
                1000,
                nfold = 5,
                min_data = 1,
                learning_rate = 0.3,
                early_stopping_rounds = 10)

But the current version on master returned this error when I ran the same code:

Error in env$model$best_score <- best_score[i] : 
  cannot add bindings to a locked environment
Laurae2 commented 7 years ago

@luyongxu The bug you have is unrelated to this one (yours is a pure R bug, while mine is an R/C++ wrapping bug or an issue in the C++ backend).

See https://github.com/Microsoft/LightGBM/pull/764 for a fix to your issue.

guolinke commented 7 years ago

@Laurae2 I think the predict function in R can accept the additional parameters as well.

Laurae2 commented 7 years ago

@cbecker With the help of @guolinke it now works.

It works if I use this to predict:

plot(predict(model, generated,
             pred_early_stop =  TRUE,
             pred_early_stop_freq = 1,
             pred_early_stop_margin = 0.1))

[image: plot of predictions with pred_early_stop_margin = 0.1]

plot(predict(model, generated,
             pred_early_stop =  TRUE,
             pred_early_stop_freq = 1,
             pred_early_stop_margin = 1))

[image: plot of predictions with pred_early_stop_margin = 1]

sugs01 commented 6 years ago

Please update the lightgbm package; your problem will be resolved.

library(devtools)
options(devtools.install.args = "--no-multiarch") # if you have 64-bit R only, you can skip this
install_github("Microsoft/LightGBM", subdir = "R-package")