microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[R-package] Error in data$update_params(params = params) : [LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle. #4019

Closed TalWac closed 3 years ago

TalWac commented 3 years ago

Description

When trying to create a model I get this error:

Error in data$update_params(params = params) : 
  [LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle.

Reproducible example

I ran the script from Retip:

devtools::install_github("rstudio/reticulate")
devtools::install_github("rstudio/tensorflow")
devtools::install_github("rstudio/keras")

library(reticulate)
library(keras)
library(tensorflow)
conda_create("myenv")
use_condaenv("myenv")
install_keras(method="conda", envname="myenv")
install_tensorflow(version = "nightly", method = "conda", envname="myenv") 
use_condaenv("myenv")

install.packages("lightgbm", repos = "https://cran.r-project.org")
library(lightgbm)
library(Retip)

#>Starts parallel computing
prep.wizard()

# import excel file for training and testing data
RP2 <- readxl::read_excel("Plasma_positive.xlsx", sheet = "lib_2", col_types = c("text", 
                                                                              "text", "text", "numeric"))
# import excel file for external validation set
RP_ext <- readxl::read_excel("Plasma_positive.xlsx", sheet = "ext", col_types = c("text", 
                                                                                  "text", "text", "numeric"))
#> or use HILIC database included in Retip
HILIC <- HILIC
#> Clean dataset from NA and low variance value
db_rt <- proc.data(descs)
preProc <- cesc(db_rt) #Build a model to use for center and scale a dataframe 
db_rt <- predict(preProc,db_rt) # use the above created model for center and scale dataframe

#> Split in training and testing using caret::createDataPartition
set.seed(101)
inTraining <- caret::createDataPartition(db_rt$XLogP, p = .8, list = FALSE)
training <- db_rt[ inTraining,]
testing  <- db_rt[-inTraining,]

When running this line, I receive the error above: lightgbm <- fit.lightgbm(training, testing)

However, if I remove max_bin = 50 from the call to lightgbm::lgb.cv() (inside the function fit.lightgbm()), or change it to max_bin = 255, then there is no error.

Environment info

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

LightGBM version or commit hash: version 3.1.1

Command(s) you used to install LightGBM

install.packages("lightgbm", repos = "https://cran.r-project.org")
library(lightgbm)

Additional Comments

Many Thanks!

jameslamb commented 3 years ago

Thanks for using {lightgbm}!

Where does the function fit.lightgbm() come from? There is no such function in {lightgbm}. Right now, your example code does not seem to contain any {lightgbm} code.

shiyu1994 commented 3 years ago

[LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle. This problem usually occurs in the following situation: first, a Dataset is created and used for training once. Then, if the user reuses the same Dataset for a second training and specifies a different max_bin value from the first training, this error is reported.

This is because, before a Dataset is used for training, the feature values are discretized into at most max_bin bins, and currently we don't support discretizing the same Dataset twice with different max_bin values.

To avoid this, just recreate the Dataset before each training.
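A minimal sketch of this failure mode and the workaround, using toy data rather than the Retip code (variable names here are made up for illustration):

```r
library(lightgbm)

# toy regression data
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- rnorm(100)

dtrain <- lightgbm::lgb.Dataset(X, label = y)
# constructing the handle bins the features with the current max_bin (default 255)
lightgbm::lgb.Dataset.construct(dtrain)

# Passing a different max_bin through the training params would now fail:
#   lightgbm::lgb.train(list(objective = "regression", max_bin = 50), dtrain, nrounds = 5)
#   -> [LightGBM] [Fatal] Cannot change max_bin after constructed Dataset handle.

# Recreating the Dataset with the desired max_bin avoids the error:
dtrain2 <- lightgbm::lgb.Dataset(X, label = y, params = list(max_bin = 50))
model <- lightgbm::lgb.train(
  params  = list(objective = "regression", metric = "rmse"),
  data    = dtrain2,
  nrounds = 5,
  verbose = -1
)
```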

TalWac commented 3 years ago

@jameslamb - Sorry for being unclear.

The fit.lightgbm() function comes from Retip (that is where all the code comes from). This is what fit.lightgbm() looks like:

> fit.lightgbm
function (training, testing) 
{
  train <- as.matrix(training)
  test <- as.matrix(testing)
  coltrain <- ncol(train)
  coltest <- ncol(test)
  dtrain <- lightgbm::lgb.Dataset(train[, 2:coltrain], label = train[,  1])
  lightgbm::lgb.Dataset.construct(dtrain)
  dtest <- lightgbm::lgb.Dataset.create.valid(dtrain, test[,2:coltest], label = test[, 1])
  valids <- list(test = dtest)
  params <- list(objective = "regression", metric = "rmse")
  modelcv <- lightgbm::lgb.cv(params, dtrain, nrounds = 5000, 
                              nfold = 10, valids, verbose = 1, early_stopping_rounds = 1000, 
                              record = TRUE, eval_freq = 1L, stratified = TRUE, max_depth = 4, 
                              max_leaf = 20, max_bin = 50)
  best.iter <- modelcv$best_iter
  params <- list(objective = "regression_l2", metric = "rmse")
  model <- lightgbm::lgb.train(params, dtrain, nrounds = best.iter, 
                               valids, verbose = 0, early_stopping_rounds = 1000, record = TRUE, 
                               eval_freq = 1L, max_depth = 4, max_leaf = 20, max_bin = 50)
  print(paste0("End training"))
  return(model)
}

If I change max_bin = 50 in the modelcv call to max_bin = 255, or remove it entirely, the function lightgbm::lgb.cv() does work. Otherwise, I get the error mentioned above.

TalWac commented 3 years ago

@shiyu1994 - Thank you for your time

Then if the user uses the same Dataset to train for the second time, and specify a different max_bin value from the first training, the error will be reported.

I do not think this is the case, since the fit.lightgbm() function looks like this:

> fit.lightgbm
function (training, testing) 
{
  train <- as.matrix(training)
  test <- as.matrix(testing)
  coltrain <- ncol(train)
  coltest <- ncol(test)
  dtrain <- lightgbm::lgb.Dataset(train[, 2:coltrain], label = train[,  1])
  lightgbm::lgb.Dataset.construct(dtrain)
  dtest <- lightgbm::lgb.Dataset.create.valid(dtrain, test[,2:coltest], label = test[, 1])
  valids <- list(test = dtest)
  params <- list(objective = "regression", metric = "rmse")
  modelcv <- lightgbm::lgb.cv(params, dtrain, nrounds = 5000, 
                              nfold = 10, valids, verbose = 1, early_stopping_rounds = 1000, 
                              record = TRUE, eval_freq = 1L, stratified = TRUE, max_depth = 4, 
                              max_leaf = 20, max_bin = 50)
  best.iter <- modelcv$best_iter
  params <- list(objective = "regression_l2", metric = "rmse")
  model <- lightgbm::lgb.train(params, dtrain, nrounds = best.iter, 
                               valids, verbose = 0, early_stopping_rounds = 1000, record = TRUE, 
                               eval_freq = 1L, max_depth = 4, max_leaf = 20, max_bin = 50)
  print(paste0("End training"))
  return(model)
}

And max_bin = 50 is constant, if I understand it correctly.

jameslamb commented 3 years ago

I see, that was the context we needed, thank you. I think that if you pass max_bin as a parameter to lgb.Dataset(), instead of to lgb.train() and lgb.cv(), your code will run successfully. @shiyu1994's explanation above shows why. Let us know if you have additional questions.
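For illustration, the relevant lines of fit.lightgbm() could be adjusted like this (a sketch, not the actual Retip source; the `training` data frame here is a hypothetical stand-in, and nrounds is reduced to keep the example quick):

```r
library(lightgbm)

# hypothetical stand-in for Retip's training data (label in column 1)
set.seed(101)
training <- data.frame(RT = rnorm(100), matrix(rnorm(500), ncol = 5))

train    <- as.matrix(training)
coltrain <- ncol(train)

# max_bin moves into the Dataset constructor, so it is fixed when the
# handle is built and lgb.cv()/lgb.train() never try to change it later
dtrain <- lightgbm::lgb.Dataset(
  train[, 2:coltrain],
  label  = train[, 1],
  params = list(max_bin = 50)
)

# training params no longer contain max_bin
params <- list(objective = "regression", metric = "rmse",
               max_depth = 4, max_leaf = 20)
modelcv <- lightgbm::lgb.cv(params, dtrain, nrounds = 50, nfold = 10,
                            early_stopping_rounds = 10, verbose = -1)
```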

TalWac commented 3 years ago

@jameslamb and @shiyu1994 Thank you for the clear explanations. Following @jameslamb's last comment, it now works.

Many Thanks!!

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.